Commit 96f6035
Parent(s): 04acec6
Update src/index.html

src/index.html CHANGED (+8 -8)
@@ -303,8 +303,8 @@
 </div>
 
 
-<p>As we can see, both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for model to improve – which is great news :)</p>
-<p>MATH-Lvl5 is […]
+<p>As we can see, both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for the model to improve – which is great news :)</p>
+<p>MATH-Lvl5 is obviously interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers, as we can see in the following figure.</p>
 
 <div class="main-plot-container">
 <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
@@ -313,15 +313,15 @@
 </div>
 </div>
 
-<p>The green dots highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above […]
+<p>The green dots highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
 <p>From our current dive in the outputs and behaviors of models, chat versions of base models sometimes have a considerably lower score than the original models on MATH! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
-<p>MuSR, our last evaluation, is particularly interesting for long […]
+<p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with 10K and plus of context size, and it seems discriminative enough to target long context reasoning specifically.</p>
 <p>Let’s conclude with a look at the future of Open LLM leaderboard!</p>
 
 <h2>What’s next?</h2>
 <p>Much like the first version of the Open LLM Leaderboard pushed a community approach to model development during the past year, we hope that the new version 2 will be a milestone of open and reproducible model evaluations.</p>
-<p>Because backward compatibility and open knowledge […]
-<p>Taking a step back to look at the evolution of all the 7400 evaluated models on the Open LLM Leaderboard through time, we can note some much wider trends in the field! For instance we see a strong trend going from larger (red dots) models to smaller (yellow dots) models […]
+<p>Because backward compatibility and open knowledge are important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
+<p>Taking a step back to look at the evolution of all the 7400 evaluated models on the Open LLM Leaderboard through time, we can note some much wider trends in the field! For instance, we see a strong trend going from larger (red dots) models to smaller (yellow dots) models while at the same time improving performance.</p>
 
 <div class="main-plot-container">
 <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
@@ -330,8 +330,8 @@
 </div>
 </div>
 
-<p>This is great news for the field as smaller models are much easier to embedded […]
-<p>If you’ve read to this point, thanks a lot […]
+<p>This is great news for the field as smaller models are much easier to be embedded and much more energy/memory/compute efficient, and we hope to observe a similar pattern of progress in the new version of the leaderboard. Given our harder benchmarks, our starting point is much lower (black dots), so let’s see where the field takes us in a few months from now :)</p>
+<p>If you’ve read to this point, thanks a lot. We hope you’ll enjoy this new version of the Open LLM Leaderboard. May the open-source winds push our LLMs boats to sail far away on the sea of deep learning.</p>
 
 
 </d-article>