Commit 26ce503 by alozowski
1 parent: 74d43dc

build layout changes

dist/index.html (+6 −11)
dist/index.html
CHANGED
@@ -49,11 +49,6 @@
 </d-front-matter>
 <d-title>
 <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
-<div id="title-plot" class="l-body l-screen">
-<figure>
-<img src="assets/images/banner.png" alt="Banner">
-</figure>
-</div>
 </d-title>
 <d-byline></d-byline>
 <d-article>
@@ -216,7 +211,7 @@
 
 <h3>What do the rankings look like?</h3>
 
-<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen-2-72B instruct, Meta’s Llama3-70B
+<p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen-2-72B instruct, Meta’s Llama3-70B instruct, 01-ai’s Yi-1.5-34B chat, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
 <p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long range reasoning, and knowledge)</p>
 <p>The current second best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained version counterpart on GPQA, which begs the question whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate level knowledge.</p>
 <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4K-instruct model shows a performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi or an independant reproduction from an external team with open training recipes/datasets.</p>
@@ -272,7 +267,7 @@
 <div class="main-plot-container">
 <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
 <div id="ranking">
-<iframe src="assets/scripts/rankings_change.html" title="description", height="
+<iframe src="assets/scripts/rankings_change.html" title="description", height="500" width="100%", style="border:none;"></iframe>
 </div>
 </div>
 
@@ -318,12 +313,12 @@
 <div class="main-plot-container">
 <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
 <div id="math">
-<iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="
+<iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="400" width="100%", style="border:none;"></iframe>
 </div>
 </div>
 
-<p>
-<p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
+<p>The green dots highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
+<p>From our current dive in the outputs and behaviors of models, chat versions of base models sometimes have a considerably lower score than the original models on MATH! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
 <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with 10K and plus of context size, and it seems discriminative enough to target long context reasoning specifically.</p>
 <p>Let’s conclude with a look at the future of Open LLM leaderboard!</p>
 
@@ -335,7 +330,7 @@
 <div class="main-plot-container">
 <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
 <div id="timewise">
-<iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="
+<iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="450" width="100%", style="border:none;"></iframe>
 </div>
 </div>
 