alozowski committed on
Commit 26ce503
1 Parent(s): 74d43dc

build layout changes

Files changed (1)
  1. dist/index.html +6 -11
dist/index.html CHANGED
@@ -49,11 +49,6 @@
   </d-front-matter>
   <d-title>
     <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
-    <div id="title-plot" class="l-body l-screen">
-      <figure>
-        <img src="assets/images/banner.png" alt="Banner">
-      </figure>
-    </div>
   </d-title>
   <d-byline></d-byline>
   <d-article>
@@ -216,7 +211,7 @@
 
   <h3>What do the rankings look like?</h3>
 
-  <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen-2-72B instruct, Meta’s Llama3-70B, both instruct and base version, 01-ai’s Yi-1.5-34B, chat version, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
+  <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen-2-72B instruct, Meta’s Llama3-70B instruct, 01-ai’s Yi-1.5-34B chat, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
   <p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long range reasoning, and knowledge)</p>
   <p>The current second best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained version counterpart on GPQA, which begs the question whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate level knowledge.</p>
   <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4K-instruct model shows a performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi or an independant reproduction from an external team with open training recipes/datasets.</p>
@@ -272,7 +267,7 @@
   <div class="main-plot-container">
     <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
     <div id="ranking">
-      <iframe src="assets/scripts/rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
+      <iframe src="assets/scripts/rankings_change.html" title="description", height="500" width="100%", style="border:none;"></iframe>
     </div>
   </div>
 
@@ -318,12 +313,12 @@
   <div class="main-plot-container">
     <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
     <div id="math">
-      <iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+      <iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="400" width="100%", style="border:none;"></iframe>
     </div>
   </div>
 
-  <p>In the green box, we highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive in the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
-  <p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
+  <p>The green dots highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. The red dots show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5.</p>
+  <p>From our current dive in the outputs and behaviors of models, chat versions of base models sometimes have a considerably lower score than the original models on MATH! This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
   <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with 10K and plus of context size, and it seems discriminative enough to target long context reasoning specifically.</p>
   <p>Let’s conclude with a look at the future of Open LLM leaderboard!</p>
 
@@ -335,7 +330,7 @@
   <div class="main-plot-container">
     <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
     <div id="timewise">
-      <iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="650" width="100%", style="border:none;"></iframe>
+      <iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="450" width="100%", style="border:none;"></iframe>
     </div>
   </div>
 
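
Aside: the change above adjusts each embed by hand-picking a new pixel height per iframe. A minimal sketch of an alternative, not part of this commit, would be to size the embeds from a CSS aspect-ratio so the height follows the container width; the "plot-embed" class name below is hypothetical, and the surrounding markup is copied from the ranking embed in the diff.

<!-- Hypothetical sketch (not in this commit): derive the iframe height from the
     container width via CSS aspect-ratio instead of a hardcoded height attribute. -->
<style>
  .plot-embed {
    width: 100%;
    aspect-ratio: 16 / 9; /* height scales with the width of .main-plot-container */
    border: none;
  }
</style>
<div class="main-plot-container">
  <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
  <div id="ranking">
    <iframe class="plot-embed" src="assets/scripts/rankings_change.html" title="Rankings change"></iframe>
  </div>
</div>

Fixed heights, as in the committed version, do keep each plot at its intended size regardless of viewport width, which may be exactly the behavior this layout change is after.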