osanseviero (HF staff) committed on
Commit
04acec6
1 Parent(s): 863f8f5

Update src/index.html

Files changed (1)
  1. src/index.html +10 -10
src/index.html CHANGED
@@ -203,13 +203,13 @@
203
 
204
  <h2>New leaderboard, new results!</h2>
205
  <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
206
- <aside>As the cluster has been extremely full, you can expect models to keep on appearing in the next days! </aside>
207
 
208
  <h3>What do the rankings look like?</h3>
209
 
210
- <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen-2-72B instruct, Meta’s Llama3-70B instruct, 01-ai’s Yi-1.5-34B chat, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
211
- <p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long range reasoning, and knowledge)</p>
212
- <p>The current second best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained version counterpart on GPQA, which begs the question whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate level knowledge.</p>
213
  <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4K-instruct model shows performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
214
 
215
  <table>
@@ -267,12 +267,12 @@
267
  </div>
268
  </div>
269
 
270
- <p>Let’s finish with some food for thoughts and advices from the maintainer’s team.</p>
271
 
272
- <h3>Which evaluations should you pay most attention to?</h3>
273
- <p>Depending on your practical use case, you should probably focus on various aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you might be more interested in specific capabilities.</p>
274
 
275
- <p>In particular, we observed that our different evaluations results are not always correlated with one another as illustrated on this correlation matrice:</p>
276
 
277
  <div class="main-plot-container">
278
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
@@ -281,8 +281,8 @@
281
  </div>
282
  </div>
283
 
284
- <p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSyss chatbot arena).</p>
285
- <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>
286
 
287
  <div class="main-plot-container">
288
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
 
203
 
204
  <h2>New leaderboard, new results!</h2>
205
  <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
206
+ <aside>As the cluster has been busy, you can expect models to keep appearing over the next few days!</aside>
207
 
208
  <h3>What do the rankings look like?</h3>
209
 
210
+ <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard and comparing them with this updated version, some models appear to have a relatively stable ranking (in bold below): Qwen2-72B-Instruct, Meta’s Llama-3-70B-Instruct, 01-ai’s Yi-1.5-34B-Chat, Cohere’s Command R+, and lastly AbacusAI’s Smaug-72B.</p>
211
+ <p>We’ve been particularly impressed by Qwen2-72B-Instruct, one step above other models (notably thanks to its performance in math, long-range reasoning, and knowledge).</p>
212
+ <p>The current second-best model, Llama-3-70B-Instruct, interestingly loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate-level knowledge.</p>
213
  <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 3rd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4K-instruct model shows performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
214
 
215
  <table>
 
267
  </div>
268
  </div>
269
 
270
+ <p>Let’s finish with some food for thought and advice from the maintainers’ team.</p>
271
 
272
+ <h3>Which evaluations should you pay the most attention to?</h3>
273
+ <p>Depending on your practical use case, you may want to focus on different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you might be more interested in specific capabilities.</p>
274
 
275
+ <p>In particular, we observed that our different evaluation results are not always correlated with one another, as illustrated in this correlation matrix:</p>
276
 
277
  <div class="main-plot-container">
278
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
 
281
  </div>
282
  </div>
283
 
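  <p>For readers who want to build this kind of view themselves, a correlation matrix like the one above can be computed from a simple table of per-model benchmark scores. The sketch below is a minimal, hypothetical pandas example: the <code>scores.csv</code> file and its column layout are assumptions for illustration, not the leaderboard’s actual data export or analysis code.</p>
  <pre><code>
import pandas as pd

# Hypothetical input: one row per model, one column per benchmark score.
# "scores.csv" is a placeholder path, not an actual leaderboard artifact.
scores = pd.read_csv("scores.csv", index_col="model")

# Pairwise Pearson correlation between benchmarks, computed across models.
corr = scores[["MMLU-Pro", "BBH", "GPQA", "IFEval"]].corr(method="pearson")
print(corr.round(2))
  </code></pre>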
284
+ <p>As you can see, MMLU-Pro and BBH are rather well correlated. As other teams have noted, these benchmarks are also quite correlated with human preference (for instance, they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
285
+ <p>Another of our benchmarks, IFEval, targets chat capabilities. It investigates whether models can follow precise instructions. However, the format used in this benchmark tends to favor chat and instruction-tuned models, with pretrained models having a harder time reaching high scores.</p>
286
 
287
  <div class="main-plot-container">
288
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>