Clémentine committed on
Commit
f3ab0dd
1 Parent(s): b76606d

added thom's edits

Files changed (2)
  1. dist/index.html +39 -32
  2. src/index.html +39 -32
dist/index.html CHANGED
@@ -191,28 +191,31 @@
191
  </li>
192
  </ul>
193
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
194
- <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
195
- <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
196
- <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>
197
 
198
  <h3>Voting on model relevance</h3>
199
- <p>For the Open LLM Leaderboard v1, evaluations were run in a “first come, first served” manner. However, some users were submitting many new LLMs at once, blocking the queue for the rest of the community with experimental or low quality models.</p>
200
- <p>As the Open LLM Leaderboard is running on the spare cycles of the Hugging Face science cluster, our automatic evaluations can only take place when nodes are free. Any other job has a higher priority over our evaluations. When a new model is training or a dataset is brewing, users sometimes need to wait at least a couple of days, sometimes longer, for evaluations to be run. (But then, they get a cool model or dataset from our research team, like Idefics or FineWeb-Edu)!</p>
201
- <p>For the Open LLM Leaderboard v2, we have introduced a voting system for submitted models. It will prioritize running models with the most votes first, and if a model has an extremely high number of votes when the cluster is full, we’ll consider running it manually.</p>
202
- <p>For accountability, we request users who vote to be connected to their Hugging Face account, and we store all the votes. This will therefore prioritize models that the community is enthusiastic about, no matter their origin.</p>
203
-
204
  <h3>Better and simpler interface</h3>
205
- <p>Our regular users might have noticed that in the last month, our front end became much faster.</p>
206
- <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client side, which makes any column selection or search virtually instantaneous!</p>
207
- <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding the tabs, and it was crowding the interface. They are now in their own dedicated documentation page, that you can find here! # Results!</p>
208
- <p>For the version 2, we made the choice to initialize the leaderboard with the maintainer’s choice models only to start. But as always, submissions are open!</p>
209
-
210
  <h2>New leaderboard, new results!</h2>
 
211
 
212
- <h3>What about the rankings?</h3>
213
 
214
- <p>When looking at the top 10 of the Open LLM Leaderboard, and comparing the v2 and v1, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B, both instruct and base version, 01-ai’s Yi-1.5-34B, chat version, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
215
- <table>
 
 
 
 
216
  <tr>
217
  <th>Rank</th>
218
  <th>Leaderboard v1</th>
@@ -282,11 +285,13 @@
282
  </div>
283
  </div>
284
 
285
- <h3>Which evaluations should you pay most attention to?</h3>
286
- <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
287
 
288
- <p>For example, our different evaluations results are not all correlated with one another, which is expected.</p>
 
289
 
 
 
290
  <div class="main-plot-container">
291
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
292
  <div id="heatmap">
@@ -294,9 +299,8 @@
294
  </div>
295
  </div>
296
 
297
- <p>MMLU-Pro, BBH and ARC-challenge are well correlated together. It is known that these 3 are well correlated with human preference (as they tend to align with human judgment on LMSys’s chatbot arena).</p>
298
-
299
- <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format discriminates against chat or instruction tuned models, with pretrained models having a harder time performing as well.</p>
300
 
301
  <div class="l-body">
302
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
@@ -304,7 +308,8 @@
304
  </div>
305
 
306
 
307
- <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
 
308
 
309
  <div class="l-body">
310
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
@@ -312,30 +317,32 @@
312
  </div>
313
 
314
 
315
- <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
 
316
 
317
  <div class="l-body">
318
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
319
  <div id="math"></div>
320
  </div>
321
 
322
- <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard, but now have good scores on MATH-Level5 (mostly models from 01-ai) - it’s likely they were penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5.From our current observations, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).This seems to imply that some chat tuning can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
323
-
324
  <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with a context size of 10K and above, and it seems discriminative enough to target long context reasoning specifically.</p>
 
325
 
326
  <h2>What’s next?</h2>
327
- <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
328
- <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
 
329
 
330
  <div class="l-body">
331
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
332
  <div id="timewise"></div>
333
  </div>
334
 
335
-
336
- <p>When looking at the evolution of all submitted models on the Open LLM Leaderboard v1 through time, we observe a trend where we go from bigger (red dots) to smaller (yellow dots), but better performing models.</p>
337
- <p>We hope that we will observe similar patterns of progress with the leaderboard v2, where our starting point is much lower (black dots).</p>
338
-
339
 
340
  </d-article>
341
 
 
191
  </li>
192
  </ul>
193
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
194
+ <p>In this list, you’ll find LLMs from model creators with access to a lot of compute power, such as Meta, Google, Cohere or Mistral, as well as well-known collectives like EleutherAI or NousResearch, and power users of the Hugging Face hub, among others.</p>
195
+ <p>We plan to make this list evolve based on community suggestions and our own observations, and will aim to include SOTA LLMs as they come out, continuing to evaluate these models in priority.</p>
196
+ <p>We hope it will also make it easier for non-ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>
197
 
198
  <h3>Voting on model relevance</h3>
199
+ <p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a “first submitted, first evaluated” manner. With users sometimes submitting many LLM variants at once, and the Open LLM Leaderboard running on the limited compute of spare cycles on the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models, and we will prioritize running the models with the most votes first, hopefully surfacing the most awaited models at the top of the priority stack. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually in place of other internal jobs at Hugging Face.</p>
200
+ <p>To avoid spamming the vote system, users will need to be connected to their Hugging Face account to vote, and we will save the votes. We hope this system will help us prioritize models that the community is enthusiastic about.</p>
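<p>To make the mechanism concrete, here is a minimal, purely hypothetical sketch of vote-based prioritization; the class and function names below are illustrative and this is not the leaderboard’s actual backend code.</p>
<pre><code class="language-python">
from dataclasses import dataclass, field

@dataclass
class Submission:
    model_id: str
    submitted_at: int                        # Unix timestamp of submission
    votes: set = field(default_factory=set)  # Hugging Face usernames; one vote per account

    def add_vote(self, username: str) -> None:
        # Voters must be logged in with a Hugging Face account; a set keeps one vote per user.
        self.votes.add(username)

def prioritize(queue: list) -> list:
    # Most-voted models first; ties broken by submission time (oldest first).
    return sorted(queue, key=lambda s: (-len(s.votes), s.submitted_at))

# Example: the later submission gathers more votes and jumps ahead in the queue.
a = Submission("org/model-a", submitted_at=1)
b = Submission("org/model-b", submitted_at=2)
b.add_vote("user1")
b.add_vote("user2")
print([s.model_id for s in prioritize([a, b])])  # ['org/model-b', 'org/model-a']
</code></pre>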
201
+ <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
202
+
 
203
  <h3>Better and simpler interface</h3>
204
+ <p>If you’re among our regular users, you may have noticed that our front end became much faster over the last month.</p>
205
+ <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client-side, which makes any column selection or search virtually instantaneous! It’s also a component that you can reuse in your own leaderboard!</p>
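<p>As a rough illustration, here is a minimal sketch of embedding such a component in a Gradio app. The package name (<code>gradio_leaderboard</code>) and the parameters shown are recalled from the component’s documentation and should be double-checked there; the DataFrame contents are purely illustrative.</p>
<pre><code class="language-python">
import gradio as gr
import pandas as pd

# Assumes the component is installed with `pip install gradio_leaderboard`;
# check the component's page for the exact API and available options.
from gradio_leaderboard import Leaderboard

scores = pd.DataFrame(
    {"Model": ["org/model-a", "org/model-b"], "Average": [51.2, 47.9], "MMLU-Pro": [45.0, 38.4]}
)

with gr.Blocks() as demo:
    # The data is sent to the browser once, so searching and column selection run client-side.
    # `search_columns` is taken from the component's README and may differ in newer versions.
    Leaderboard(value=scores, search_columns=["Model"])

demo.launch()
</code></pre>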
206
+ <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
207
+
 
208
  <h2>New leaderboard, new results!</h2>
209
+ <p>We’ve started by adding and evaluating the models in the “maintainer’s choice” category (cf. above), and we are looking forward to the community submitting their new models to this new version of the leaderboard!</p>
210
 
211
+ <h3>What do the rankings look like?</h3>
212
 
213
+ <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B (both the instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
214
+ <p>We’ve been particularly impressed by Llama3-70B-Instruct, which ranks at the top of many evaluations (even though this instruct version loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate-level knowledge).</p>
215
+ <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 2nd place despite its smaller size: with only 14B parameters, Microsoft’s Phi-3-medium-4k-instruct model shows performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
216
+ <p>Here is a detailed view of the changes in rankings:</p>
217
+
218
+ <table>
219
  <tr>
220
  <th>Rank</th>
221
  <th>Leaderboard v1</th>
 
285
  </div>
286
  </div>
287
 
288
+ <p>Let’s finish with some food for thought and advice from the maintainers’ team.</p>
 
289
 
290
+ <h3>Which evaluations should you pay most attention to?</h3>
291
+ <p>Depending on your practical use case, you should probably focus on different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you might be more interested in specific capabilities.</p>
292
 
293
+ <p>In particular, we observed that our different evaluation results are not always correlated with one another, as illustrated in this correlation matrix:</p>
294
+
295
  <div class="main-plot-container">
296
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
297
  <div id="heatmap">
 
299
  </div>
300
  </div>
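<p>As a side note, here is a minimal sketch of how such a correlation matrix can be computed with pandas, assuming a DataFrame with one row per model and one column per benchmark; the model names and scores below are purely illustrative, not actual leaderboard results.</p>
<pre><code class="language-python">
import pandas as pd

# Hypothetical per-model scores, one column per benchmark (values are made up).
scores = pd.DataFrame(
    {
        "MMLU-Pro": [45.2, 33.1, 27.8, 51.0],
        "BBH": [48.9, 35.4, 30.2, 55.3],
        "ARC-challenge": [61.0, 52.3, 47.1, 66.8],
        "IFEval": [77.5, 40.2, 25.6, 80.1],
    },
    index=["model-a", "model-b", "model-c", "model-d"],
)

# Pairwise correlation between benchmarks; Spearman (rank-based) is robust to scale differences.
correlations = scores.corr(method="spearman")
print(correlations.round(2))
</code></pre>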
301
 
302
+ <p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated with one another. As other teams have also noted, these 3 benchmarks are also quite correlated with human preference (for instance, they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
303
+ <p>Another of our benchmarks, IFEval, targets chat capabilities: it investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction-tuned models, with pretrained models having a harder time reaching high performance.</p>
 
304
 
305
  <div class="l-body">
306
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
 
308
  </div>
309
 
310
 
311
+ <p>If you are especially interested in model knowledge rather than alignment or chat capabilities, the most relevant evaluations for you will likely be MMLU-Pro and GPQA.</p>
312
+ <p>Let’s see how performance on these updated benchmarks compares to our evaluations on the previous version of the leaderboard.</p>
313
 
314
  <div class="l-body">
315
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
 
317
  </div>
318
 
319
 
320
+ <p>As we can see, both MMLU-Pro scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for models to improve, which is great news :)</p>
321
+ <p>MATH-Lvl5 is, obviously, interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers, as we can see in the following figure.</p>
322
 
323
  <div class="l-body">
324
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
325
  <div id="math"></div>
326
  </div>
327
 
328
+ <p>In the green box, we highlight models which previously scored 0 on GSM8K due to the evaluation limitations mentioned above, but now have very decent scores on the new MATH-Lvl5 benchmark. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box, we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive into the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
329
+ <p>This observation seems to imply that some chat fine-tuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
330
  <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with a context size of 10K and above, and it seems discriminative enough to target long context reasoning specifically.</p>
331
+ <p>Let’s conclude with a look at the future of the Open LLM Leaderboard!</p>
332
 
333
  <h2>What’s next?</h2>
334
+ <p>Much like the first version of the Open LLM Leaderboard pushed a community approach to model development during the past year, we hope that the new version 2 will be a milestone for open and reproducible model evaluation.</p>
335
+ <p>Because backward compatibility and open knowledge are important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
336
+ <p>Taking a step back to look at the evolution of all the 7400 models evaluated on the Open LLM Leaderboard through time, we can note some much broader trends in the field! For instance, we see a strong shift from larger models (red dots) to smaller models (yellow dots), while performance improves at the same time.</p>
337
 
338
  <div class="l-body">
339
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
340
  <div id="timewise"></div>
341
  </div>
342
 
343
+ <p>This is great news for the field, as smaller models are much easier to embed and much more energy-, memory-, and compute-efficient. We hope to observe a similar pattern of progress in the new version of the leaderboard. Given our harder benchmarks, our starting point is for now much lower (black dots), so let’s see where the field takes us a few months from now :)</p>
344
+ <p>If you’ve read to this point, thanks a lot! We hope you’ll enjoy this new version of the Open LLM Leaderboard. May the open-source winds push our LLM boats to sail far away on the sea of deep learning.</p>
345
+
 
346
 
347
  </d-article>
348
 
src/index.html CHANGED
@@ -191,28 +191,31 @@
191
  </li>
192
  </ul>
193
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
194
- <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
195
- <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
196
- <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>
197
 
198
  <h3>Voting on model relevance</h3>
199
- <p>For the Open LLM Leaderboard v1, evaluations were run in a “first come, first served” manner. However, some users were submitting many new LLMs at once, blocking the queue for the rest of the community with experimental or low quality models.</p>
200
- <p>As the Open LLM Leaderboard is running on the spare cycles of the Hugging Face science cluster, our automatic evaluations can only take place when nodes are free. Any other job has a higher priority over our evaluations. When a new model is training or a dataset is brewing, users sometimes need to wait at least a couple of days, sometimes longer, for evaluations to be run. (But then, they get a cool model or dataset from our research team, like Idefics or FineWeb-Edu)!</p>
201
- <p>For the Open LLM Leaderboard v2, we have introduced a voting system for submitted models. It will prioritize running models with the most votes first, and if a model has an extremely high number of votes when the cluster is full, we’ll consider running it manually.</p>
202
- <p>For accountability, we request users who vote to be connected to their Hugging Face account, and we store all the votes. This will therefore prioritize models that the community is enthusiastic about, no matter their origin.</p>
203
-
204
  <h3>Better and simpler interface</h3>
205
- <p>Our regular users might have noticed that in the last month, our front end became much faster.</p>
206
- <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client side, which makes any column selection or search virtually instantaneous!</p>
207
- <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding the tabs, and it was crowding the interface. They are now in their own dedicated documentation page, that you can find here! # Results!</p>
208
- <p>For the version 2, we made the choice to initialize the leaderboard with the maintainer’s choice models only to start. But as always, submissions are open!</p>
209
-
210
  <h2>New leaderboard, new results!</h2>
 
211
 
212
- <h3>What about the rankings?</h3>
213
 
214
- <p>When looking at the top 10 of the Open LLM Leaderboard, and comparing the v2 and v1, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B, both instruct and base version, 01-ai’s Yi-1.5-34B, chat version, Cohere’s Command R + model, and lastly Smaug-72B, from AbacusAI.</p>
215
- <table>
 
 
 
 
216
  <tr>
217
  <th>Rank</th>
218
  <th>Leaderboard v1</th>
@@ -282,11 +285,13 @@
282
  </div>
283
  </div>
284
 
285
- <h3>Which evaluations should you pay most attention to?</h3>
286
- <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
287
 
288
- <p>For example, our different evaluations results are not all correlated with one another, which is expected.</p>
 
289
 
 
 
290
  <div class="main-plot-container">
291
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
292
  <div id="heatmap">
@@ -294,9 +299,8 @@
294
  </div>
295
  </div>
296
 
297
- <p>MMLU-Pro, BBH and ARC-challenge are well correlated together. It is known that these 3 are well correlated with human preference (as they tend to align with human judgment on LMSys’s chatbot arena).</p>
298
-
299
- <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format discriminates against chat or instruction tuned models, with pretrained models having a harder time performing as well.</p>
300
 
301
  <div class="l-body">
302
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
@@ -304,7 +308,8 @@
304
  </div>
305
 
306
 
307
- <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
 
308
 
309
  <div class="l-body">
310
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
@@ -312,30 +317,32 @@
312
  </div>
313
 
314
 
315
- <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
 
316
 
317
  <div class="l-body">
318
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
319
  <div id="math"></div>
320
  </div>
321
 
322
- <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard, but now have good scores on MATH-Level5 (mostly models from 01-ai) - it’s likely they were penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5.From our current observations, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).This seems to imply that some chat tuning can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
323
-
324
  <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with a context size of 10K and above, and it seems discriminative enough to target long context reasoning specifically.</p>
 
325
 
326
  <h2>What’s next?</h2>
327
- <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
328
- <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
 
329
 
330
  <div class="l-body">
331
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
332
  <div id="timewise"></div>
333
  </div>
334
 
335
-
336
- <p>When looking at the evolution of all submitted models on the Open LLM Leaderboard v1 through time, we observe a trend where we go from bigger (red dots) to smaller (yellow dots), but better performing models.</p>
337
- <p>We hope that we will observe similar patterns of progress with the leaderboard v2, where our starting point is much lower (black dots).</p>
338
-
339
 
340
  </d-article>
341
 
 
191
  </li>
192
  </ul>
193
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
194
+ <p>In this list, you’ll find LLMs from model creators with access to a lot of compute power, such as Meta, Google, Cohere or Mistral, as well as well-known collectives like EleutherAI or NousResearch, and power users of the Hugging Face hub, among others.</p>
195
+ <p>We plan to make this list evolve based on community suggestions and our own observations, and will aim to include SOTA LLMs as they come out, continuing to evaluate these models in priority.</p>
196
+ <p>We hope it will also make it easier for non-ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>
197
 
198
  <h3>Voting on model relevance</h3>
199
+ <p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a “first submitted, first evaluated” manner. With users sometimes submitting many LLM variants at once, and the Open LLM Leaderboard running on the limited compute of spare cycles on the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models, and we will prioritize running the models with the most votes first, hopefully surfacing the most awaited models at the top of the priority stack. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually in place of other internal jobs at Hugging Face.</p>
200
+ <p>To avoid spamming the vote system, users will need to be connected to their Hugging Face account to vote, and we will save the votes. We hope this system will help us prioritize models that the community is enthusiastic about.</p>
201
+ <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
202
+
 
203
  <h3>Better and simpler interface</h3>
204
+ <p>If you’re among our regular users, you may have noticed that our front end became much faster over the last month.</p>
205
+ <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client-side, which makes any column selection or search virtually instantaneous! It’s also a component that you can reuse in your own leaderboard!</p>
206
+ <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
207
+
 
208
  <h2>New leaderboard, new results!</h2>
209
+ <p>We’ve started by adding and evaluating the models in the “maintainer’s choice” category (cf. above), and we are looking forward to the community submitting their new models to this new version of the leaderboard!</p>
210
 
211
+ <h3>What do the rankings look like?</h3>
212
 
213
+ <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard, and comparing with this updated version, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B (both the instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
214
+ <p>We’ve been particularly impressed by Llama3-70B-Instruct, which ranks at the top of many evaluations (even though this instruct version loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team on this model affected some expert/graduate-level knowledge).</p>
215
+ <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 2nd place despite its smaller size: with only 14B parameters, Microsoft’s Phi-3-medium-4k-instruct model shows performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
216
+ <p>Here is a detailed view of the changes in rankings:</p>
217
+
218
+ <table>
219
  <tr>
220
  <th>Rank</th>
221
  <th>Leaderboard v1</th>
 
285
  </div>
286
  </div>
287
 
288
+ <p>Let’s finish with some food for thought and advice from the maintainers’ team.</p>
 
289
 
290
+ <h3>Which evaluations should you pay most attention to?</h3>
291
+ <p>Depending on your practical use case, you should probably focus on different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you might be more interested in specific capabilities.</p>
292
 
293
+ <p>In particular, we observed that our different evaluation results are not always correlated with one another, as illustrated in this correlation matrix:</p>
294
+
295
  <div class="main-plot-container">
296
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
297
  <div id="heatmap">
 
299
  </div>
300
  </div>
301
 
302
+ <p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated with one another. As other teams have also noted, these 3 benchmarks are also quite correlated with human preference (for instance, they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
303
+ <p>Another of our benchmarks, IFEval, targets chat capabilities: it investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction-tuned models, with pretrained models having a harder time reaching high performance.</p>
 
304
 
305
  <div class="l-body">
306
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
 
308
  </div>
309
 
310
 
311
+ <p>If you are especially interested in model knowledge rather than alignment or chat capabilities, the most relevant evaluations for you will likely be MMLU-Pro and GPQA.</p>
312
+ <p>Let’s see how performance on these updated benchmarks compares to our evaluations on the previous version of the leaderboard.</p>
313
 
314
  <div class="l-body">
315
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
 
317
  </div>
318
 
319
 
320
+ <p>As we can see, both MMLU-Pro scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for models to improve, which is great news :)</p>
321
+ <p>MATH-Lvl5 is, obviously, interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers, as we can see in the following figure.</p>
322
 
323
  <div class="l-body">
324
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
325
  <div id="math"></div>
326
  </div>
327
 
328
+ <p>In the green box, we highlight models which previously scored 0 on GSM8K due to the evaluation limitations mentioned above, but now have very decent scores on the new MATH-Lvl5 benchmark. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box, we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive into the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
329
+ <p>This observation seems to imply that some chat fine-tuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
330
  <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with a context size of 10K and above, and it seems discriminative enough to target long context reasoning specifically.</p>
331
+ <p>Let’s conclude with a look at the future of the Open LLM Leaderboard!</p>
332
 
333
  <h2>What’s next?</h2>
334
+ <p>Much like the first version of the Open LLM Leaderboard pushed a community approach to model development during the past year, we hope that the new version 2 will be a milestone for open and reproducible model evaluation.</p>
335
+ <p>Because backward compatibility and open knowledge are important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
336
+ <p>Taking a step back to look at the evolution of all the 7400 models evaluated on the Open LLM Leaderboard through time, we can note some much broader trends in the field! For instance, we see a strong shift from larger models (red dots) to smaller models (yellow dots), while performance improves at the same time.</p>
337
 
338
  <div class="l-body">
339
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
340
  <div id="timewise"></div>
341
  </div>
342
 
343
+ <p>This is great news for the field, as smaller models are much easier to embed and much more energy-, memory-, and compute-efficient. We hope to observe a similar pattern of progress in the new version of the leaderboard. Given our harder benchmarks, our starting point is for now much lower (black dots), so let’s see where the field takes us a few months from now :)</p>
344
+ <p>If you’ve read to this point, thanks a lot! We hope you’ll enjoy this new version of the Open LLM Leaderboard. May the open-source winds push our LLM boats to sail far away on the sea of deep learning.</p>
345
+
 
346
 
347
  </d-article>
348