Nathan Habib committed on
Commit
7375a0d
1 Parent(s): d9cbfab
dist/assets/scripts/avg_ifeval_vs_all.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/correlation_heatmap.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/math_vs_avg_all.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/math_vs_gsm8k.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/model_size_vs_perf.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/normalized_vs_raw.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/nwe_scores_vs_old.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/plot.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/rankings_change.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -295,34 +295,40 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
  <div id="heatmap">
- <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="550" width="100%", style="border:none;"></iframe>
  </div>
  </div>

  <p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
  <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
- <div id="ifeval"></div>
+ <div id="ifeval">
+ <iframe src="assets/scripts/avg_ifeval_vs_all.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>


  <p>If you are especially interested in model knowledge rather than alignment or chat capabilities, the most relevant evaluations for you will likely be MMLU-Pro and GPQA.</p>
  <p>Let’s see how performances on these updated benchmarks compare to our evaluation on the previous version of the leaderboard.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
- <div id="mmlu"></div>
+ <div id="mmlu">
+ <iframe src="assets/scripts/nwe_scores_vs_old.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>


  <p>As we can see, both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for model to improve – which is great news :)</p>
  <p>MATH-Lvl5 is, obviously, interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers as we can see on the following figure.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
- <div id="math"></div>
+ <div id="math">
+ <iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>In the green box, we highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive in the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
@@ -335,9 +341,11 @@
  <p>Because backward compatibility and open knowledge is important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
  <p>Taking a step back to look at the evolution of all the 7400 evaluated models on the Open LLM Leaderboard through time, we can note some much wider trends in the field! For instance we see a strong trend going from larger (red dots) models to smaller (yellow dots) models, while at the same time improving performance.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
- <div id="timewise"></div>
+ <div id="timewise">
+ <iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>This is great news for the field as smaller models are much easier to embedded as well as much more energy/memory/compute efficient and we hope to observe a similar pattern of progress in the new version of the leaderboard Given our harder benchmarks, our starting point is for now much lower (black dots) so let’s see where the field take us in a few months from now :)</p>
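The hunks above swap the static placeholder divs (heatmap, ifeval, mmlu, math, timewise) for iframe embeds pointing at the standalone plot pages added under dist/assets/scripts/. Below is a rough sketch of how one such self-contained page, e.g. correlation_heatmap.html, could be generated and written to the path the iframe loads; it assumes the figures are Plotly exports, which the commit does not state, and the benchmark names and scores are made-up placeholders.

```python
# Hypothetical sketch (not from the commit): build a standalone, embeddable
# correlation heatmap with Plotly and write it to the file the iframe loads.
# Benchmark names and score values below are illustrative placeholders only.
import pandas as pd
import plotly.express as px

scores = pd.DataFrame(
    {
        "MMLU-Pro": [0.41, 0.38, 0.52, 0.47],
        "BBH":      [0.44, 0.40, 0.55, 0.50],
        "IFEval":   [0.61, 0.30, 0.70, 0.58],
        "GPQA":     [0.29, 0.27, 0.33, 0.31],
    }
)

# Pairwise Pearson correlations between benchmarks, rendered as a heatmap.
fig = px.imshow(
    scores.corr(),
    text_auto=".2f",
    color_continuous_scale="Viridis",
    title="Benchmark score correlations",
)

# Emit a self-contained HTML page that dist/index.html can embed via <iframe>.
fig.write_html(
    "dist/assets/scripts/correlation_heatmap.html",
    include_plotlyjs="cdn",  # reference plotly.js from a CDN instead of inlining it
    full_html=True,
)
```

Referencing plotly.js from a CDN rather than inlining it keeps each generated page small, which is helpful when every figure on the post is loaded in its own iframe.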