Nathan Habib committed on
Commit
7375a0d
1 Parent(s): d9cbfab
dist/assets/scripts/avg_ifeval_vs_all.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/correlation_heatmap.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/math_vs_avg_all.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/math_vs_gsm8k.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/model_size_vs_perf.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/normalized_vs_raw.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/nwe_scores_vs_old.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/plot.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/rankings_change.html CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -295,34 +295,40 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
  <div id="heatmap">
- <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="550" width="100%", style="border:none;"></iframe>
  </div>
  </div>

  <p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
  <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
- <div id="ifeval"></div>
+ <div id="ifeval">
+ <iframe src="assets/scripts/avg_ifeval_vs_all.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>


  <p>If you are especially interested in model knowledge rather than alignment or chat capabilities, the most relevant evaluations for you will likely be MMLU-Pro and GPQA.</p>
  <p>Let’s see how performances on these updated benchmarks compare to our evaluation on the previous version of the leaderboard.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
- <div id="mmlu"></div>
+ <div id="mmlu">
+ <iframe src="assets/scripts/nwe_scores_vs_old.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>


  <p>As we can see, both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower since GPQA is much harder. There is thus quite some room for model to improve – which is great news :)</p>
  <p>MATH-Lvl5 is, obviously, interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers as we can see on the following figure.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
- <div id="math"></div>
+ <div id="math">
+ <iframe src="assets/scripts/math_vs_gsm8k.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>In the green box, we highlight models which previously scored 0 on GSM8K due to evaluation limitations mentioned above, but now have very decent scores on the new benchmark MATH-Level5. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive in the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
@@ -335,9 +341,11 @@
  <p>Because backward compatibility and open knowledge is important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
  <p>Taking a step back to look at the evolution of all the 7400 evaluated models on the Open LLM Leaderboard through time, we can note some much wider trends in the field! For instance we see a strong trend going from larger (red dots) models to smaller (yellow dots) models, while at the same time improving performance.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
- <div id="timewise"></div>
+ <div id="timewise">
+ <iframe src="assets/scripts/model_size_vs_perf.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>This is great news for the field as smaller models are much easier to embedded as well as much more energy/memory/compute efficient and we hope to observe a similar pattern of progress in the new version of the leaderboard Given our harder benchmarks, our starting point is for now much lower (black dots) so let’s see where the field take us in a few months from now :)</p>
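The hunks above swap the static placeholder divs (heatmap, ifeval, mmlu, math, timewise) for iframe embeds pointing at the standalone plot pages added under dist/assets/scripts/. Below is a rough sketch of how one such self-contained page, e.g. correlation_heatmap.html, could be generated and written to the path the iframe loads; it assumes the figures are Plotly exports, which the commit does not state, and the benchmark names and scores are made-up placeholders.

```python
# Hypothetical sketch (not from the commit): build a standalone, embeddable
# correlation heatmap with Plotly and write it to the file the iframe loads.
# Benchmark names and score values below are illustrative placeholders only.
import pandas as pd
import plotly.express as px

scores = pd.DataFrame(
    {
        "MMLU-Pro": [0.41, 0.38, 0.52, 0.47],
        "BBH":      [0.44, 0.40, 0.55, 0.50],
        "IFEval":   [0.61, 0.30, 0.70, 0.58],
        "GPQA":     [0.29, 0.27, 0.33, 0.31],
    }
)

# Pairwise Pearson correlations between benchmarks, rendered as a heatmap.
fig = px.imshow(
    scores.corr(),
    text_auto=".2f",
    color_continuous_scale="Viridis",
    title="Benchmark score correlations",
)

# Emit a self-contained HTML page that dist/index.html can embed via <iframe>.
fig.write_html(
    "dist/assets/scripts/correlation_heatmap.html",
    include_plotlyjs="cdn",  # reference plotly.js from a CDN instead of inlining it
    full_html=True,
)
```

Referencing plotly.js from a CDN rather than inlining it keeps each generated page small, which is helpful when every figure on the post is loaded in its own iframe.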