Clémentine committed on
Commit 7f9759c
Parent: a9439f1

reorganised so that the scripts are loaded as assets

{dist → assets/scripts}/correlation_heatmap.html RENAMED
File without changes
{dist → assets/scripts}/normalized_vs_raw.html RENAMED
File without changes
{dist → assets/scripts}/plot.html RENAMED
File without changes
{dist → assets/scripts}/rankings_change.html RENAMED
File without changes
dist/assets/scripts/correlation_heatmap.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/normalized_vs_raw.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/plot.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/rankings_change.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -91,7 +91,7 @@

  <div class="main-plot-container">
  <div id="saturation">
- <iframe src="plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
+ <iframe src="assets/scripts/plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
  </div>
  </div>

@@ -142,10 +142,13 @@
  </ul>
  </ol>

- <p><em>Should we have included more evaluations?</em></p>
+ <aside>
+ <p><em>Should we have included more evaluations?</em></p>

- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
- <p>But selecting new benchmarks is not the whole story, we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ </aside>
+
+ <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>

  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
  <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. As a matter of example, in a benchmark containing two-choices for each questions, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are actually always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), it doesn’t change anything.</p>
@@ -155,7 +158,7 @@
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
  below div id, while leaving the image as such. -->
  <div id="normalisation">
- <iframe src="normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
  </div>
  </div>

@@ -188,7 +191,7 @@
  </li>
  </ul>
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
- <p>In this list, you’ll find LLMs from model creators who spent time and care on creating and delivering new cool models. We include big companies like Meta or Google, startups like Cohere or Mistral, collectives, like EleutherAI or NousResearch, and users, among many others.</p>
+ <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
  <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
  <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>

@@ -275,7 +278,7 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
  <div id="ranking">
- <iframe src="rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
  </div>
  </div>

@@ -287,7 +290,7 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
  <div id="heatmap">
- <iframe src="correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
  </div>
  </div>

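The normalization described in the paragraphs above amounts to a linear rescaling of each benchmark score from the [random baseline, maximum score] range onto [0, 100]. A minimal sketch of that arithmetic, not part of this commit (the function name and default are illustrative):

// Illustrative sketch only: rescale a raw benchmark score so that the random
// baseline maps to 0 and the maximum possible score maps to 100.
function normalizeScore(rawScore, randomBaseline, maxScore = 100) {
  return 100 * (rawScore - randomBaseline) / (maxScore - randomBaseline);
}

// Two-choice benchmark: random baseline is 50, so a raw 50 becomes 0 and a raw 75 becomes 50.
console.log(normalizeScore(50, 50)); // 0
console.log(normalizeScore(75, 50)); // 50
// Generative evaluations (e.g. IFEval, MATH) have a baseline of 0, so their scores are unchanged.
console.log(normalizeScore(80, 0));  // 80
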
src/index.html CHANGED
@@ -78,17 +78,21 @@

  <p>Here is why we think a new leaderboard was needed 👇</p>

+
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>

+
+
  <h3>The need for a more challenging leaderboard</h3>

  <p>
  Over the past year, the benchmarks we were using got overused/saturated:
  </p>

- <div class="l-body">
- <figure><img src="assets/images/saturation.png"/></figure>
- <div id="saturation"></div>
+ <div class="main-plot-container">
+ <div id="saturation">
+ <iframe src="assets/scripts/plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
+ </div>
  </div>

  <ol>
@@ -138,20 +142,24 @@
  </ul>
  </ol>

- <p><em>Should we have included more evaluations?</em></p>
+ <aside>
+ <p><em>Should we have included more evaluations?</em></p>

- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
- <p>But selecting new benchmarks is not the whole story, we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ </aside>
+
+ <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>

  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
  <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. As a matter of example, in a benchmark containing two-choices for each questions, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are actually always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), it doesn’t change anything.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <!--todo: if you use an interactive visualisation instead of a plot,
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
  below div id, while leaving the image as such. -->
- <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
- <div id="normalisation"></div>
+ <div id="normalisation">
+ <iframe src="assets/scripts/normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
@@ -183,7 +191,7 @@
  </li>
  </ul>
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
- <p>In this list, you’ll find LLMs from model creators who spent time and care on creating and delivering new cool models. We include big companies like Meta or Google, startups like Cohere or Mistral, collectives, like EleutherAI or NousResearch, and users, among many others.</p>
+ <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
  <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
  <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>

@@ -267,9 +275,11 @@

  <p>We also provide the most important top and bottom ranking changes.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
- <div id="ranking"></div>
+ <div id="ranking">
+ <iframe src="assets/scripts/rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <h3>Which evaluations should you pay most attention to?</h3>
@@ -277,9 +287,11 @@

  <p>For example, our different evaluations results are not all correlated with one another, which is expected.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
- <div id="heatmap"></div>
+ <div id="heatmap">
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>MMLU-Pro, BBH and ARC-challenge are well correlated together. It is known that these 3 are well correlated with human preference (as they tend to align with human judgment on LMSys’s chatbot arena).</p>
@@ -393,6 +405,40 @@
  }
  });
  }
+ function includeHTML() {
+ var z, i, elmnt, file, xhttp;
+ /* Loop through a collection of all HTML elements: */
+ z = document.getElementsByTagName("*");
+ for (i = 0; i < z.length; i++) {
+ elmnt = z[i];
+ /*search for elements with a certain atrribute:*/
+ file = elmnt.getAttribute("w3-include-html");
+ /* print the file on the console */
+ console.log("HELP");
+ console.log(file);
+
+ if (file) {
+ /* Make an HTTP request using the attribute value as the file name: */
+ xhttp = new XMLHttpRequest();
+ xhttp.onreadystatechange = function() {
+ if (this.readyState == 4) {
+ if (this.status == 200) {elmnt.innerHTML = this.responseText;}
+ if (this.status == 404) {elmnt.innerHTML = "Page not found.";}
+ /* Remove the attribute, and call this function once more: */
+ elmnt.removeAttribute("w3-include-html");
+ includeHTML();
+ }
+ }
+ xhttp.open("GET", file, true);
+ xhttp.send();
+ /* Exit the function: */
+ return;
+ }
+ }
+ }
+ </script>
+ <script>
+ includeHTML();
  </script>
  </body>
- </html>
+ </html>
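
The includeHTML() helper added at the end of this diff fills any element carrying a w3-include-html attribute with the fetched contents of the referenced file, then re-runs itself until no such attribute remains. A hypothetical usage sketch, not part of this commit (the commit itself embeds the plots through the iframe tags shown above):

<!-- Hypothetical example: the div and file name are illustrative. -->
<div w3-include-html="assets/scripts/plot.html"></div>
<script>
  // Walks the DOM, fetches each referenced file with XMLHttpRequest,
  // injects it into the element, then calls itself again for the next match.
  includeHTML();
</script>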