Clémentine committed on
Commit 7f9759c
Parent: a9439f1

reorganised so that the scripts are loaded as assets

{dist → assets/scripts}/correlation_heatmap.html RENAMED
File without changes
{dist → assets/scripts}/normalized_vs_raw.html RENAMED
File without changes
{dist → assets/scripts}/plot.html RENAMED
File without changes
{dist → assets/scripts}/rankings_change.html RENAMED
File without changes
dist/assets/scripts/correlation_heatmap.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/normalized_vs_raw.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/plot.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/assets/scripts/rankings_change.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -91,7 +91,7 @@

  <div class="main-plot-container">
  <div id="saturation">
- <iframe src="plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
+ <iframe src="assets/scripts/plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
  </div>
  </div>

@@ -142,10 +142,13 @@
  </ul>
  </ol>

- <p><em>Should we have included more evaluations?</em></p>
+ <aside>
+ <p><em>Should we have included more evaluations?</em></p>

- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
- <p>But selecting new benchmarks is not the whole story, we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ </aside>
+
+ <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>

  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
  <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. As a matter of example, in a benchmark containing two-choices for each questions, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are actually always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), it doesn’t change anything.</p>
@@ -155,7 +158,7 @@
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
  below div id, while leaving the image as such. -->
  <div id="normalisation">
- <iframe src="normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
  </div>
  </div>

@@ -188,7 +191,7 @@
  </li>
  </ul>
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
- <p>In this list, you’ll find LLMs from model creators who spent time and care on creating and delivering new cool models. We include big companies like Meta or Google, startups like Cohere or Mistral, collectives, like EleutherAI or NousResearch, and users, among many others.</p>
+ <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
  <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
  <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>

@@ -275,7 +278,7 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
  <div id="ranking">
- <iframe src="rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
  </div>
  </div>

@@ -287,7 +290,7 @@
  <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
  <div id="heatmap">
- <iframe src="correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
  </div>
  </div>

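The normalization described in the paragraphs above amounts to a linear rescaling of each benchmark score from the [random baseline, maximum score] range onto [0, 100]. A minimal sketch of that arithmetic, not part of this commit (the function name and default are illustrative):

// Illustrative sketch only: rescale a raw benchmark score so that the random
// baseline maps to 0 and the maximum possible score maps to 100.
function normalizeScore(rawScore, randomBaseline, maxScore = 100) {
  return 100 * (rawScore - randomBaseline) / (maxScore - randomBaseline);
}

// Two-choice benchmark: random baseline is 50, so a raw 50 becomes 0 and a raw 75 becomes 50.
console.log(normalizeScore(50, 50)); // 0
console.log(normalizeScore(75, 50)); // 50
// Generative evaluations (e.g. IFEval, MATH) have a baseline of 0, so their scores are unchanged.
console.log(normalizeScore(80, 0));  // 80
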
src/index.html CHANGED
@@ -78,17 +78,21 @@

  <p>Here is why we think a new leaderboard was needed 👇</p>

+
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>

+
+
  <h3>The need for a more challenging leaderboard</h3>

  <p>
  Over the past year, the benchmarks we were using got overused/saturated:
  </p>

- <div class="l-body">
- <figure><img src="assets/images/saturation.png"/></figure>
- <div id="saturation"></div>
+ <div class="main-plot-container">
+ <div id="saturation">
+ <iframe src="assets/scripts/plot.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
+ </div>
  </div>

  <ol>
@@ -138,20 +142,24 @@
  </ul>
  </ol>

- <p><em>Should we have included more evaluations?</em></p>
+ <aside>
+ <p><em>Should we have included more evaluations?</em></p>

- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
- <p>But selecting new benchmarks is not the whole story, we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ </aside>
+
+ <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>

  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
  <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. As a matter of example, in a benchmark containing two-choices for each questions, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are actually always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), it doesn’t change anything.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <!--todo: if you use an interactive visualisation instead of a plot,
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
  below div id, while leaving the image as such. -->
- <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
- <div id="normalisation"></div>
+ <div id="normalisation">
+ <iframe src="assets/scripts/normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
@@ -183,7 +191,7 @@
  </li>
  </ul>
  <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
- <p>In this list, you’ll find LLMs from model creators who spent time and care on creating and delivering new cool models. We include big companies like Meta or Google, startups like Cohere or Mistral, collectives, like EleutherAI or NousResearch, and users, among many others.</p>
+ <p>In this list, you’ll find LLMs from model creators who spent time, care, and a lot of compute on creating and delivering new cool models, from Meta and Google to Cohere or Mistral, as well as collectives like EleutherAI or NousResearch and community users.</p>
  <p>This list will be evolutive based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models in priority, as they are more valuable to the community.</p>
  <p>We hope it will also make it easier for non ML users to better make their choice among the many, many models we evaluate.</p>

@@ -267,9 +275,11 @@

  <p>We also provide the most important top and bottom ranking changes.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
- <div id="ranking"></div>
+ <div id="ranking">
+ <iframe src="assets/scripts/rankings_change.html" title="description", height="800" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <h3>Which evaluations should you pay most attention to?</h3>
@@ -277,9 +287,11 @@

  <p>For example, our different evaluations results are not all correlated with one another, which is expected.</p>

- <div class="l-body">
+ <div class="main-plot-container">
  <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
- <div id="heatmap"></div>
+ <div id="heatmap">
+ <iframe src="assets/scripts/correlation_heatmap.html" title="description", height="500" width="100%", style="border:none;"></iframe>
+ </div>
  </div>

  <p>MMLU-Pro, BBH and ARC-challenge are well correlated together. It is known that these 3 are well correlated with human preference (as they tend to align with human judgment on LMSys’s chatbot arena).</p>
@@ -393,6 +405,40 @@
  }
  });
  }
+ function includeHTML() {
+ var z, i, elmnt, file, xhttp;
+ /* Loop through a collection of all HTML elements: */
+ z = document.getElementsByTagName("*");
+ for (i = 0; i < z.length; i++) {
+ elmnt = z[i];
+ /*search for elements with a certain atrribute:*/
+ file = elmnt.getAttribute("w3-include-html");
+ /* print the file on the console */
+ console.log("HELP");
+ console.log(file);
+
+ if (file) {
+ /* Make an HTTP request using the attribute value as the file name: */
+ xhttp = new XMLHttpRequest();
+ xhttp.onreadystatechange = function() {
+ if (this.readyState == 4) {
+ if (this.status == 200) {elmnt.innerHTML = this.responseText;}
+ if (this.status == 404) {elmnt.innerHTML = "Page not found.";}
+ /* Remove the attribute, and call this function once more: */
+ elmnt.removeAttribute("w3-include-html");
+ includeHTML();
+ }
+ }
+ xhttp.open("GET", file, true);
+ xhttp.send();
+ /* Exit the function: */
+ return;
+ }
+ }
+ }
+ </script>
+ <script>
+ includeHTML();
  </script>
  </body>
- </html>
+ </html>
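
The includeHTML() helper added at the end of this diff fills any element carrying a w3-include-html attribute with the fetched contents of the referenced file, then re-runs itself until no such attribute remains. A hypothetical usage sketch, not part of this commit (the commit itself embeds the plots through the iframe tags shown above):

<!-- Hypothetical example: the div and file name are illustrative. -->
<div w3-include-html="assets/scripts/plot.html"></div>
<script>
  // Walks the DOM, fetches each referenced file with XMLHttpRequest,
  // injects it into the element, then calls itself again for the next match.
  includeHTML();
</script>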