osanseviero (HF staff) committed
Commit 5405ff7
Parent: fb5b21a

Update src/index.html
Files changed (1):
  1. src/index.html (+12 -12)

--- a/src/index.html
+++ b/src/index.html
@@ -136,13 +136,13 @@
   <aside>
   <p><em>Should we have included more evaluations?</em></p>
 
- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. We wanted to include many other evaluations (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. We wanted to include many other evaluations (MTBench, AGIEval, DROP, etc.), but we are, in the end, still compute-constrained. So, to keep the evaluation budgets under control, we ranked evals according to our above criterion and kept the top-ranking benchmarks. This is also why we didn’t select any benchmark requiring using another model as a judge.</p>
   </aside>
 
- <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
+ <p>Selecting new benchmarks is not the whole story. We also made several other interesting improvements to the leaderboard, which we’ll now briefly cover.</p>
 
   <h3>Reporting a fairer average for ranking: using normalized scores</h3>
- <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. As a matter of example, in a benchmark containing two-choices for each questions, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are actually always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), it doesn’t change anything.</p>
+ <p>We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. For example, in a benchmark containing two choices for each question, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are always between 50 (the lowest score you reasonably get if the benchmark is not adversarial) and 100. We, therefore, change the range so that a 50 on the raw score is a 0 on the normalized score. This does not change anything for generative evaluations like IFEval or MATH.</p>
 
   <div class="main-plot-container">
   <!--todo: if you use an interactive visualisation instead of a plot,
@@ -153,23 +153,23 @@
   </div>
   </div>
 
- <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
- <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
- <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>
+ <p>This change is more significant than it may seem, as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
+ <p>In the figure above, we plot the mean scores for our evaluations, with normalized scores on the left and raw scores on the right. If you look at the right side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our two hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
+ <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice, with two option samples. Model A gets 0 on the generative evaluation and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. Looking at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are, in fact, both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0, and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>
 
   <h3>Easier reproducibility: updating the evaluation suite</h3>
- <p>A year ago, we made the choice to use the Harness (lm-eval) from EleutherAI to power our evaluations. It provides a standard and stable implementation for a number of tasks. To ensure fairness and reproducibility, we pinned the version we were using, which allowed us to compare all models in an apples to apples setup, as all evaluations were run in exactly the same way, on the same hardware, using the same evaluation suite commit and parameters.</p>
- <p>However, <code>lm-eval</code> evolved, and the implementation of some tasks or metrics changed, which led to discrepancies between 1) evaluation results people would get on more recent versions of the harness and 2) our results using our pinned version.</p>
+ <p>A year ago, we chose to use the Harness (lm-eval) from EleutherAI to power our evaluations. It provides a standard and stable implementation for several tasks. To ensure fairness and reproducibility, we pinned the version we were using. This allowed us to compare all models in an apples-to-apples setup, as all evaluations were run in exactly the same way, on the same hardware, using the same evaluation suite commit and parameters.</p>
+ <p>However, <code>lm-eval</code> evolved, and the implementation of some tasks or metrics changed, which led to discrepancies between 1) the evaluation results people would get on more recent versions of the harness and 2) our results using our pinned version.</p>
   <p>For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf, so many, huge kudos!) to update the harness.</p>
- <p>Features side, we added in the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
- <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and fix the problems we observed with inconsistent few shot samples, too restrictive end of sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>
+ <p>On the features side, we added the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
+ <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly and fix the problems we observed with inconsistent few shot samples, too restrictive end-of-sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>
 
   <gradio-app src="https://open-llm-leaderboard-GenerationVisualizer.hf.space"></gradio-app>
 
- <p>You can explore the visualiser we used here!</p>
+ <p>You can explore the visualizer we used here!</p>
 
   <p>This should allow us to keep our version up to date with new features added in the future!</p>
- <p>Enough said on the leaderboard backend and metrics, now let’s turn to the models and model selection/submission.
+ <p>Enough was said on the leaderboard backend and metrics. Now, let’s turn to the models and model selection/submission.</p>
 
   <h2>Focusing on the models most relevant to the community</h2>
   <h3>Introducing the <em>maintainer’s highlight</em></h3>
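
The normalization the updated paragraph describes maps a benchmark's random baseline to 0 and its maximum score to 100. A minimal sketch of that rescaling in Python (the function name, the clamping of sub-baseline scores to 0, and the example values are illustrative assumptions, not the leaderboard's actual code):

    def normalize_score(raw_score: float, random_baseline: float, max_score: float = 100.0) -> float:
        """Rescale a raw benchmark score so that the random baseline maps to 0
        and the maximum possible score maps to 100 (sketch only; clamping
        scores below the baseline to 0 is an assumption)."""
        rescaled = (raw_score - random_baseline) / (max_score - random_baseline) * 100.0
        return max(0.0, rescaled)

    # A two-choice multiple-choice benchmark has a random baseline of 50/100:
    print(normalize_score(50.0, random_baseline=50.0))   # 0.0   (random guessing)
    print(normalize_score(75.0, random_baseline=50.0))   # 50.0
    print(normalize_score(100.0, random_baseline=50.0))  # 100.0

    # Generative tasks such as IFEval or MATH have a 0-point random baseline,
    # so the rescaling leaves their scores unchanged:
    print(normalize_score(37.0, random_baseline=0.0))    # 37.0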
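
The model A / model B example in the second hunk can be checked numerically. The short sketch below applies the same rescaling, again assuming scores below the random baseline are clamped to 0; the exact per-task normalized values depend on that assumption, but the point being illustrated is the ranking flip between raw and normalized averages:

    # Two hard evaluations: one generative (random baseline 0) and one
    # two-choice multichoice (random baseline 50), both scored out of 100.
    def normalize(raw: float, baseline: float) -> float:
        return max(0.0, (raw - baseline) / (100.0 - baseline) * 100.0)

    baselines = {"generative": 0.0, "multichoice": 50.0}
    models = {
        "A": {"generative": 0.0, "multichoice": 52.0},
        "B": {"generative": 10.0, "multichoice": 40.0},
    }

    for name, scores in models.items():
        raw_avg = sum(scores.values()) / len(scores)
        norm_avg = sum(normalize(scores[t], baselines[t]) for t in scores) / len(scores)
        print(f"Model {name}: raw average = {raw_avg:.1f}, normalized average = {norm_avg:.1f}")

    # Raw averages rank A (26.0) above B (25.0); normalized averages rank
    # B (5.0) above A (2.0), because A's 52 on the two-choice task is barely
    # above random guessing while B's 10 on the generative task is real signal.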
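
The reproducibility hunk describes pinning a specific version of EleutherAI's lm-eval and evaluating with chat templates. The sketch below shows what such a run might look like from Python; it assumes the v0.4-era lm_eval.simple_evaluate API and its apply_chat_template option, and the task name, model, and version pin are placeholders rather than the leaderboard's actual configuration, so check the harness documentation for the exact task names and arguments:

    # Sketch only: assumes a pinned lm-evaluation-harness with the v0.4-era API,
    # e.g. installed with `pip install "lm_eval==0.4.3"` (placeholder pin).
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face Transformers backend
        model_args="pretrained=HuggingFaceH4/zephyr-7b-beta",  # placeholder model
        tasks=["leaderboard_ifeval"],  # assumed leaderboard task name; verify against the harness
        apply_chat_template=True,      # assumes the chat-template support described above
        batch_size=8,
        limit=10,                      # small limit just to smoke-test the setup
    )

    print(results["results"])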