Clémentine committed
Commit: b76606d
Parent(s): 7f9759c

small typo

Changed files:
- dist/index.html +1 -1
- src/index.html +1 -1
dist/index.html CHANGED
@@ -163,7 +163,7 @@
         </div>

         <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
-        <p>On the above figure, we plot the mean scores for our evaluations, with normalized
+        <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
         <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

         <h3>Easier reproducibility: updating the evaluation suite</h3>
src/index.html CHANGED
@@ -163,7 +163,7 @@
         </div>

         <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
-        <p>On the above figure, we plot the mean scores for our evaluations, with normalized
+        <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
         <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

         <h3>Easier reproducibility: updating the evaluation suite</h3>
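The ranking flip described in the changed paragraphs can be checked numerically. A minimal sketch of baseline normalization, assuming a rescale where the random baseline maps to 0 and a perfect score to 100, with negative values clipped to 0 (the `normalized` helper is an illustration, not the leaderboard's exact code):

```python
def normalized(raw: float, baseline: float) -> float:
    """Rescale so the random baseline maps to 0 and 100 stays 100; clip below 0."""
    return max(0.0, (raw - baseline) / (100.0 - baseline) * 100.0)

# Model A: 0 on the generative task (baseline 0), 52 on the 2-choice task (baseline 50).
# Model B: 10 on the generative task, 40 on the 2-choice task.
a_gen, a_mc = normalized(0, 0), normalized(52, 50)
b_gen, b_mc = normalized(10, 0), normalized(40, 50)

raw_avg_a = (0 + 52) / 2          # 26.0 -> A looks better on raw averages
raw_avg_b = (10 + 40) / 2         # 25.0
norm_avg_a = (a_gen + a_mc) / 2   # 2.0  -> near-random multichoice score barely counts
norm_avg_b = (b_gen + b_mc) / 2   # 5.0  -> B ranks higher once chance is discounted

print(raw_avg_a, raw_avg_b, norm_avg_a, norm_avg_b)
```

Under this normalization the ordering reverses: A wins on raw averages (26 vs 25), while B wins on normalized averages (5 vs 2), matching the paragraph's conclusion.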