Clémentine committed
Commit: b76606d
Parent(s): 7f9759c

small typo

Changed files:
- dist/index.html +1 -1
- src/index.html +1 -1
dist/index.html CHANGED
@@ -163,7 +163,7 @@
         </div>

         <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
-        <p>On the above figure, we plot the mean scores for our evaluations, with normalized
+        <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
         <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

         <h3>Easier reproducibility: updating the evaluation suite</h3>
src/index.html CHANGED
@@ -163,7 +163,7 @@
         </div>

         <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
-        <p>On the above figure, we plot the mean scores for our evaluations, with normalized
+        <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
         <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

         <h3>Easier reproducibility: updating the evaluation suite</h3>
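The ranking flip described in the changed paragraphs can be checked numerically. A minimal sketch of baseline normalization, assuming a rescale where the random baseline maps to 0 and a perfect score to 100, with negative values clipped to 0 (the `normalized` helper is an illustration, not the leaderboard's exact code):

```python
def normalized(raw: float, baseline: float) -> float:
    """Rescale so the random baseline maps to 0 and 100 stays 100; clip below 0."""
    return max(0.0, (raw - baseline) / (100.0 - baseline) * 100.0)

# Model A: 0 on the generative task (baseline 0), 52 on the 2-choice task (baseline 50).
# Model B: 10 on the generative task, 40 on the 2-choice task.
a_gen, a_mc = normalized(0, 0), normalized(52, 50)
b_gen, b_mc = normalized(10, 0), normalized(40, 50)

raw_avg_a = (0 + 52) / 2          # 26.0 -> A looks better on raw averages
raw_avg_b = (10 + 40) / 2         # 25.0
norm_avg_a = (a_gen + a_mc) / 2   # 2.0  -> near-random multichoice score barely counts
norm_avg_b = (b_gen + b_mc) / 2   # 5.0  -> B ranks higher once chance is discounted

print(raw_avg_a, raw_avg_b, norm_avg_a, norm_avg_b)
```

Under this normalization the ordering reverses: A wins on raw averages (26 vs 25), while B wins on normalized averages (5 vs 2), matching the paragraph's conclusion.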