Commit b76606d (parent: 7f9759c), committed by Clémentine

small typo

Files changed (2):
  1. dist/index.html +1 -1
  2. src/index.html +1 -1
dist/index.html CHANGED
@@ -163,7 +163,7 @@
   </div>

   <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
- <p>On the above figure, we plot the mean scores for our evaluations, with normalized scoreon the right, and raw scores on the left. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
+ <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
   <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

   <h3>Easier reproducibility: updating the evaluation suite</h3>
src/index.html CHANGED
@@ -163,7 +163,7 @@
   </div>

   <p>This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
- <p>On the above figure, we plot the mean scores for our evaluations, with normalized scoreon the right, and raw scores on the left. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
+ <p>On the above figure, we plot the mean scores for our evaluations, with normalized scores on the left, and raw scores on the right. If you take a look at the right hand side, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA, which is considerably harder (PhD level questions!) - most models of today get close to random performance on it, and there is thus a huge difference between unnormalized score and normalized score where the random number baseline is assigned zero points!</p>
   <p>This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with 2 option samples. Model A gets 0 on the generative evaluation, and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets 0 and B gets around 1. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and almost 0 for model A, hence a very different ranking.</p>

   <h3>Easier reproducibility: updating the evaluation suite</h3>
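The ranking flip described in the second changed paragraph can be sketched in a few lines of Python. The `normalize` helper below is an assumption, not the leaderboard's actual code: it is one plausible reading of "the random number baseline is assigned zero points" (rescale so the baseline maps to 0 and a perfect score to 100, clipping below-random scores at 0), so the exact intermediate values may differ from the blog's rounded figures.

```python
def normalize(raw: float, baseline: float) -> float:
    """Map the random baseline to 0 and a perfect score to 100; clip below-random scores.

    Assumed rescaling, sketching the normalization the paragraph describes.
    """
    return max(0.0, (raw - baseline) / (100.0 - baseline) * 100.0)

# Model A: 0 on the generative eval (baseline 0), 52 on the 2-choice multichoice (baseline 50).
a_norm = [normalize(0, 0), normalize(52, 50)]    # 52 is barely above random
# Model B: 10 on the generative eval, 40 on the multichoice (below random, clipped to 0).
b_norm = [normalize(10, 0), normalize(40, 50)]

raw_avg_a = (0 + 52) / 2     # 26.0 -> model A looks better on raw averages
raw_avg_b = (10 + 40) / 2    # 25.0
norm_avg_a = sum(a_norm) / 2  # close to 0 once the near-random multichoice score is deflated
norm_avg_b = sum(b_norm) / 2  # model B ranks higher after normalization
```

Running this reproduces the flip: model A wins on raw averages (26 vs 25), but model B wins on normalized averages, since its 10-point generative lead is real while both multichoice scores collapse toward 0.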