Nathan Habib committed on
Commit
d754c6a
1 Parent(s): 75a42bb
Files changed (1)
  1. dist/index.html +4 -3
dist/index.html CHANGED
@@ -149,12 +149,13 @@
 <h3>Reporting a fairer average for ranking: using normalized scores</h3>
 <p>We decided to change the final grade for each model. Instead of summing each benchmark's output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points), then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores actually always fall between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a 50 on the raw score becomes a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>

-<div class="l-body">
+<div class="main-plot-container">
 <!--todo: if you use an interactive visualisation instead of a plot,
 replace the class `l-body` by `main-plot-container` and import your interactive plot in the
 below div id, while leaving the image as such. -->
-<iframe src="normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
-<div id="normalisation"></div>
+<div id="normalisation">
+<iframe src="normalized_vs_raw.html" title="description" height="500" width="90%" style="border:none;"></iframe>
+</div>
 </div>

 <p>This change is more significant than it may seem, as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
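The normalization described in the committed paragraph can be sketched as follows. This is an illustrative reimplementation, not code from the repository; the function name and the sample baselines are assumptions.

```python
def normalize_score(raw: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw benchmark score so that the random baseline maps to 0
    and the maximal possible score maps to 100."""
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)


# Two-choice benchmark: a random baseline scores 50 out of 100.
print(normalize_score(50.0, random_baseline=50.0))   # random guessing -> 0.0
print(normalize_score(75.0, random_baseline=50.0))   # halfway above chance -> 50.0
print(normalize_score(100.0, random_baseline=50.0))  # perfect score -> 100.0

# Generative evaluations (e.g. IFEval, MATH) have a random baseline of 0,
# so normalization leaves the score unchanged.
print(normalize_score(42.0, random_baseline=0.0))    # -> 42.0
```

This also illustrates the re-weighting remark above: each benchmark's raw score is multiplied by `100 / (100 - baseline)`, so benchmarks with a higher chance baseline contribute a steeper slope to the final average than those with a baseline of 0.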