Nathan Habib committed on
Commit
d754c6a
1 Parent(s): 75a42bb
Files changed (1)
  1. dist/index.html +4 -3
dist/index.html CHANGED
@@ -149,12 +149,13 @@
 <h3>Reporting a fairer average for ranking: using normalized scores</h3>
 <p>We decided to change the final grade for each model. Instead of summing each benchmark's output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points), then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores actually always fall between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a 50 on the raw score becomes a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>

-<div class="l-body">
+<div class="main-plot-container">
 <!--todo: if you use an interactive visualisation instead of a plot,
 replace the class `l-body` by `main-plot-container` and import your interactive plot in the
 below div id, while leaving the image as such. -->
-<iframe src="normalized_vs_raw.html" title="description", height="500" width="90%", style="border:none;"></iframe></p>
-<div id="normalisation"></div>
+<div id="normalisation">
+<iframe src="normalized_vs_raw.html" title="description" height="500" width="90%" style="border:none;"></iframe>
+</div>
 </div>

 <p>This change is more significant than it may seem, as it can be seen as changing the weight assigned to each benchmark in the final average score.</p>
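The normalization described in the committed paragraph can be sketched as follows. This is an illustrative reimplementation, not code from the repository; the function name and the sample baselines are assumptions.

```python
def normalize_score(raw: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw benchmark score so that the random baseline maps to 0
    and the maximal possible score maps to 100."""
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)


# Two-choice benchmark: a random baseline scores 50 out of 100.
print(normalize_score(50.0, random_baseline=50.0))   # random guessing -> 0.0
print(normalize_score(75.0, random_baseline=50.0))   # halfway above chance -> 50.0
print(normalize_score(100.0, random_baseline=50.0))  # perfect score -> 100.0

# Generative evaluations (e.g. IFEval, MATH) have a random baseline of 0,
# so normalization leaves the score unchanged.
print(normalize_score(42.0, random_baseline=0.0))    # -> 42.0
```

This also illustrates the re-weighting remark above: each benchmark's raw score is multiplied by `100 / (100 - baseline)`, so benchmarks with a higher chance baseline contribute a steeper slope to the final average than those with a baseline of 0.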