Clémentine committed on
Commit d80af64
1 Parent(s): 05d8ce4

removed arc

Files changed (2)
  1. dist/index.html +3 -3
  2. src/index.html +3 -3
dist/index.html CHANGED
@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community: ARC, BBH, IFeval, MATH</li>
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval, ARC</li>
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>
 
-<p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>
 
 <div class="main-plot-container">
src/index.html CHANGED
@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community: ARC, BBH, IFeval, MATH</li>
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval, ARC</li>
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>
 
-<p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>
 
 <div class="main-plot-container">