Clémentine committed on
Commit d80af64
1 Parent(s): 05d8ce4

removed arc

Files changed (2)
  1. dist/index.html +3 -3
  2. src/index.html +3 -3
dist/index.html CHANGED
@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community: ARC, BBH, IFeval, MATH</li>
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval, ARC</li>
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>
 
-<p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>
 
 <div class="main-plot-container">
src/index.html CHANGED
@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community: ARC, BBH, IFeval, MATH</li>
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval, ARC</li>
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>
 
-<p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As it’s been also noted by other teams, these 3 benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>
 
 <div class="main-plot-container">