Nathan Habib committed on
Commit
75a42bb
•
1 Parent(s): 2d1ad89
dist/correlation_heatmap.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -107,10 +107,7 @@
107
  <p>We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.</p>
108
  <p>We cover these tasks with 6 benchmarks. Let us present them briefly:</p>
109
 
110
- <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.
111
- </p>
112
- <iframe src="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/viewer" title="description", height="500" width="90%", style="border:none;"></iframe>
113
-
114
  <p>📚 <strong>GPQA</strong> (Google-Proof Q&amp;A Benchmark, <a href="https://arxiv.org/abs/2311.12022">paper</a>). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, …) to be hard to answer by laypersons, but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce the risks of contamination. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).</p>
115
  <p><strong>MuSR</strong> (Multistep Soft Reasoning, <a href="https://arxiv.org/abs/2310.16049">paper</a>). MuSR is a very fun new dataset, made of algorithmically generated complex problems of around 1K words in length. Problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long range context parsing. Few models score better than random performance.</p>
116
  <p>🧮 <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a href="https://arxiv.org/abs/2103.03874">paper</a>). MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.</p>
@@ -156,7 +153,7 @@
156
  <!--todo: if you use an interactive visualisation instead of a plot,
157
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
158
  below div id, while leaving the image as such. -->
159
- <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
160
  <div id="normalisation"></div>
161
  </div>
162
 
 
107
  <p>We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.</p>
108
  <p>We cover these tasks with 6 benchmarks. Let us present them briefly:</p>
109
 
110
+ <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.</p>
 
 
 
111
  <p>📚 <strong>GPQA</strong> (Google-Proof Q&amp;A Benchmark, <a href="https://arxiv.org/abs/2311.12022">paper</a>). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, …) to be hard to answer by laypersons, but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce the risks of contamination. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).</p>
112
  <p><strong>MuSR</strong> (Multistep Soft Reasoning, <a href="https://arxiv.org/abs/2310.16049">paper</a>). MuSR is a very fun new dataset, made of algorithmically generated complex problems of around 1K words in length. Problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long range context parsing. Few models score better than random performance.</p>
113
  <p>🧮 <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a href="https://arxiv.org/abs/2103.03874">paper</a>). MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.</p>
 
153
  <!--todo: if you use an interactive visualisation instead of a plot,
154
  replace the class `l-body` by `main-plot-container` and import your interactive plot in the
155
  below div id, while leaving the image as such. -->
156
+ <iframe src="normalized_vs_raw.html" title="description" height="500" width="90%" style="border:none;"></iframe>
157
  <div id="normalisation"></div>
158
  </div>
159
 
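The MMLU-Pro paragraph kept in index.html above notes that each question comes with 10 answer choices instead of 4, while the commit removes the embedded dataset-viewer iframe for it. For readers who want to inspect the data directly, here is a minimal sketch of loading the dataset and formatting one question; the field names (`question`, `options`, `answer`) are assumptions about the dataset schema, not something this commit specifies.

```python
# Minimal sketch: load MMLU-Pro and format one 10-choice question.
# Field names ("question", "options", "answer") are assumed, not taken from this commit.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def format_question(example):
    # Label the (up to) 10 options A..J, as described in the paragraph above.
    letters = "ABCDEFGHIJ"
    lines = [example["question"]]
    for letter, option in zip(letters, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_question(ds[0]))
print("Reference answer:", ds[0]["answer"])
```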
dist/normalized_vs_raw.html ADDED
The diff for this file is too large to render. See raw diff
 
dist/rankings_change.html ADDED
The diff for this file is too large to render. See raw diff
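
This commit adds three standalone HTML files (correlation_heatmap.html, normalized_vs_raw.html, rankings_change.html) that are too large to render here, and index.html now embeds one of them through an iframe instead of the static normalized_vs_raw_scores.png figure. The diff does not show how these files were produced; as a purely hypothetical sketch, a self-contained plot like normalized_vs_raw.html could be written with Plotly as follows (the library choice, data values, and column names are all illustrative assumptions).

```python
# Hypothetical sketch of generating a standalone, embeddable plot file such as
# dist/normalized_vs_raw.html. Plotly is an assumption; the diff does not show
# how these files were produced. The data below is illustrative only.
import pandas as pd
import plotly.express as px

scores = pd.DataFrame(
    {
        "model": ["model-a", "model-b", "model-c"],
        "raw_score": [0.31, 0.47, 0.62],
        "normalized_score": [0.08, 0.29, 0.49],
    }
)

fig = px.scatter(
    scores,
    x="raw_score",
    y="normalized_score",
    text="model",
    title="Raw vs. normalized scores (illustrative data)",
)

# include_plotlyjs="cdn" keeps the file small while still loading correctly
# inside an <iframe>, as done in dist/index.html.
fig.write_html("normalized_vs_raw.html", include_plotlyjs="cdn")
```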