Nathan Habib committed on
Commit 2d1ad89 • 1 Parent(s): 305bd94

add dataset viewer

Files changed (1)
  1. dist/index.html +4 -1
dist/index.html CHANGED
@@ -107,7 +107,10 @@
  <p>We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.</p>
  <p>We cover these tasks with 6 benchmarks. Let us present them briefly:</p>
 
- <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multiple-choice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.</p>
+ <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multiple-choice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.
+ </p>
+ <iframe src="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/viewer" title="description" height="500" width="90%" style="border:none;"></iframe>
+
  <p>📚 <strong>GPQA</strong> (Google-Proof Q&amp;A Benchmark, <a href="https://arxiv.org/abs/2311.12022">paper</a>). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, …) to be hard to answer by laypersons, but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce the risks of contamination. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper.)</p>
  <p><strong>MuSR</strong> (Multistep Soft Reasoning, <a href="https://arxiv.org/abs/2310.16049">paper</a>). MuSR is a very fun new dataset, made of algorithmically generated complex problems of around 1K words in length. Problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long-range context parsing. Few models score better than random performance.</p>
  <p>🧮 <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a href="https://arxiv.org/abs/2103.03874">paper</a>). MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.</p>