Performances are plateauing, let's make the leaderboard steep again
Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago when they wanted to reproduce and compare results from several published models. It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, were sometimes doubtful, and in most cases just relied on optimized prompts or evaluation setups to give the models the best possible chance. They therefore decided to create a place where reference models would be evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the Open LLM Leaderboard was born!

Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.

Around 300,000 community members use and collaborate on it monthly through submissions and discussions, usually to:

However, with success came challenges, both for the leaderboard itself and from the steadily increasing performance of the models. After one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!

Here is why we think a new leaderboard was needed 👇

Harder, better, faster, stronger: Introducing the Leaderboard v2

The need for a more challenging leaderboard

  1. They became too easy for models. For instance, models are now reaching baseline human performance on HellaSwag, MMLU, and ARC, a phenomenon called saturation.
  2. Some newer models also showed signs of contamination. By this, we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting a model's general ability on the task being tested and instead started to reflect overfitting to specific evaluation datasets. This was, in particular, the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets.
  3. Some benchmarks contained errors. MMLU was recently investigated in depth by several groups (see MMLU-Redux and MMLU-Pro), which surfaced mistakes in its responses and proposed new versions. Another example was that GSM8K used a specific end-of-generation token (:), which unfairly pushed down the performance of many verbose models (see the sketch just after this list).
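
To make the GSM8K issue in the last point concrete, here is a minimal, hypothetical sketch of how a premature stop sequence such as (:) can cut off a verbose chain-of-thought answer before the final number appears, so the scorer finds nothing to grade. The generation text and helper functions are invented for illustration; they are not the leaderboard's actual harness code.

    import re

    def truncate_at_stop(generation: str, stop_sequences: list[str]) -> str:
        """Cut the generation at the first occurrence of any stop sequence."""
        cut = len(generation)
        for stop in stop_sequences:
            idx = generation.find(stop)
            if idx != -1:
                cut = min(cut, idx)
        return generation[:cut]

    def extract_final_number(text: str) -> str | None:
        """Return the last number in the text, the way a GSM8K-style scorer might."""
        numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
        return numbers[-1].replace(",", "") if numbers else None

    # A verbose (but correct) hypothetical model answer.
    generation = "Let's reason step by step: the store sells 4 * 5 = 20 apples, so the answer is 20."

    # With ":" as a stop sequence, the generation is cut before any number appears.
    truncated = truncate_at_stop(generation, [":"])
    print(extract_final_number(truncated))   # None -> marked incorrect
    print(extract_final_number(generation))  # "20" -> marked correct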

We thus chose to completely change the evaluations we are running for the Open LLM Leaderboard v2!

Rebooting our evaluation selection

We started looking for new benchmarks with uncontaminated, high-quality datasets, using reliable metrics and measuring model capabilities of interest.

We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.

We cover these tasks with six benchmarks. Let us present them briefly:

📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities and increased contamination). MMLU-Pro presents the models with ten choices instead of four, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is of higher quality than the original and harder.
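
As a quick illustration (not part of the leaderboard pipeline), an MMLU-Pro item can be inspected directly from the Hub. The repository id TIGER-Lab/MMLU-Pro and the question/options/answer column names below are assumptions, so check the dataset card before relying on this snippet.

    from datasets import load_dataset

    # Assumed Hub id and column names; verify against the dataset card.
    mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

    sample = mmlu_pro[0]
    print(sample["question"])
    print(len(sample["options"]), "candidate answers")  # 10 options rather than 4
    print("gold answer:", sample["answer"])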

📚 GPQA (Google-Proof Q&A Benchmark, paper). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, etc.) to be hard to answer by laypersons but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce contamination risks. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).

💭 MuSR (Multistep Soft Reasoning, paper). MuSR is a very fun new dataset made of algorithmically generated complex problems of around 1K words in length. The problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long-range context parsing. Few models score better than random performance.

🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper). MATH is a compilation of high-school-level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.
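
To give a feel for how strict that format is: MATH solutions conventionally wrap the final result in \boxed{...}, so a grader first has to pull that expression out of the generation before comparing it to the reference. The helper below is a small, hypothetical sketch of such an extraction step; the prompt and parsing actually used by the leaderboard's evaluation suite may differ.

    def last_boxed_answer(solution: str) -> str | None:
        """Extract the content of the last \\boxed{...} in a LaTeX solution.

        Walks the braces manually so nested groups such as \\boxed{\\frac{3}{4}} work.
        """
        start = solution.rfind(r"\boxed{")
        if start == -1:
            return None
        i, depth = start + len(r"\boxed{"), 1
        content = []
        while i < len(solution) and depth > 0:
            char = solution[i]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    break
            content.append(char)
            i += 1
        return "".join(content) if depth == 0 else None

    print(last_boxed_answer(r"The area is therefore \boxed{\frac{3}{4}}."))  # \frac{3}{4}
    print(last_boxed_answer("No boxed answer here."))                        # None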

🤝 IFEval (Instruction Following Evaluation, paper). IFEval is a fairly interesting dataset that tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions rather than the actual contents generated, allowing strict and rigorous metrics to be used.
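
To show what such strict metrics can look like in practice, here is a toy sketch of two rule-based checkers in the spirit of IFEval's verifiable instructions. The real benchmark defines its own, much larger set of checkers and its own aggregation; the function names and scoring below are illustrative assumptions only.

    import json

    def includes_keyword(response: str, keyword: str) -> bool:
        """Checker for an instruction like: include the keyword <keyword>."""
        return keyword.lower() in response.lower()

    def is_valid_json(response: str) -> bool:
        """Checker for an instruction like: format your whole answer as valid JSON."""
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False

    response = '{"summary": "The leaderboard now uses six benchmarks."}'
    checks = [includes_keyword(response, "leaderboard"), is_valid_json(response)]
    print(f"instructions followed: {sum(checks)}/{len(checks)}")  # 2/2

Because each check is a deterministic rule on the output's form, no judge model or fuzzy matching is needed, which is exactly what makes the metric strict.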

🧮 🤝 BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, in the sense that language models did not originally outperform human baselines on them, and 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has been, on average, well correlated with human preference. We expect this dataset to provide exciting insights into specific capabilities which could interest people.
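
For a sense of what an objective metric means here, below is a toy item in the style of BBH's boolean_expressions subtask, scored with simple exact match. The question, gold label, and parsing are invented for illustration and do not reproduce the actual prompts or answer extraction.

    # Toy BBH-style boolean_expressions item, scored by exact match on the answer.
    question = "not ( True and False ) or False is"
    gold = "True"  # not (True and False) -> True; True or False -> True

    model_output = " True"  # hypothetical model continuation after the prompt
    is_correct = model_output.strip() == gold
    print(is_correct)  # True -> counts toward the task's accuracy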


Why did we choose these subsets?

In summary, our criteria were:

  1. Evaluation quality:
  2. Reliability and fairness of metrics:
  3. General absence of contamination in models as of today: