NEW! Open LLM Leaderboard 2023 fall update

#356
by clefourrier HF staff - opened
Open LLM Leaderboard org

[Animated illustration: new leaderboard update]

We spent A YEAR of GPU time for the biggest update of the Open LLM Leaderboard yet! 🤯

With @SaylorTwift, we added 3 new benchmarks from the great EleutherAI harness 💥 and re-ran 2000+ models on them! 🚀

🤔 Why?
Our initial evaluations were multiple-choice Q/A datasets:

  • 📚 MMLU, knowledge across many domains
  • 📚 👩‍🔬 ARC, grade-school science questions
  • HellaSwag, choosing the most plausible next step in a sequence of actions
  • 📚👻 TruthfulQA, resistance to common falsehoods and misconceptions

So... mostly knowledge and some reasoning.
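
For context, these multiple-choice evals are scored by comparison: the model never writes an answer, it is only asked which of the suggested choices it finds most likely, via the log-likelihood it assigns to each one. Below is a minimal sketch of that scoring scheme (the model name, the example question, and the helper are illustrative, not the leaderboard's actual code):

```python
# Minimal sketch of multiple-choice scoring by log-likelihood comparison.
# Illustrative only - not the leaderboard's or the harness's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of the log-probabilities the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict token i+1, so drop the last position and
    # keep only the rows that predict the continuation tokens
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_targets = full_ids[0, -cont_len:]
    return log_probs[-cont_len:].gather(1, cont_targets.unsqueeze(1)).sum().item()

question = "Question: Which gas do plants absorb for photosynthesis?\nAnswer:"
choices = [" Carbon dioxide", " Oxygen", " Nitrogen", " Helium"]
scores = [continuation_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])  # the choice the model finds most likely
```

This is also what makes the new generative evals below qualitatively different: there are no choices to compare, so the model has to produce the answer on its own.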

But we wanted:
🔭 model creators to get more information on their models' capabilities
🔎 model users to select models on the metrics relevant to them
⚖️ leaderboard rankings to be fairer

🤔 How?
We added 3 harder evaluations, covering new capabilities!

  1. DROP 💧
    Questions on Wikipedia paragraphs. It requires both 1) reading comprehension, to extract the relevant information, and 2) reasoning steps (subtractions, additions, comparisons, counting or sorting, ...) to solve the questions. Many models struggle with it!
    Contrary to the previous evals, it is generative: the model is not just looking at suggested choices, but must actually generate its own answer. That makes it more relevant for studying the actual reasoning capabilities of models in unconstrained setups.

  2. GSM8K 🧮
    Diverse grade-school math problems. Math was a highly expected and requested new capability to study, with good reason: current models still have a lot of room to improve on math, and it's a very exciting research direction!

  3. WinoGrande 🍷
    A multiple-choice, adversarial Winograd-style completion dataset.
    Each example is a sentence with a blank that must be filled with one of two candidate words, and the model must select the one that fits; the other word drastically changes the meaning of the sentence.
    It's a descendant of the historically significant Winograd Schema Challenge, for a long time one of the most difficult benchmarks ever! (A sketch of how to run these three tasks locally follows below.)
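
If you want to reproduce these runs on your own model, here is a minimal sketch using the Python API of the EleutherAI harness (the pre-refactor `simple_evaluate` entry point). The model name, few-shot counts, and batch size below are illustrative assumptions, not necessarily the leaderboard's exact settings:

```python
# Sketch: running the three new tasks locally with the EleutherAI lm-evaluation-harness
# (pre-refactor API). Model name, few-shot counts and batch size are assumptions for
# illustration, not necessarily the leaderboard's exact configuration.
from lm_eval import evaluator

MODEL_ARGS = "pretrained=mistralai/Mistral-7B-v0.1"  # placeholder model

# task -> number of in-context examples (assumed values)
FEWSHOT = {"winogrande": 5, "gsm8k": 5, "drop": 3}

for task, n_shot in FEWSHOT.items():
    results = evaluator.simple_evaluate(
        model="hf-causal-experimental",
        model_args=MODEL_ARGS,
        tasks=[task],
        num_fewshot=n_shot,
        batch_size=1,
    )
    print(task, results["results"][task])
```

Expect GSM8K and DROP to run much slower than the multiple-choice tasks: the model has to generate full answers instead of just scoring a handful of choices.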

🤔 What about the rankings?

  • 💪 Pretrained model rankings almost did not change! → Good models stay good, no matter the evals 🏅
  • 🌀 Fine-tuned models saw many rerankings in the 13B+ range, while IFT/RL models did not change that much, apart from the Hermes models (↓) and the Beluga/Trurl models (↑) → We hope this will help show which fine-tuning strategies work best across tasks!

🤔 Diving very deep into these benchmarks 👀
We've found interesting implementation questions (reminiscent of our blog post on MMLU: https://huggingface.co/blog/evaluating-mmlu-leaderboard).
Feel free to read more about it and join the discussion at https://github.com/EleutherAI/lm-evaluation-harness/issues/978 or here!

🤗 That's it!
We hope you find these new evals interesting, and learn more about your (favorite) models along the way!
Thank you very much for following the leaderboard. We'll keep upgrading it so it stays a useful resource for the community and further helps model progress 🚀

Many special thanks to:

  • @thomwolf for his insights and help 🤗
  • all the researchers who developed these evaluation datasets 🔥
  • the EleutherAI team for their great work on the harness 🚀
  • @Chunte for the gorgeous illustration ❤️
clefourrier pinned discussion

Awesome ❤️
Thank you all for doing this, and keeping us clued in on model performance 👍


Is there at least one evaluation that checks a model's proficiency in various languages?
I think it's very important for specific uses in different countries.
Thanks for this improved leaderboard!

Open LLM Leaderboard org

@Ostixe360 Hi!
We are planning to work on multilingual leaderboards with some partners in the coming months, but this is only at very early stages.
Being French myself, I 100% agree that we need to evaluate models on more than "just English" 😅

In the meantime, you can look at the Upstage leaderboard for Korean capabilities and the OpenCompass one for Chinese capabilities.

Regardless of evaluation results, it is currently pretty hard to find models in your language, or based on specific criteria like maximum VRAM size. I also tried Hugging Face's full-text model search, but it seems quite inefficient for this, unfortunately. While the original Hugging Face leaderboard does not let you filter by language, you can do so on this website: https://llm.extractum.io/list. Just left-click on the language column. It also pulls the Hugging Face leaderboard's average score for most models. Of course, those scores might be skewed, since the evaluations are English-only.

Thank both of you for your responses.

Does the eval use the custom instructions best suited to each model? For some models, such as Yi, using their custom instructions usually produces way better results.

Open LLM Leaderboard org

Hi @asdaweqw12 !
No, we don't allow custom instructions yet, but we will add them soon!

I notice some TruthfulQA scores are missing.

Just sort by worst scores to show them.

Open LLM Leaderboard org
โ€ข
edited Nov 14, 2023

@HenkPoley Thank you for reporting! It was a display problem, should be fixed! :)

deleted
This comment has been hidden

Mistral Dolphin 7B (2.0, 2.1, etc.) seems to be missing, as if it was deleted or made private. Who makes those kinds of decisions? I've noticed such models sometimes reappear soon after. Are they just being re-tested, or was there a possible flaw in the benchmark? Thanks.

Open LLM Leaderboard org

@Goldenblood56 they should still appear when you select the "Show gated/deleted/..." checkbox - I'm investigating why they went missing. If they don't show up there, please open a dedicated issue so we can keep track.

If a model's ability to be used as an agent can be judged through a metric, please add that metric as well: how good models are at choosing the correct tool for a task, parsing their own output, and adjusting their output to be sent as input to the next step.


Great idea!

@clefourrier I've noticed an issue in the DROP implementation by EleutherAI using the commit hash b281b0921b636bc36ad05c0b0b0763bd6dd43463. By default, all models continue generating text until the first "." (see this line), so without any filtering, the F1 metrics are computed using overly lengthy generated texts. For example, Mistral 7B generates 10\n\nPassage: The 2006-07 season was the 10th season for the New Orleans Hornets in the National Basketball Association for the first dataset example. Considering typical LLM behaviors, we should filter the answer using Passage: and calculate the scores using 10 instead of the entire generated text (which is the correct answer by the way). Please see a similar filter used in GSM8K from EleutherAI here.
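
(To make the effect described above concrete, here is a rough sketch of a bag-of-words F1 - a simplified stand-in for DROP's official answer scoring - computed on the raw generation versus the same generation truncated at "Passage:". The helper is illustrative only, not the harness's actual scorer.)

```python
# Rough illustration of how an untruncated generation drags down a token-overlap F1.
# Simplified stand-in for DROP's official scorer - not the harness's actual code.
import re

def bag_of_words_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    gold_tokens = re.findall(r"\w+", gold.lower())
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t)) for t in set(gold_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "10"
raw = ("10\n\nPassage: The 2006-07 season was the 10th season for the "
       "New Orleans Hornets in the National Basketball Association")
filtered = raw.split("Passage:")[0].strip()  # the kind of filter proposed above

print(round(bag_of_words_f1(raw, gold), 3))       # ~0.095: the correct '10' is buried in extra text
print(round(bag_of_words_f1(filtered, gold), 3))  # 1.0: the filtered answer matches the gold
```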

Open LLM Leaderboard org
โ€ข
edited Nov 30, 2023

@binhtang (and cc @Phil337 since you've been concerned about this too)
Thank you for your report!
We've spent the last few weeks investigating the DROP scores in more detail, and found concerning issues in the scoring - what you just highlighted is not the only problem with the DROP metric.
We'll publish a blog post about it very soon and update the leaderboard accordingly.

If you visit https://llm.extractum.io/list/?lbonly, you'll find a comprehensive list of our top models, along with a whole host of other parameters and metrics. Clicking on a specific model allows you to delve deeper into its internals and parameters, including its performance on other benchmarks.

Open LLM Leaderboard org

@gregzem
Nice visualization! What do the icons next to the model names correspond to?

clefourrier unpinned discussion
clefourrier changed discussion status to closed
