Normalization for MMLU-Pro doesn't make sense
Hi folks, I believe the way MMLU-Pro scores are normalized is not correct. At the moment, normalization is done under the assumption that every question has 10 choices (so the random baseline is 1/10), which is also what the MMLU-Pro paper claims. But after briefly inspecting the MMLU-Pro test set (https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/viewer/default/test), one can notice that only 83% of the questions have 10 choices (see the "options" histogram in the HF dataset viewer). The other 17% have anywhere from 3 to 9 choices.
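For reference, here is a quick snippet (my own sketch, not part of any leaderboard code) that reproduces the "options" histogram from the dataset viewer by counting how many choices each test question has:

```python
# Count how many choices each MMLU-Pro test question has.
# Uses only the public dataset and its "options" field, as shown in the viewer.
from collections import Counter
from datasets import load_dataset

test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
counts = Counter(len(row["options"]) for row in test)

total = sum(counts.values())
for n_choices in sorted(counts):
    share = 100 * counts[n_choices] / total
    print(f"{n_choices:2d} choices: {counts[n_choices]:5d} questions ({share:.1f}%)")
```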
You're right, about 17% of MMLU-Pro questions have fewer than ten options. We chose to use a ten-option normalisation as a practical way to maintain consistency, even though it doesn't perfectly fit every case.
If you have the time, we would welcome your thoughts on how to improve our normalisation calculations. How would you approach correcting MMLU-Pro normalisation?
EDIT: [Hi @alozowski, I still think that normalization of scores is the right approach. It's just that we should normalize per-question with 1/num_choices_for_that_question rather than normalizing the global score with 1/10.]
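To make that concrete, here is one way to read the per-question suggestion as code. It's a minimal sketch with placeholder names (is_correct, num_choices), not actual leaderboard fields: rescale each question against its own random baseline 1/N, then average.

```python
def normalize_per_question(is_correct: list[bool], num_choices: list[int]) -> float:
    """Rescale every question against its own random baseline 1/N, then average.

    0.0 corresponds to random guessing, 1.0 to a perfect score. Assumes every
    question has at least 2 choices (MMLU-Pro questions have 3 to 10).
    """
    per_question = [
        (float(correct) - 1.0 / n) / (1.0 - 1.0 / n)
        for correct, n in zip(is_correct, num_choices)
    ]
    return sum(per_question) / len(per_question)
```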
As for the implementation part, I haven't been able to find where this normalization step is implemented for OpenLLM Leaderboard. It is certainly not part of https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess
^^I wanted to propose that we normalize per group of questions that have the same number of choices.
So we would first group all questions with N choices and normalize their average score with the random baseline 1/N, doing this for each N in [1, 10].
Then we compute an average of these grouped-by-N averages and report it as the overall MMLU-Pro score. This way, questions with a varying number of choices are all normalized in the same way.
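A minimal sketch of that grouping idea, using the same placeholder names as above; it takes an unweighted average over the per-N group scores:

```python
from collections import defaultdict

def normalize_per_group(is_correct: list[bool], num_choices: list[int]) -> float:
    """Group questions by their number of choices N, normalize each group's
    accuracy with the random baseline 1/N, then average the group scores
    (unweighted, so every group counts equally)."""
    groups: dict[int, list[float]] = defaultdict(list)
    for correct, n in zip(is_correct, num_choices):
        groups[n].append(float(correct))

    group_scores = []
    for n, scores in groups.items():
        accuracy = sum(scores) / len(scores)
        baseline = 1.0 / n
        group_scores.append((accuracy - baseline) / (1.0 - baseline))

    return sum(group_scores) / len(group_scores)
```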
That's a good approach, thank you! I think we can implement this in the upcoming release, along with other fixes to the results calculations.
Thanks for taking the time to address the issue, @alozowski!
Let me leave this discussion open so that we take it into account in the next iteration of the Leaderboard - feel free to share any other ideas for score normalisation here if you want to.
After that, the different [#N-choice question] group scores need to be weighted, possibly depending on the number of such questions.
Question: is it better to
- give more weight (a multiplier) to the harder questions (those with more choices), and by how much, or
- keep the weight flat (the same, and fair) for all questions?
(I'm guessing the 2nd option) ...
@CombinHorizon I think the 2nd option, as that is also what is used in other benchmarks like GPQA, BBH, etc.
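For what it's worth, a tiny example shows how much the choice matters; the group sizes and per-group scores below are purely made-up numbers:

```python
# Hypothetical numbers, just to show how flat vs. size-weighted averaging differ.
sizes = {10: 830, 4: 170}          # group sizes by number of choices
norm_scores = {10: 0.50, 4: 0.60}  # already-normalized per-group scores

flat = sum(norm_scores.values()) / len(norm_scores)                             # 0.55
weighted = sum(sizes[n] * norm_scores[n] for n in sizes) / sum(sizes.values())  # ~0.517
print(flat, weighted)
```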
@alozowski Also, I wanted to mention that if the codebase for normalization is available somewhere, I could push a PR for this (in case you folks don't have spare cycles to work on it right now).
Hi @ekurtic,
We're quite busy at the moment, so your help will be very welcome, thanks! The score normalization is done during the results parsing process, and the parser code is private for now, as we plan to refactor it. But we have just added a Colab notebook that you can copy and make your changes there (link to the normalization doc).
Closing for inactivity, feel free to reopen if needed!