Normalization for MMLU-Pro doesn't make sense

#947
by ekurtic - opened

Hi folks, I believe the way MMLU-Pro scores are normalized is not correct. At the moment, normalization is done under the assumption that every question has 10 choices (so the random baseline is 1/10), which is also what the MMLU-Pro paper claims. But after briefly inspecting the MMLU-Pro test set (https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/viewer/default/test), one can notice that only 83% of the questions actually have 10 choices (see the "options" histogram in the HF dataset viewer). The other 17% of questions have anywhere from 3 to 9 choices.
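To make the issue concrete, here is a minimal sketch (not the leaderboard's actual code; the `normalize` helper is hypothetical) of the usual baseline-rescaling step, showing how the assumed baseline changes the result for a question with fewer choices:

```python
# Minimal sketch of baseline rescaling (hypothetical helper, not leaderboard code):
# random guessing maps to 0 and a perfect score maps to 1.
def normalize(raw_acc: float, random_baseline: float) -> float:
    return max(0.0, (raw_acc - random_baseline) / (1.0 - random_baseline))

# Current assumption: every MMLU-Pro question has 10 choices, so the baseline is 1/10.
print(normalize(0.45, random_baseline=1 / 10))  # ~0.389

# A question with only 4 options has a random baseline of 1/4, so the same raw
# accuracy should be rescaled against 1/4 instead.
print(normalize(0.45, random_baseline=1 / 4))   # ~0.267
```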

Open LLM Leaderboard org

Hi @ekurtic ,

It's a very interesting question! Let us think about it and we'll get back to you.

Open LLM Leaderboard org

You're right – about 17% of MMLU-Pro questions have fewer than ten options. We chose ten-option normalisation as a practical way to maintain consistency, even though it doesn't perfectly fit every case.

If you have the time, we would welcome your thoughts on how to improve our normalisation calculations. How would you approach correcting MMLU-Pro normalisation?

EDIT: [Hi @alozowski , I still think that normalizing scores is the right approach; we should just normalize per question with 1/num_choices_for_that_question rather than normalizing the global score with 1/10.]
As for the implementation part, I haven't been able to find where this normalization step is implemented for the Open LLM Leaderboard. It is certainly not part of https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess

^^ I wanted to propose that we normalize per group of questions with the same number of choices.

So we would first group all questions with N choices and normalize their average score against the 1/N random baseline, doing this for each N ∈ [1, 10].
Then we can compute the average of these grouped-by-N averages and report it as the overall MMLU-Pro score. This way we make sure that questions with a varying number of choices are normalized consistently.
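Roughly, the proposal would look something like this (just a sketch; the `results` layout is hypothetical and not the harness's real output format):

```python
from collections import defaultdict

def mmlu_pro_grouped_score(results):
    """`results` is a list of (num_choices, is_correct) pairs, one per question."""
    groups = defaultdict(list)
    for num_choices, is_correct in results:
        groups[num_choices].append(float(is_correct))

    normalized_group_means = []
    for n, outcomes in groups.items():
        acc = sum(outcomes) / len(outcomes)  # raw accuracy over the N-choice questions
        baseline = 1.0 / n                   # random baseline for this group
        normalized_group_means.append(max(0.0, (acc - baseline) / (1.0 - baseline)))

    # Average of the per-N normalized averages is reported as the overall score.
    return sum(normalized_group_means) / len(normalized_group_means)
```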

Open LLM Leaderboard org

That's a good approach, thank you! I think we can implement this in the upcoming release, along with other fixes to the results calculations.

Thanks for taking the time to address the issue @alozowski !

Open LLM Leaderboard org

Let me leave this discussion open so that we take it into account in the next Leaderboard release – feel free to share any other ideas for score normalisation here if you want to.

After that, the per-[N-choice] group scores are combined, weighted depending on the number of those questions.

Question: is it better to give

  1. more weight (a multiplier) to the harder questions (those with more choices) – and by how much – or
  2. to keep it flat (the same, and fair) for all questions?

(I'm guessing the 2nd option) ...

@CombinHorizon I think the 2nd option, as that is also what's used in other benchmarks like GPQA, BBH, etc.
@alozowski Also, I wanted to mention that if the normalization codebase is available somewhere, I could push a PR for this (in case you folks don't have spare cycles to work on it right now).

Open LLM Leaderboard org

Hi @ekurtic ,

We're quite busy at the moment, so your help would be very welcome, thanks! The score normalization is done during the results parsing process, and the parser code is private for now as we plan to refactor it. But we have just added a Colab notebook that you can copy and make changes in (link to the normalization doc).

Open LLM Leaderboard org

Closing for inactivity, feel free to reopen if needed!

clefourrier changed discussion status to closed
