Normalization for MMLU-Pro doesn't make sense
Hi folks, I believe the way MMLU-Pro scores are normalized is not correct. At the moment, normalization is done under the assumption that every question has 10 choices (so the random baseline is 1/10), which is also what the MMLU-Pro paper claims. But after briefly inspecting the MMLU-Pro test set (https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/viewer/default/test), one can notice that only 83% of the questions have 10 choices (see the "options" histogram in the HF dataset viewer). The other 17% have anywhere from 3 to 9 choices.
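For reference, here is a quick snippet (my own sketch, not part of any leaderboard code) that reproduces the "options" histogram from the dataset viewer by counting how many choices each test question has:

```python
# Count how many choices each MMLU-Pro test question has.
# Uses only the public dataset and its "options" field, as shown in the viewer.
from collections import Counter
from datasets import load_dataset

test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
counts = Counter(len(row["options"]) for row in test)

total = sum(counts.values())
for n_choices in sorted(counts):
    share = 100 * counts[n_choices] / total
    print(f"{n_choices:2d} choices: {counts[n_choices]:5d} questions ({share:.1f}%)")
```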
You're right, about 17% of MMLU-Pro questions have fewer than ten options. We chose to use a ten-option normalisation as a practical way to maintain consistency, even though it doesn't perfectly fit every case.
If you have the time, we would welcome your thoughts on how to improve our normalisation calculations. How would you approach correcting MMLU-Pro normalisation?
EDIT: [Hi @alozowski, I still think that normalization of scores is the right approach. It's just that we should normalize per-question with 1/num_choices_for_that_question rather than normalizing the global score with 1/10.]
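To make that concrete, here is one way to read the per-question suggestion as code. It's a minimal sketch with placeholder names (is_correct, num_choices), not actual leaderboard fields: rescale each question against its own random baseline 1/N, then average.

```python
def normalize_per_question(is_correct: list[bool], num_choices: list[int]) -> float:
    """Rescale every question against its own random baseline 1/N, then average.

    0.0 corresponds to random guessing, 1.0 to a perfect score. Assumes every
    question has at least 2 choices (MMLU-Pro questions have 3 to 10).
    """
    per_question = [
        (float(correct) - 1.0 / n) / (1.0 - 1.0 / n)
        for correct, n in zip(is_correct, num_choices)
    ]
    return sum(per_question) / len(per_question)
```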
As for the implementation part, I haven't been able to find where this normalization step is implemented for OpenLLM Leaderboard. It is certainly not part of https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess
^^I wanted to propose that we normalize per group of questions that have the same number of choices.
So we would first group all questions with N choices and normalize their average score with the random baseline 1/N, doing this for each N in [1, 10].
Then we compute an average of these grouped-by-N averages and report it as the overall MMLU-Pro score. This way, questions with a varying number of choices are all normalized in the same way.
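A minimal sketch of that grouping idea, using the same placeholder names as above; it takes an unweighted average over the per-N group scores:

```python
from collections import defaultdict

def normalize_per_group(is_correct: list[bool], num_choices: list[int]) -> float:
    """Group questions by their number of choices N, normalize each group's
    accuracy with the random baseline 1/N, then average the group scores
    (unweighted, so every group counts equally)."""
    groups: dict[int, list[float]] = defaultdict(list)
    for correct, n in zip(is_correct, num_choices):
        groups[n].append(float(correct))

    group_scores = []
    for n, scores in groups.items():
        accuracy = sum(scores) / len(scores)
        baseline = 1.0 / n
        group_scores.append((accuracy - baseline) / (1.0 - baseline))

    return sum(group_scores) / len(group_scores)
```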
That's a good approach, thank you! I think we can implement this in the upcoming release, along with other fixes to the results calculations.
Thanks for taking the time to address the issue, @alozowski!
Let me leave this discussion open so that we take it into account in the next iteration of the Leaderboard - feel free to share any other ideas for score normalisation here if you want to.
After that, the different [#N-choice question] group scores need to be weighted, possibly depending on the number of such questions.
Question: is it better to
- give more weight (a multiplier) to the harder questions (those with more choices), and by how much, or
- keep the weight flat (the same, and fair) for all questions?
(I'm guessing the 2nd option) ...
@CombinHorizon I think the 2nd option, as that is also what is used in other benchmarks like GPQA, BBH, etc.
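For what it's worth, a tiny example shows how much the choice matters; the group sizes and per-group scores below are purely made-up numbers:

```python
# Hypothetical numbers, just to show how flat vs. size-weighted averaging differ.
sizes = {10: 830, 4: 170}          # group sizes by number of choices
norm_scores = {10: 0.50, 4: 0.60}  # already-normalized per-group scores

flat = sum(norm_scores.values()) / len(norm_scores)                             # 0.55
weighted = sum(sizes[n] * norm_scores[n] for n in sizes) / sum(sizes.values())  # ~0.517
print(flat, weighted)
```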
@alozowski Also, I wanted to mention that if the codebase for normalization is available somewhere, I could push a PR for this (in case you folks don't have spare cycles to work on it right now).
Hi @ekurtic,
We're quite busy at the moment, so your help will be very welcome, thanks! The score normalization is done during the results parsing process, and the parser code is private for now, as we plan to refactor it. But we have just added a Colab notebook that you can copy and make your changes there (link to the normalization doc).
Closing for inactivity, feel free to reopen if needed!