MMLU doesn't match on lm-evaluation-harness

#2
by yixinsong - opened

I evaluated the 1.7B model with the lm-evaluation-harness framework.
[screenshot: lm-evaluation-harness MMLU results]

I am curious what causes the performance difference between lighteval and lm-evaluation-harness.

yixinsong changed discussion status to closed

Same question.

Hugging Face TB Research org

Hi, we use a different implementation of MMLU: the cloze version instead of MC (multiple choice), where we consider the log probabilities of the entire answer sequences instead of just the single answer letters. You can find more details about this in this blog post and in Appendix G.2 of this paper.
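To make the distinction concrete, here is a minimal sketch (not the lighteval or lm-evaluation-harness code) of how the two formulations score a question with a causal LM. The model name, prompt templates, and length normalization are illustrative assumptions, not the exact setup used for our reported numbers.

```python
# Sketch: MC vs. cloze scoring of a multiple-choice question with a causal LM.
# Assumptions: model name, prompt formats, and token-length normalization are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM-1.7B"  # assumed checkpoint; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    cont_len = full_ids.shape[1] - ctx_ids.shape[1]
    cont_ids = full_ids[0, -cont_len:]
    # Each continuation token is predicted from the position just before it.
    token_logprobs = logprobs[0, -cont_len - 1 : -1].gather(-1, cont_ids.unsqueeze(-1))
    return token_logprobs.sum().item()

question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]

# MC formulation: the prompt lists lettered options and only the answer letter is scored.
mc_prompt = question + "\n" + "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices)) + "\nAnswer:"
mc_scores = [continuation_logprob(mc_prompt, f" {l}") for l in "ABCD"]

# Cloze formulation: the full answer text is scored (length-normalized here), no letters involved.
cloze_prompt = question + "\nAnswer:"
cloze_scores = [
    continuation_logprob(cloze_prompt, " " + c) / len(tok(" " + c).input_ids)
    for c in choices
]

print("MC pick:   ", "ABCD"[mc_scores.index(max(mc_scores))])
print("Cloze pick:", choices[cloze_scores.index(max(cloze_scores))])
```

The two formulations can rank models quite differently, especially for smaller models that have not yet learned to map answers onto option letters, which is why the scores from the two harnesses are not directly comparable.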

To reproduce our results you can use the guidelines here: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu#evaluation
