This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions

Thank you for sharing.

Some common benchmarks like MMLU typically use a 5-shot setting to measure a model's in-context learning capabilities.

Can you explain why your MMLU evaluations instead use a zero-shot approach that scores the full option content?

According to your blog, in this setup your model's MMLU scores are higher than those of Qwen1.5B and the Phi models, whereas in 5-shot evaluations the conclusion is the opposite. Is this reasonable? Thank you.

Hugging Face TB Research org

The difference comes from the MMLU prompt implementation rather than from 0-shot vs 5-shot. Each answer to an MMLU question is labeled with a letter from A to D. The leaderboard uses the MCF (multiple-choice formulation) version, where the model has to return the letter corresponding to the right answer, whereas in the cloze version (which we use) we compute log probs over the full answers, not just the single letters. Most small, non-instruction-tuned models don't seem to be able to match answers to their corresponding letters and give an almost random score (0.25) when using MCF, so the cloze version gives more signal.
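To make the two formulations concrete, here is a rough, illustrative sketch of cloze-style scoring with transformers (not the actual leaderboard or lighteval code): for each question, sum the log probabilities of each full answer string conditioned on the prompt and pick the highest-scoring option. The checkpoint name, the `answer_logprob` helper, and the example question are placeholders; a real harness also handles few-shot formatting and length normalization.

```python
# Illustrative sketch of cloze-style scoring (not the leaderboard's code).
# The checkpoint name and helper are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceFW/ablation-model-fineweb-edu"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log probs of the answer tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log prob of each token given the preceding ones.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation (answer) positions; this ignores tokenizer
    # boundary effects that a real harness handles more carefully.
    return token_lp[0, prompt_len - 1:].sum().item()

prompt = "Question: What is the capital of France?\nAnswer:"
options = ["Paris", "London", "Berlin", "Madrid"]
scores = [answer_logprob(prompt, opt) for opt in options]
print(options[scores.index(max(scores))])  # model's cloze prediction
```

In the MCF setup, by contrast, only the scores of the single letters A to D are compared, which is where small base models tend to collapse to chance.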

In the cloze version the model outperforms Qwen1.5B and Phi for both 0-shot and 5-shot. You can find the guidelines to reproduce our scores here: https://huggingface.co/HuggingFaceFW/ablation-model-fineweb-edu#evaluation

You can find more details about this in this blog post: https://huggingface.co/blog/open-llm-leaderboard-mmlu#1001-flavors-of-mmlu, and in these papers: https://arxiv.org/pdf/2406.08446 and appendix G.2 of https://arxiv.org/pdf/2406.11794

