Re-evaluate Qwen2.5-72B-Instruct

#975
by ChuckMcSneed - opened

[Screenshot: leaderboard scores]
The MATH score is for some reason very low compared to the base model and other Qwen family models. I suspect that something went wrong during evaluation.

Open LLM Leaderboard org
edited Oct 11

Hi @ChuckMcSneed ,

Thank you for pointing it out!

The low MATH score for Qwen2.5-72B-Instruct is due to the model not following the expected answer format: `Final Answer: The final answer is(.*?). I hope it is correct.` As you can see in the evaluation details, answers that don't match this format aren't counted as correct, even when they are mathematically correct.
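
To illustrate, here's a minimal sketch of that format check, assuming a Minerva-style extraction regex and a simplified exact-match comparison (the actual leaderboard scoring also normalizes and compares answers symbolically, and may differ in detail):

```python
import re

# Pattern in the style of the expected format above (assumption: the real
# harness code may differ slightly).
ANSWER_PATTERN = re.compile(
    r"Final Answer: The final answer is(.*?)\. I hope it is correct\.",
    re.DOTALL,
)

def extract_final_answer(generation: str):
    """Return the text between the two fixed phrases, or None if the model
    did not follow the expected answer format."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(1).strip() if match else None

def score(generation: str, gold: str) -> int:
    """Simplified scoring: a mathematically correct answer still scores 0
    if the surrounding format is missing."""
    pred = extract_final_answer(generation)
    # Exact string match is used here only to keep the sketch short.
    return int(pred is not None and pred == gold)

# Correct math, wrong format -> 0
print(score("The answer is 42.", "$42$"))
# Correct math, expected format -> 1
print(score("Final Answer: The final answer is $42$. I hope it is correct.", "$42$"))
```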

Our current evaluation uses few-shot prompts (where the correct answer format is shown in the preceding examples). This approach follows the Minerva implementation of MATH (arXiv paper) and assesses both mathematical ability and the ability to do in-context learning. We expect highly capable models to do in-context learning correctly and to follow a provided answer format, but we've observed that heavily instruction-tuned models, like Qwen2.5 and Llama3.2, tend to lose that capability.
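
As a rough illustration of how the few-shot prompt exposes the target format, here is a hypothetical Minerva-style prompt builder (the field names, wording, and number of shots are assumptions, not the leaderboard's exact prompt):

```python
# Hypothetical few-shot prompt construction in the Minerva MATH style.
FEWSHOT_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Solution:\n{solution}\n"
    "Final Answer: The final answer is ${answer}$. I hope it is correct.\n\n"
)

def build_prompt(fewshot_examples, test_problem: str) -> str:
    """Prepend worked examples that demonstrate the expected answer format,
    then ask the model to solve the test problem in the same style."""
    prompt = "".join(FEWSHOT_TEMPLATE.format(**ex) for ex in fewshot_examples)
    prompt += f"Problem:\n{test_problem}\n\nSolution:\n"
    return prompt

# Example with a single worked example; real evaluation uses several shots.
examples = [{
    "problem": "What is 2 + 2?",
    "solution": "Adding the two numbers gives 4.",
    "answer": "4",
}]
print(build_prompt(examples, "What is 3 + 5?"))
```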

However, we recognise the limitations of our current approach, and we're considering loosening the format requirements in future iterations to focus on true mathematical capability.

Feel free to ask any questions!

Okay, that explains it, thanks.

ChuckMcSneed changed discussion status to closed
