Re-evaluate Qwen2.5-72B-Instruct
Hi @ChuckMcSneed ,
Thank you for pointing it out!
The low MATH score for Qwen2.5-72B-Instruct is due to the model not following the expected answer format: `Final Answer: The final answer is(.*?). I hope it is correct.` As you can see in the evaluation details, even if mathematically correct, answers not matching this format aren't counted.
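For illustration, answer extraction in a Minerva-MATH-style setup works roughly like the sketch below (a hedged approximation, not the harness's exact code): a regex anchored on the expected phrasing pulls out the final answer, and outputs that don't match yield nothing to score.

```python
import re

# Approximate extraction pattern based on the expected answer format
# (a sketch; the evaluation harness's actual implementation may differ).
ANSWER_PATTERN = re.compile(r"Final Answer: The final answer is(.*?)\. I hope it is correct\.")

def extract_answer(model_output):
    """Return the final answer if the output follows the expected format, else None."""
    match = ANSWER_PATTERN.search(model_output)
    return match.group(1).strip() if match else None

# A response in the expected format is extracted and can be scored...
print(extract_answer("Final Answer: The final answer is 42. I hope it is correct."))  # "42"
# ...while a mathematically correct answer in a different format is not.
print(extract_answer("The answer is 42."))  # None
```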
Our current evaluation uses few-shot prompts (where the correct answer format is shown in preceding examples). This approach aligns with the Minerva implementation of MATH (arxiv paper) and assesses both mathematical ability and the ability to do in-context learning. We expect high-capability models to do in-context learning correctly and to follow a provided answer format, but we've observed that models which were instruction-tuned too heavily, like Qwen2.5 and Llama3.2, were losing that capability.
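To make the few-shot setup concrete, here's a hedged sketch (not the leaderboard's actual prompt template) of how such a prompt can be built: each in-context example demonstrates the expected answer format, and the model is expected to reproduce it for the final problem.

```python
# Illustrative few-shot prompt template (assumed structure, not the real one).
FEW_SHOT_PROMPT = """\
Problem: What is 2 + 2?
Solution: Adding the two numbers gives 4.
Final Answer: The final answer is 4. I hope it is correct.

Problem: Simplify 6/3.
Solution: Dividing 6 by 3 gives 2.
Final Answer: The final answer is 2. I hope it is correct.

Problem: {problem}
Solution:"""

# The target question is appended after the in-context examples.
prompt = FEW_SHOT_PROMPT.format(problem="Compute 5^2 - 3.")
print(prompt)
```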
However, we recognise the limitations of our current approach, and we're considering loosening the format requirements in future iterations to focus on true mathematical capabilities.
Feel free to ask any questions!
Okay, that explains it, thanks.