Gemma-2-9B-it scores
I think something is wrong with Gemma-2-9B-it's MMLU-Pro score?
You can see in TIGER-Lab/MMLU-Pro that Gemma-2-9B-it slightly beats Phi3-Medium
But that is not the case on this leaderboard
There is something wrong with the MATH score as well: 0? Not possible.
Bump (Leaving a comment because I want a notification once more information is available).
Hi all!
Happy to say that we (most likely) found the problem! (Seems to work on the base model for the subsets I tested)
At this line of our harness fork, we needed to apply a patch (merged into the harness's main branch two weeks ago) so that Gemma-2 models also systematically start their evaluation with a prepended BOS token.
I just need to restart the evals and we should be getting updated results very soon.
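For anyone curious what the fix amounts to, here is a minimal sketch, assuming the standard `transformers` tokenizer API; the model id and prompt are just for illustration, and the exact harness code differs:

```python
# Sketch: Gemma-2 models expect a BOS token at the start of every
# prompt; if the harness tokenizes without it, scores collapse.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

prompt = "Question: What is 2 + 2?\nAnswer:"

# Patched path: special tokens are added, so the sequence starts
# with the BOS token (Gemma tokenizers prepend BOS by default).
with_bos = tokenizer(prompt, add_special_tokens=True)["input_ids"]

# Buggy path: no special tokens, so no BOS token is prepended,
# which badly degrades Gemma-2's outputs.
without_bos = tokenizer(prompt, add_special_tokens=False)["input_ids"]

print(with_bos[0] == tokenizer.bos_token_id)     # True
print(without_bos[0] == tokenizer.bos_token_id)  # False
```

The gist is that the check for whether to prepend BOS has to cover Gemma-2 as well, rather than relying on every code path hitting the tokenizer's default behavior.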
That line is Gemma-specific; would this mean that non-Gemma models aren't affected by this problem? That's good news, can't wait to see the actual results.
Yep, Gemma models are a bit fickle if you don't launch them exactly as expected - it might also be affecting the RecurrentGemma models, which we are looking at atm
Hi! New results should be there! Thanks for your patience and the report! :)