Gemma-2-9B-it scores

#843
by saishf - opened

I think something is wrong with Gemma-2-9B-it's MMLU-Pro score?
You can see in TIGER-Lab/MMLU-Pro that Gemma-2-9B-it slightly beats Phi3-Medium:

[screenshot: TIGER-Lab/MMLU-Pro leaderboard]

But that is not the case on this leaderboard:

[screenshot: Open LLM Leaderboard results]

Open LLM Leaderboard org

Hi @saishf,

Thank you for your message! Yes, we are now checking why both gemma-2-9b-it and gemma-2-9b have such low scores. I'll be back with an answer as soon as possible.

There is something wrong with the MATH score as well - a score of 0? Not possible.


Bump (leaving a comment because I want a notification once more information is available).

Open LLM Leaderboard org

Hi all!
Happy to say that we (most likely) found the problem! (The fix seems to work on the base model for the subsets I tested.)

At this line of our harness fork, we needed to add a patch (merged into the harness's main branch two weeks ago) so that Gemma-2 models also systematically start their evaluation with a prepended BOS token.

I just need to restart the evals and we should be getting updated results very soon.
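For readers curious what "always start with a BOS token" means in practice, here is a minimal sketch, not the harness's actual patch; the model name and prompt are only illustrative, and the gemma-2-9b-it repo is gated on the Hub:

```python
# Minimal sketch of prepending <bos> for Gemma-2 (illustrative, not the
# actual harness patch; google/gemma-2-9b-it is a gated repo on the Hub).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
prompt = "Question: What is 2 + 2?\nAnswer:"

# If a harness tokenizes with add_special_tokens=False, nothing prepends
# <bos>, so the harness has to add the token itself:
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
if ids[0] != tokenizer.bos_token_id:
    ids = [tokenizer.bos_token_id] + ids

print(tokenizer.decode(ids[:3]))  # the sequence now starts with <bos>
```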

That line is Gemma-specific, so would this mean that non-Gemma models aren't affected by this problem? That's good news; can't wait to see the actual results.

Open LLM Leaderboard org

Yep, Gemma models are a bit fickle if you don't launch them exactly as expected - this might also be affecting the RecurrentGemma models, which we are looking at atm
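One quick way to see whether a checkpoint's tokenizer already prepends BOS on its own is a check like the sketch below; the repo names are assumptions (both are gated on the Hub), not the actual models the team is re-evaluating:

```python
# Illustrative check: does the tokenizer prepend <bos> by default?
# Repo names are assumptions; both are gated repos on the Hub.
from transformers import AutoTokenizer

for name in ["google/gemma-2-9b", "google/recurrentgemma-9b"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("hello")["input_ids"]
    print(name, "prepends BOS by default:", ids[0] == tok.bos_token_id)
```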

Open LLM Leaderboard org

Hi! New results should be there! Thanks for your patience and the report! :)

clefourrier changed discussion status to closed
