Parameter count display issue

#56
by SicariusSicariiStuff - opened

Hello,

LLAMA-3_8B_Unaligned_BETA is shown as 3B instead of 8B,

Not really critical, but if you have the time, please update it to 8B.
Thank you.

Sigh. I really need to transition this leaderboard into an automated system. I evaluated the model thinking it was a 3B and set the testing up incorrectly. Its updated results are up. I might have time this December to look into a better evaluation method.

DontPlanToEnd changed discussion status to closed

I'm a little bit confused; the results now seem completely different. Is that due to random factors?
Pol went down from 33 to 16.7, Unruly went up from 38.3 to 48.3, and so on...

See https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA/resolve/main/Images/LLAMA-3_8B_Unaligned_BETA_UGI.png

Are the questions different for 3B and 8B models?

As I've done my model rankings, I've tried to make them reflect how I and the average person use models. I have always used Q4_K_M quants, and it sounded like most other average LLM users don't run Q8 or unquanted models, so I ended up going with Q4_K_M for my tests. Though when I started, I always heard that small models are much more affected by quantization than large models, so I arbitrarily made a rule that models 4B and below would be tested at a Q8 quant. As this leaderboard has grown from just an LLM ranking in a Reddit post, I've had to hold it to more scrutiny and change a lot of the ways I test models. That rule is definitely something I'm unhappy about and am planning on getting rid of soon. Even if very small models are more affected by quantization, I feel like it's surely fairer to have all models tested at the same quant size.
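Roughly, the size-based rule I've been using looks like this (just an illustrative sketch in Python, not the actual leaderboard code; the 4B cutoff and quant names are the ones described above):

```python
# Illustrative sketch of the size-based quant rule described above.
# Not the actual leaderboard code; the cutoff and quant names mirror the rule I set early on.
def pick_test_quant(param_count_billions: float) -> str:
    """Pick which GGUF quant a model gets tested at, based on its parameter count."""
    if param_count_billions <= 4:
        return "Q8_0"      # very small models: higher-precision quant
    return "Q4_K_M"        # everything else: the quant most people actually run

# Mislabeling the 8B model as a 3B put it on the wrong side of this cutoff,
# so it was initially tested at a different quant than it should have been.
print(pick_test_quant(3))  # Q8_0
print(pick_test_quant(8))  # Q4_K_M
```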
Also, I'd be fine with testing all models at full size instead of Q4_K_M if I didn't have to test large models, but as just a random person using my own money to test models on RunPod, I really don't want to have to download and test unquanted 123B and 405B models. And I have no reason to believe that models won't keep getting larger.
I'm really having to vent all the leaderboard's issues right now, haha. Fluctuations in the results for the 5 categories are usually averaged out into a somewhat more stable UGI score, but the categories individually suffer from not having enough questions to be very stable. Increasing the number of questions is another thing an automated system would really help with. I'm a bit concerned that automated testing could create new issues with false positives/false negatives, but I'm sure there's a way to do it right.
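To give a rough sense of how much a small question count matters, here's a quick back-of-the-envelope sketch (the pass rate and question counts are made-up numbers, not the leaderboard's real ones):

```python
# Back-of-the-envelope look at why per-category scores jump around with few questions.
# The pass rate and question counts below are made-up numbers, not real leaderboard values.
import math

def score_std_error(pass_rate: float, num_questions: int) -> float:
    """Standard error (in percentage points) of a category score estimated
    from num_questions independent pass/fail questions."""
    return 100 * math.sqrt(pass_rate * (1 - pass_rate) / num_questions)

for n in (20, 60, 200):
    print(f"{n:>3} questions: +/- {score_std_error(0.4, n):.1f} points (one standard deviation)")

# With only ~20 questions, a category score can easily swing by ~10 points between runs,
# which is why averaging the categories into a single UGI score looks more stable.
```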
