What's the difference between the 'acc' and 'acc_norm' metrics?

#578
by feiba54 - opened

In some benchmarks, the acc_norm score is significantly higher than acc. How is acc_norm calculated? What is it normalized on?

Open LLM Leaderboard org

Hi!
It's normalized on the length of the answer.

As a very simplified example, imagine you have:

Question: What color is usually an apple?
A. Red 
B. Purple
C. Blue

And the aggregated logprobs for the full sequence are 0.8 for A, 0.8 for B, and 0.1 for C.
The possible best choices are A or B, since they are tied.

When you normalize by length, you get 0.8/6 for A ("A. Red" is 6 characters) and 0.8/9 for B ("B. Purple" is 9 characters).
The best choice (higher normalized score) is now A.

In real situations, the normalization is usually done on the tokens of the sequence, and the score variation depends on how you pass the options (the full sequence? only the letter of the choice? ...).
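To make the toy example concrete, here is a minimal Python sketch of the two selection rules. The helper names `best_choices` and `length_normalize` are made up for illustration and are not taken from any evaluation library. The sketch normalizes by character count, as in the example; a real harness may count tokens or bytes instead. It also keeps the example's positive toy scores, whereas real log-probabilities are negative.

```python
def best_choices(scores):
    """Return the indices of every choice tied for the highest score."""
    top = max(scores)
    return [i for i, s in enumerate(scores) if s == top]


def length_normalize(scores, choices):
    """Divide each score by the character length of its choice.

    Assumption: character length, as in the toy example. A real
    evaluation may normalize by token or byte count instead.
    """
    return [s / len(c) for s, c in zip(scores, choices)]


choices = ["A. Red", "B. Purple", "C. Blue"]
# Toy values from the example; real log-probabilities are negative.
scores = [0.8, 0.8, 0.1]

raw_best = best_choices(scores)
norm_best = best_choices(length_normalize(scores, choices))

print("unnormalized:", [choices[i] for i in raw_best])
print("normalized:  ", [choices[i] for i in norm_best])
```

Under these assumptions, the unnormalized rule leaves A and B tied, while the normalized rule selects A alone. acc counts an example as correct when the unnormalized pick matches the gold answer, and acc_norm does the same with the normalized pick, which is why the two metrics can diverge.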

clefourrier changed discussion status to closed

Thanks for your detailed reply! However, I would still like to know: are the results reported in model release technical reports (like the Llama reports) normalized on answer length? I noticed that for some datasets the acc_norm score is significantly higher than acc, and only acc_norm matches the score reported in their paper.

feiba54 changed discussion status to open
Open LLM Leaderboard org

Hi!
You have to read each individual technical report of interest to you to see how they define the score they report.

clefourrier changed discussion status to closed
