open-llm-leaderboard/open_llm_leaderboard · What's the difference between 'acc' and 'acc_norm' metric?

Feb 4

In some benchmarks, the acc_norm score is significantly higher than acc. I'd like to know how acc_norm is calculated. What is it normalized on?

clefourrier

Open LLM Leaderboard org Feb 5

Hi!
It's normalized on the length of the answer.

As a very simplified example, imagine you have:

Question: What color is usually an apple?
A. Red 
B. Purple
C. Blue

And the aggregated logprobs for the full sequence are 0.8 for A, 0.8 for B, and 0.1 for C.
The possible best choices are A or B, since they are tied.

When you normalize by the length, you get 0.8/6 for A ("A. Red" is 6 characters) and 0.8/9 for B ("B. Purple" is 9 characters).
The best choice (highest normalized logprob) is now A.
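
To make the arithmetic concrete, here is a minimal Python sketch of the toy example. It reuses the numbers above as generic positive scores (real log-probabilities are never positive, so this is only an illustration), and the helper name `best_choices` is made up for this sketch.

```python
# Toy scores from the example above, keyed by the full answer text.
# Assumption: they are treated as plain positive scores for illustration.
scores = {"A. Red": 0.8, "B. Purple": 0.8, "C. Blue": 0.1}


def best_choices(scored):
    """Return every option tied for the highest score (hypothetical helper)."""
    top = max(scored.values())
    return [option for option, value in scored.items() if value == top]


# Raw scores: A and B are tied.
raw_best = best_choices(scores)

# Length-normalized scores: divide each score by the character count of the option.
normalized = {option: value / len(option) for option, value in scores.items()}
norm_best = best_choices(normalized)

print(raw_best)   # ['A. Red', 'B. Purple']
print(norm_best)  # ['A. Red']
```

Dividing by the length breaks the tie in favor of the shorter answer here only because the toy scores are positive.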

In real situations, the normalization is usually done on the tokens of the sequence, and the score variation depends on how you pass the options (the full sequence? only the letter of the choice? ...).
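
As a rough illustration of why the unit of normalization matters, the sketch below scores one invented item three ways: raw summed logprob, logprob per character, and logprob per token. Everything here is an assumption for illustration: the logprob values are made up (not from any model), `whitespace_tokens` is a stand-in for a real tokenizer, and `pick` is a hypothetical helper, not the leaderboard's actual code.

```python
from typing import List


def pick(logprobs: List[float], lengths: List[int], normalize: bool) -> int:
    """Index of the highest-scoring option, optionally dividing by length."""
    scores = [lp / n if normalize else lp for lp, n in zip(logprobs, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)


def whitespace_tokens(text: str) -> List[str]:
    # Hypothetical stand-in for a real model tokenizer.
    return text.split()


# Invented item: summed log-probabilities are negative, so a longer
# answer tends to score lower before any normalization.
options = ["Red", "It is usually red or green"]
logprobs = [-2.0, -13.0]  # made-up numbers, not from any model

char_lengths = [len(o) for o in options]                      # [3, 26]
token_lengths = [len(whitespace_tokens(o)) for o in options]  # [1, 6]

raw_pick = pick(logprobs, char_lengths, normalize=False)   # compares -2.0 and -13.0
char_pick = pick(logprobs, char_lengths, normalize=True)   # compares -2/3 and -13/26
token_pick = pick(logprobs, token_lengths, normalize=True) # compares -2/1 and -13/6

print(raw_pick, char_pick, token_pick)  # 0 1 0
```

With these invented numbers, the raw score and the per-token score both select the short answer, while the per-character score selects the long one, so an accuracy computed over many such items can shift depending on which normalization is used.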

clefourrier changed discussion status to closed Feb 5

feiba54

Feb 5

Thanks for your detailed reply! However, I would still like to know: are the results reported in model release technical reports (like the Llama reports) normalized on answer length? I noticed that for some datasets the acc_norm score is significantly higher than acc, and only acc_norm matches the score reported in their paper.

feiba54 changed discussion status to open Feb 5

clefourrier

Open LLM Leaderboard org Feb 5

Hi!
You have to read each individual technical report of interest to you to see how they define the score they report.

clefourrier changed discussion status to closed Feb 5
