Difference in Elo between HF Leaderboard and Colab Notebook
There is a slight difference between the Elo ratings shown on the HF Leaderboard and the ones calculated by the Colab Notebook.
HF Leaderboard (from the elo_results_20240329.pkl file): [screenshot]
Colab Notebook: [screenshot]
When I run the elo_analysis.py script from the lm-sys/FastChat GitHub repository with the default arguments, I get exactly the same Elo values as the notebook version.
My question is: do you use parameters different from the elo_analysis.py defaults to generate the elo_results_$DATE.pkl files? If so, which ones?
Hey @eduagarcia, thanks for reporting this issue. I investigated it and verified that the data and parameters are exactly the same as in `elo_analysis.py`.
The difference comes from numerical error when solving the MLE problem with logistic regression.
On our machine (this is the one published on 3/29), with `lr = LogisticRegression(fit_intercept=False, penalty=None)`:

```
Number of battles: 511252
claude-3-opus-20240229      1254.64
gpt-4-1106-preview          1251.88
gpt-4-0125-preview          1249.17
bard-jan-24-gemini-pro      1204.35
claude-3-sonnet-20240229    1200.29
```
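To make the setup concrete, here is a minimal, self-contained sketch of the Bradley-Terry MLE fit behind these numbers. The battle rows below are made up, and the log(10) / 400 / 1000 scaling follows the standard Elo convention; this is a simplified illustration, not the full `elo_analysis.py` pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle log: (model_a, model_b, winner). The real input is
# the Arena battle data loaded by elo_analysis.py; these rows are made up.
battles = [
    ("model-x", "model-y", "model_a"),
    ("model-x", "model-z", "model_a"),
    ("model-y", "model-z", "model_a"),
    ("model-y", "model-x", "model_a"),
    ("model-z", "model-x", "model_b"),
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# One row per battle: +log(10) in model_a's column, -log(10) in model_b's.
# With this encoding the Bradley-Terry MLE is a plain logistic regression,
# and 400 * coefficient lands on the familiar Elo scale.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]] = np.log(10)
    X[row, idx[b]] = -np.log(10)
    y[row] = 1.0 if winner == "model_a" else 0.0

# The estimator discussed in this thread; tol is left at its 1e-4 default.
lr = LogisticRegression(fit_intercept=False, penalty=None)
lr.fit(X, y)

# Shift by 1000 so ratings sit on the usual leaderboard scale.
elo = 1000 + 400.0 * lr.coef_[0]
for name in sorted(models, key=lambda m: -elo[idx[m]]):
    print(f"{name:12s} {elo[idx[name]]:8.2f}")
```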
When I set a tighter tolerance of 1e-8, with `lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8)`:

```
Number of battles: 511252
claude-3-opus-20240229      1254.40
gpt-4-1106-preview          1251.44
gpt-4-0125-preview          1248.89
bard-jan-24-gemini-pro      1204.28
claude-3-sonnet-20240229    1200.06
```
it matches the one in the notebook.
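Continuing the sketch above (reusing its `X` and `y`), the tighter tolerance is a one-line change, and scikit-learn's `n_iter_` attribute shows the solver taking more steps before it stops:

```python
# Same model with a tighter stopping criterion (the default tol is 1e-4).
lr_tight = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8)
lr_tight.fit(X, y)

# n_iter_ reports the lbfgs iteration count; with a looser tol the solver
# stops earlier, which is where the small machine-dependent drift comes from.
print(lr_tight.n_iter_)
```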
I'll update our code to set a tighter tolerance in our next release.
Does that make sense to you, @eduagarcia?
Got it, it does.
Thank you for your time.