I tried to plot AGI on the same Elo scale by comparing to "both bad" and "tie" votes

#67 opened by endolith

(Or rather, I had an LLM write it for me, but another LLM checked it and said it was correct, so...)

When a battle is voted a tie, the "ideal model" is treated as having tied with both contestants. When a battle is voted "both bad", the ideal model is treated as having beaten both. Its rating therefore acts as an upper bound on this Elo scale, and since the judgments come from humans, a model that consistently scored that well would be roughly human-equivalent?

https://gist.github.com/endolith/e001d8b7811699cf9be822a774e7cb67
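For concreteness, here is a minimal sketch of the idea (not the gist's actual code): an online Elo loop over (model_a, model_b, winner) battle records, where a hypothetical "ideal model" entry is credited with a tie against both contestants on plain ties and a win against both on "both bad" votes. The battle tuple format, K-factor, initial rating, and the `"tie (bothbad)"` label are assumptions.

```python
# Minimal sketch: online Elo with an injected "ideal model" (assumed setup).
from collections import defaultdict

K = 4        # assumed K-factor
BASE = 10
SCALE = 400
INIT = 1000  # assumed initial rating

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1 / (1 + BASE ** ((r_b - r_a) / SCALE))

def update(ratings, a, b, score_a):
    """One Elo update; score_a is 1 (A wins), 0 (B wins), or 0.5 (tie)."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1 - score_a) - (1 - e_a))

def compute_elo(battles):
    """battles: iterable of (model_a, model_b, winner), where winner is
    'model_a', 'model_b', 'tie', or 'tie (bothbad)' (assumed labels)."""
    ratings = defaultdict(lambda: INIT)
    for model_a, model_b, winner in battles:
        # Ordinary update between the two real contestants.
        if winner == "model_a":
            update(ratings, model_a, model_b, 1.0)
        elif winner == "model_b":
            update(ratings, model_a, model_b, 0.0)
        else:
            update(ratings, model_a, model_b, 0.5)

        # Inject the hypothetical "ideal model" only on tie-type votes,
        # matching the description above.
        if winner == "tie (bothbad)":
            # "Both bad": the ideal model is credited with beating both.
            update(ratings, "ideal model", model_a, 1.0)
            update(ratings, "ideal model", model_b, 1.0)
        elif winner == "tie":
            # Plain tie: the ideal model is credited with tying both.
            update(ratings, "ideal model", model_a, 0.5)
            update(ratings, "ideal model", model_b, 0.5)
    return ratings

# Example usage with made-up battles:
battles = [
    ("gpt-4", "llama-13b", "model_a"),
    ("gpt-4", "vicuna-13b", "tie"),
    ("llama-13b", "vicuna-13b", "tie (bothbad)"),
]
print(dict(compute_elo(battles)))
```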

Figure: scatter plot of median Elo ratings with 95% confidence interval error bars for LMSYS Chatbot Arena models (GPT-4, LLaMA, and others), ordered from highest-rated on the left to lowest on the right.
