What benchmarks are used for the evaluation?

#6
by zhiminy - opened

Are MMLU, Belebele, HellaSwag, LAMBADA, XCOPA, and ARC Challenge all the evaluation benchmarks?

The SQUAD benchmark from the previous version has been removed in the latest one... why?

mii-llm org

Hi,
MMLU, HellaSwag, and ARC are the main benchmarks because they are what the team at Mistral uses to evaluate their models in Italian.
There are other evals in the "eval aggiuntive" (additional evals) tab.
The SQUAD benchmark has been moved to the "classifica rag" (RAG leaderboard) tab.

We plan to add more and better evals in the future!

zhiminy changed discussion status to closed

Thanks, @FinancialSupport! Why not mention this in the documentation? That way, future readers like me would know the SQUAD dataset is used for the RAG evaluation tasks.
image.png

zhiminy changed discussion status to open
mii-llm org

I kind of wanted not to publicize it too much, since I know some people who train on SQUAD and I wanted to avoid contamination!

FinancialSupport changed discussion status to closed
