What benchmarks are used for the evaluation?

#6
by zhiminy - opened

Are MMLU, Belebele, HellaSwag, LAMBADA, XCOPA, and ARC Challenge all the evaluation benchmarks?

The SQUAD benchmark from the previous version has been removed in the latest one... why?

mii-llm org

Hi,
MMLU, HellaSwag, and ARC are the main benchmarks because they are what the team at Mistral uses to evaluate their models in Italian.
There are other evals in the "eval aggiuntive" (additional evals) tab.
The SQUAD benchmark has been moved to the "classifica rag" (RAG leaderboard) tab.

We plan to add more and better evals in the future!

zhiminy changed discussion status to closed

Thanks, @FinancialSupport! Why not mention this in the documentation? That way, future readers like me would know the SQUAD dataset is used for the RAG evaluation tasks.
image.png

zhiminy changed discussion status to open
mii-llm org

I kind of wanted not to publicize it too much, since I know some people who train on SQUAD and I wanted to avoid contamination!

FinancialSupport changed discussion status to closed
