What is the command used to evaluate on MMLU?

#3
by PY007 - opened

Thanks for open-sourcing the model and dataset, and congrats on the release!

May I ask which command was used to evaluate on MMLU?

I tried

accelerate launch --num_processes 8 -m lm_eval \
        --model_args pretrained=HuggingFaceTB/cosmo-1b,dtype=bfloat16,use_flash_attention_2=True \
        --tasks mmlu --num_fewshot 5 \
        --batch_size 16

and got the following results:

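| Groups            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|-------------------|---------|--------|--------|--------|--------|----------|
| mmlu              | N/A     | none   | 0      | acc    | 0.2608 | ± 0.0397 |
| - humanities      | N/A     | none   | 5      | acc    | 0.2544 | ± 0.0289 |
| - other           | N/A     | none   | 5      | acc    | 0.2671 | ± 0.0414 |
| - social_sciences | N/A     | none   | 5      | acc    | 0.2548 | ± 0.0401 |
| - stem            | N/A     | none   | 5      | acc    | 0.2699 | ± 0.0491 |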
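
For context on how to read these numbers: MMLU questions have four answer choices, so random guessing gives about 25% accuracy. The sketch below is not part of the evaluation harness. It is a rough check of whether each reported accuracy is distinguishable from chance, and it assumes the Stderr column can be treated as the standard error of the accuracy under a normal approximation. The helper names `z_from_chance` and `within_chance` are made up for illustration.

```python
# Reported accuracies and standard errors, copied from the results above.
results = {
    "mmlu": (0.2608, 0.0397),
    "humanities": (0.2544, 0.0289),
    "other": (0.2671, 0.0414),
    "social_sciences": (0.2548, 0.0401),
    "stem": (0.2699, 0.0491),
}

CHANCE = 0.25  # four answer choices per MMLU question


def z_from_chance(acc, stderr, chance=CHANCE):
    """Number of standard errors by which acc differs from chance."""
    return (acc - chance) / stderr


def within_chance(acc, stderr, chance=CHANCE, z_crit=1.96):
    """True if acc lies inside an approximate 95% band around chance.

    Assumption: a normal approximation with the reported stderr.
    """
    return abs(z_from_chance(acc, stderr, chance)) < z_crit


for group, (acc, stderr) in results.items():
    z = z_from_chance(acc, stderr)
    print(f"{group:16s} acc={acc:.4f} z={z:+.2f} "
          f"within_chance={within_chance(acc, stderr)}")
```

Under these assumptions every group falls well inside the chance band, which is why the results look more like random guessing than a small shortfall from the leaderboard score.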

Scores on the Open LLM Leaderboard:

[Image: Screenshot 2024-02-24 at 1.16.52 PM.png]

Hugging Face TB Research org
edited Mar 4

Thanks for pointing this out. The model was evaluated before we converted it from our training framework to transformers, so something may have gone wrong in the conversion. We'll run some tests.