Inquiry Regarding Accuracy Calculation for GSM8K Metric

#625
by qwerwxy - opened

I am writing to ask how accuracy is calculated for the GSM8K metric, since the reported values are very low for many models. For instance, on the open_llm_leaderboard, the GSM8K score for Aquila2-34B is recorded as 0.61%. However, when I reviewed the per-sample results at this link (https://huggingface.co/datasets/open-llm-leaderboard/details_BAAI__Aquila2-34B/blob/main/2024-01-15T18-37-14.451844/details_harness%7Cgsm8k%7C5_2024-01-15T18-37-14.451844.parquet), I calculated the accuracy to be 56.25%.

To help me understand this significant discrepancy and troubleshoot it, could you open-source the script used for the accuracy calculation?

Thank you for your attention to this matter.

Open LLM Leaderboard org

Hi @qwerwxy ,
Super cool that you took a look at the details!

All the code we use is open, and you can reproduce our results by following the steps in the Reproducibility section of the About tab of the Open LLM Leaderboard.
Please be aware that GSM8K expects the result in a very specific format, #### answer, and will penalize models that do not answer in that format.
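
To illustrate how much a strict answer format can change the score, here is a minimal sketch. It is not the leaderboard's actual scoring code: it assumes the final answer must follow a `#### ` marker, as in the GSM8K reference solutions, and compares that with a lenient scorer that takes the last number in the generation. The function names and the sample generations are hypothetical.

```python
import re

# Assumption: the scorer only accepts a number that follows "#### ",
# the marker used in the GSM8K reference solutions.
STRICT_RE = re.compile(r"#### (-?[0-9.,]+)")
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
INVALID = "[invalid]"


def extract_strict(text: str) -> str:
    """Return the number after '#### ', or INVALID if the marker is absent."""
    match = STRICT_RE.search(text)
    if match is None:
        return INVALID
    return match.group(1).replace(",", "")


def extract_lenient(text: str) -> str:
    """Return the last number found anywhere in the text (hypothetical alternative)."""
    numbers = NUMBER_RE.findall(text)
    if not numbers:
        return INVALID
    return numbers[-1].replace(",", "")


def accuracy(predictions, references, extract) -> float:
    """Exact-match accuracy; references are always parsed with the strict rule."""
    correct = sum(
        extract(pred) == extract_strict(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Made-up examples: every generation is numerically right,
# but only the first one uses the expected marker.
references = ["... #### 18", "... #### 3", "... #### 70,000", "... #### 540"]
predictions = [
    "She sells 9 eggs at $2 each.\n#### 18",
    "The answer is 3.",
    "So the profit is $70,000.",
    "Therefore he runs 540 meters",
]

print(accuracy(predictions, references, extract_strict))   # 0.25
print(accuracy(predictions, references, extract_lenient))  # 1.0
```

If the 56.25% figure was obtained by reading the numeric answer out of each generation by hand or with a lenient rule, a format-sensitive scorer like the strict one above could plausibly account for part of the gap. Confirming this would require rerunning the steps in the Reproducibility section.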

Open LLM Leaderboard org

Hi!
Closing this issue since it's been inactive for a week, but feel free to reopen if you have more questions.

clefourrier changed discussion status to closed
