Cannot reproduce the gsm8k accuracy of mncai/Llama2-7B-guanaco-dolphin-500
With batch size = 1, the result I got was 13.12, while the reported score is 5.99.
I was using: `python main.py --model=hf-causal-experimental --model_args="pretrained=mncai/Llama2-7B-guanaco-dolphin-500" --tasks=gsm8k --num_fewshot=5 --batch_size=1 --no_cache`
I also found that different batch size settings result in different accuracy.
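For reference, here is a minimal sketch of how one could sweep batch sizes with the same command and log the score for each run. The path to `main.py`, the `--output_path` flag, and the JSON output layout are assumptions on my side and may differ depending on the harness commit:

```python
import json
import subprocess

# Hypothetical sweep: rerun the exact command above with different batch sizes
# and collect the gsm8k accuracy from each run's JSON output.
# Assumes the harness accepts --output_path and writes {"results": {"gsm8k": {"acc": ...}}}.
for bs in (1, 2, 8):
    out_file = f"gsm8k_bs{bs}.json"
    subprocess.run(
        [
            "python", "main.py",
            "--model=hf-causal-experimental",
            "--model_args=pretrained=mncai/Llama2-7B-guanaco-dolphin-500",
            "--tasks=gsm8k",
            "--num_fewshot=5",
            f"--batch_size={bs}",
            "--no_cache",
            f"--output_path={out_file}",
        ],
        check=True,
    )
    with open(out_file) as f:
        acc = json.load(f)["results"]["gsm8k"]["acc"]
    print(f"batch_size={bs}: acc={acc:.4f}")
```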
Hi! Did you use the specific commit we report on our About page, and the same precision as the evaluation mentioned above?
If yes, could you please link to the request and result files?
@clefourrier any update?
Hi @zhentaocc,
Please follow the steps in the FAQ (About tab of the leaderboard) to find the request and results files for your specific model of interest.
Can you also confirm that you used the same commit as we did?
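If it helps, here is a rough sketch of how those files could be listed programmatically; the repo names and file layout below are assumptions on my side, so the FAQ remains the authoritative reference:

```python
from huggingface_hub import HfApi

# Sketch: look for the leaderboard request/result files for a given model,
# assuming they live in the "open-llm-leaderboard/requests" and
# "open-llm-leaderboard/results" dataset repos (check the FAQ for the exact layout).
api = HfApi()
model = "mncai/Llama2-7B-guanaco-dolphin-500"

for repo in ("open-llm-leaderboard/requests", "open-llm-leaderboard/results"):
    files = api.list_repo_files(repo, repo_type="dataset")
    matches = [f for f in files if model.split("/")[1] in f]
    print(repo, matches)
```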
Yes, I used the same commit. @clefourrier
You can find your result file by following those steps; could you link it so that we can take a look?
any update here? @SaylorTwift @clefourrier
Hi!
Can you compare the predictions of your run to the detailed predictions stored here for this model?
> I also found that different batch size settings result in different accuracy.
Yes, this is a known issue of the harness at this commit.
Hi @zhentaocc,
I meant actually logging the different predictions you get for each sample :)
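A rough sketch of how the per-sample predictions could be dumped and compared against the stored details dataset; the config name, split name, and column names here are guesses, so please check the dataset viewer for the actual ones:

```python
from datasets import get_dataset_config_names, load_dataset

details_repo = "open-llm-leaderboard/details_mncai__Llama2-7B-guanaco-dolphin-500"

# Discover which configs exist; the gsm8k one is usually named something like "harness_gsm8k_5".
configs = get_dataset_config_names(details_repo)
gsm8k_cfg = next(c for c in configs if "gsm8k" in c)

# The split name and the prediction column are assumptions; inspect the dataset to confirm.
details = load_dataset(details_repo, gsm8k_cfg, split="latest")
for i, row in enumerate(details.select(range(5))):
    print(i, row.get("predictions") or row.get("prediction"))
```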
https://huggingface.co/datasets/open-llm-leaderboard/details_mncai__Llama2-7B-guanaco-dolphin-500/discussions/2
@clefourrier
I can reproduce similar results on both CPU and GPU.
Did you try running the benchmark for this model yourself, and what result did you get? Are you able to reproduce the reported number?
Hi @zhentaocc,
Thanks a lot for providing the details of your outputs! 🙏
They allowed me to pinpoint the problem: looking in detail at the differences between your outputs and ours, it seems the outputs on our side were truncated on `.\n` too early.
It's a known bug we identified last year for some models and fixed in December by re-running 150 models (a mistake on our side: we accidentally used a test version in prod, and we communicated about it on Twitter at the time).
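To illustrate what that early truncation does (a toy example, not the actual harness code): gsm8k is scored on the final number produced after the chain of thought, so cutting the generation at the first `.\n` drops the answer entirely.

```python
import re

# Toy illustration of the truncation bug (not the actual harness code):
# a GSM8K-style generation where the final answer comes after several reasoning lines.
generation = (
    "Natalia sold 48 clips in April.\n"
    "In May she sold 48 / 2 = 24 clips.\n"
    "In total she sold 48 + 24 = 72 clips.\n"
    "#### 72"
)

def extract_answer(text: str) -> str | None:
    """Grab the last number in the text, roughly how gsm8k answers are scored."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

# Correct stopping: the full generation is kept and the final answer (72) is found.
print(extract_answer(generation))        # "72"

# Buggy early truncation on ".\n": only the first sentence survives, giving a wrong answer.
truncated = generation.split(".\n")[0]
print(extract_answer(truncated))         # "48"
```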
I'm very sorry we missed your model! I relaunched its evaluations.
Closing as the problem was identified, we'll re-check all results file to make sure no other models fell through.
Feel free to reopen if you need.
@clefourrier I see the new result now; it's much more reasonable. But it's still slightly different from my result: 13.12 vs 12.74?
This is the kind of difference that falls within the expected margin between different hardware setups, for example; I'm not surprised.