Cannot reproduce the gsm8k accuracy of mncai/Llama2-7B-guanaco-dolphin-500
With batch size = 1, the result I got was 13.12, while the reported score is 5.99.
I was using: `python main.py --model=hf-causal-experimental --model_args="pretrained=mncai/Llama2-7B-guanaco-dolphin-500" --tasks=gsm8k --num_fewshot=5 --batch_size=1 --no_cache`
I also found that different batch size settings result in different accuracy.
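For reference, here is a minimal sketch of how one could sweep batch sizes with the same command and log the score for each run. The path to `main.py`, the `--output_path` flag, and the JSON output layout are assumptions on my side and may differ depending on the harness commit:

```python
import json
import subprocess

# Hypothetical sweep: rerun the exact command above with different batch sizes
# and collect the gsm8k accuracy from each run's JSON output.
# Assumes the harness accepts --output_path and writes {"results": {"gsm8k": {"acc": ...}}}.
for bs in (1, 2, 8):
    out_file = f"gsm8k_bs{bs}.json"
    subprocess.run(
        [
            "python", "main.py",
            "--model=hf-causal-experimental",
            "--model_args=pretrained=mncai/Llama2-7B-guanaco-dolphin-500",
            "--tasks=gsm8k",
            "--num_fewshot=5",
            f"--batch_size={bs}",
            "--no_cache",
            f"--output_path={out_file}",
        ],
        check=True,
    )
    with open(out_file) as f:
        acc = json.load(f)["results"]["gsm8k"]["acc"]
    print(f"batch_size={bs}: acc={acc:.4f}")
```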
Hi! Did you use the specific commit we report on our About page, and the same precision as the evaluation mentioned above?
If yes, could you please link to the request and result files?
@clefourrier any update?
Hi @zhentaocc,
Please follow the steps in the FAQ (About tab of the leaderboard) to find the request and results files for your specific model of interest.
Can you also confirm that you used the same commit as we did?
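If it helps, here is a rough sketch of how those files could be listed programmatically; the repo names and file layout below are assumptions on my side, so the FAQ remains the authoritative reference:

```python
from huggingface_hub import HfApi

# Sketch: look for the leaderboard request/result files for a given model,
# assuming they live in the "open-llm-leaderboard/requests" and
# "open-llm-leaderboard/results" dataset repos (check the FAQ for the exact layout).
api = HfApi()
model = "mncai/Llama2-7B-guanaco-dolphin-500"

for repo in ("open-llm-leaderboard/requests", "open-llm-leaderboard/results"):
    files = api.list_repo_files(repo, repo_type="dataset")
    matches = [f for f in files if model.split("/")[1] in f]
    print(repo, matches)
```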
Yes, I used the same commit. @clefourrier
You can find your result file by following those steps; could you link it so that we can take a look?
any update here? @SaylorTwift @clefourrier
Hi!
Can you compare the predictions of your run to the detailed predictions stored here for this model?
> I also found that different batch size settings result in different accuracy.
Yes, this is a known issue of the harness at this commit.
Hi @zhentaocc,
I meant actually logging the different predictions you get for each sample :)
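A rough sketch of how the per-sample predictions could be dumped and compared against the stored details dataset; the config name, split name, and column names here are guesses, so please check the dataset viewer for the actual ones:

```python
from datasets import get_dataset_config_names, load_dataset

details_repo = "open-llm-leaderboard/details_mncai__Llama2-7B-guanaco-dolphin-500"

# Discover which configs exist; the gsm8k one is usually named something like "harness_gsm8k_5".
configs = get_dataset_config_names(details_repo)
gsm8k_cfg = next(c for c in configs if "gsm8k" in c)

# The split name and the prediction column are assumptions; inspect the dataset to confirm.
details = load_dataset(details_repo, gsm8k_cfg, split="latest")
for i, row in enumerate(details.select(range(5))):
    print(i, row.get("predictions") or row.get("prediction"))
```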
https://huggingface.co/datasets/open-llm-leaderboard/details_mncai__Llama2-7B-guanaco-dolphin-500/discussions/2
@clefourrier
I can reproduce similar results on both CPU and GPU.
Did you try running the benchmark for this model yourself, and what result did you get? Are you able to reproduce the reported number?
Hi @zhentaocc,
Thanks a lot for providing the details of your outputs! 🙏
They allowed me to pinpoint the problem: looking in detail at the differences between your outputs and ours, it seems the outputs on our side were truncated on `.\n` too early.
It's a known bug we identified last year for some models and fixed in December by re-running 150 models (a mistake on our side: we accidentally used a test version in prod, and we communicated about it on Twitter at the time).
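To illustrate what that early truncation does (a toy example, not the actual harness code): gsm8k is scored on the final number produced after the chain of thought, so cutting the generation at the first `.\n` drops the answer entirely.

```python
import re

# Toy illustration of the truncation bug (not the actual harness code):
# a GSM8K-style generation where the final answer comes after several reasoning lines.
generation = (
    "Natalia sold 48 clips in April.\n"
    "In May she sold 48 / 2 = 24 clips.\n"
    "In total she sold 48 + 24 = 72 clips.\n"
    "#### 72"
)

def extract_answer(text: str) -> str | None:
    """Grab the last number in the text, roughly how gsm8k answers are scored."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

# Correct stopping: the full generation is kept and the final answer (72) is found.
print(extract_answer(generation))        # "72"

# Buggy early truncation on ".\n": only the first sentence survives, giving a wrong answer.
truncated = generation.split(".\n")[0]
print(extract_answer(truncated))         # "48"
```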
I'm very sorry we missed your model! I relaunched its evaluations.
Closing as the problem was identified, we'll re-check all results file to make sure no other models fell through.
Feel free to reopen if you need.
@clefourrier I see the new result now; it's much more reasonable. But it's still slightly different from my result: 13.12 vs 12.74?
This is the kind of difference that falls within the expected margin between different hardware setups, for example; I'm not surprised.