Difference in HF evaluation and local evaluation
Hi, I tried to evaluate the model "ceadar-ie/FinanceConnect-13B" by submitting it to the Hugging Face leaderboard, but I got poor results and 0 for TruthfulQA. When I ran the evaluation locally with the EleutherAI Language Model Evaluation Harness, I got much better results (which is what I expected).
For example:
MMLU score on HF: 23.12
MMLU score from the local run: 0.4859 ± 0.1189
Is there any particular reason for this?
Hi, thank you for your interest in the leaderboard.
Did you use the same parameters for evaluation as us, following the steps in the About for reproducibility?
Hi, thanks for your reply. No, I was using the default parameters earlier, but after you mentioned it, I ran the evaluation with those parameters as well.
It turns out that our model performs best with load_in_8bit=True because of the way it has been fine-tuned.
Is there a way to set this parameter when evaluating for the leaderboard?
If not, is it possible to remove the current evaluation from the leaderboard? I'll upload new weights and then submit the model again. Also, thinking that there was an issue in the evaluation, I resubmitted the model and it is now on the waiting list; can you remove that as well for now, so other models can be processed in the meantime?
Thanks a lot again for your help! Keep up the good work at HuggingFace :)
Hi! You can just select your model with precision 8bit and it will run with load_in_8bit=True using bitsandbytes :)
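For reference, loading the model in 8-bit locally looks roughly like this (a minimal sketch with transformers and bitsandbytes; the leaderboard's internal loading code may differ):

```python
# Minimal sketch: load the model in 8-bit with transformers + bitsandbytes.
# Not the leaderboard's actual loading code; it only illustrates load_in_8bit.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ceadar-ie/FinanceConnect-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # requires accelerate to be installed
)
```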
Can you point me to the request files of your model so I can remove the wrong submission?
Thanks for the quick reply.
Perfect! In that case, the second eval that I resubmitted should work fine, since I set its precision to 8bit.
It's just the first evaluation, currently shown on the leaderboard, that is misleading. I'm assuming it will be overwritten after the second eval, right?
https://huggingface.co/datasets/open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B
Model: https://huggingface.co/ceadar-ie/FinanceConnect-13B
Thanks.
It won't be overwritten, as we usually keep everything for transparency. Your model would appear with both precisions in the leaderboard, but if you want, I can remove it instead.
Yes, please remove the current float16 version from the leaderboard. It was unintended, as this was my first time working with the leaderboard.
It's been done, please wait for up to an hour for results to propagate to the leaderboard!
Thank you for your interest :)
Hi,
So the 8-bit evaluation is now done and all benchmarks match my local evaluation. The only difference I'm seeing is for TruthfulQA: it says 0 on the leaderboard. My evaluation results are attached below (all parameters are the same as mentioned in the About section of the leaderboard).
TruthfulQA: 0-shot, truthfulqa-mc (mc2) should be around 37. I'm not sure why it's 0 in the Hugging Face evaluation.
Also, the 16-bit model has not yet been deleted from the leaderboard; could you please have a look at this and delete that evaluation as well?
Thanks a lot for all the help!
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_gen | Yaml | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_mc1 | Yaml | none | 0 | acc | 0.2448 | ± | 0.0151 |
| - truthfulqa_mc2 | Yaml | none | 0 | acc | 0.3780 | ± | 0.0152 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
Hi, did you check the difference between the details of your model and your local generations? And what are you using to load your model in 8-bit?
Hi, there are no differences in the model details/parameters/loading/files between the leaderboard eval and what I'm evaluating locally.
I'm using the About tab of the leaderboard to replicate the benchmarks: in the LLM evaluation harness, I'm just setting load_in_8bit=True. I believe that pipeline is the same as the leaderboard's pipeline, right?
The following command gets TruthfulQA to around 37 when run locally:
python main.py --model=hf --model_args="pretrained=ceadar-ie/FinanceConnect-13B,load_in_8bit=True" --tasks=truthfulqa --num_fewshot=0 --batch_size=1
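For completeness, roughly the same run through the harness's Python API (a sketch; the argument and model-type names follow recent lm-eval-harness releases and may differ in older checkouts):

```python
# Sketch of the same evaluation via the lm-eval-harness Python API.
# Names follow recent releases (lm_eval >= 0.4); older versions differ.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=ceadar-ie/FinanceConnect-13B,load_in_8bit=True",
    tasks=["truthfulqa"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"])  # per-task metrics, including the truthfulqa_mc2 accuracy
```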
Leaderboard - all evals of the 8-bit model look perfect; the only issue is with TruthfulQA. If it can be re-run, that would be great.
Leaderboard - can you please delete the 16-bit model from the leaderboard? That would be great.
If TruthfulQA in 8-bit cannot be re-run on the leaderboard, that's fine. I'll work on the next model from now on.
Thanks and regards.
Hi!
Setting load_in_8bit=True is the same as what we do :)
The details files (here) contain all the outputs of the model for all its inputs - did you compare them, and did you get the same predictions locally? Looking at this file will allow us to better investigate where such a difference could come from.
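In case it helps with the comparison, pulling the details dataset locally can look roughly like this (a sketch; the config and split names are guesses, so check the dataset page linked above for the exact ones):

```python
# Sketch: load the leaderboard "details" dataset to compare its recorded
# predictions with a local run. The config and split names below are guesses;
# check the dataset page for the exact values.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B",
    "harness_truthfulqa_mc_0",  # hypothetical config name
    split="latest",             # hypothetical split name
)
print(details[0])  # one example: prompt, choices, and the model's recorded outputs
```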
I won't delete the 16-bit models, as they are not under your username, unless @ceadar-ie can confirm such a deletion is requested.
Hi! I'm sorry if I have not communicated the issue properly. The attached screenshot is from the leaderboard.
The top one is the 8-bit precision model eval - TruthfulQA is 0. That is the issue (I have no idea why it says 0 here; it should be around 37). The table I posted above in this thread shows the local eval results for the 8-bit model.
The file here has the correct results:
{
"all": {
"mc1": 0.2484700122399021,
"mc1_stderr": 0.015127427096520672,
"mc2": 0.37682302005478885,
"mc2_stderr": 0.015200964572751172
},
"harness|truthfulqa:mc|0": {
"mc1": 0.2484700122399021,
"mc1_stderr": 0.015127427096520672,
"mc2": 0.37682302005478885,
"mc2_stderr": 0.015200964572751172
}
}
The bottom one is the 16-bit model eval - I'll comment below from the @ceadar-ie ID to confirm the deletion of this model from the leaderboard.
Hi! We confirm the deletion of the 16-bit model from the leaderboard.
Thanks.
I see!
The score reported was 0 because the computed mc2 score contained NaNs. I see that @SaylorTwift fixed it; I'll let him handle the rest of the discussion (we'll probably need to relaunch your 8-bit model entirely, since the evaluation happened on two different commits).
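For anyone hitting the same symptom, here is a tiny illustration of how a single NaN can wipe out an aggregate score (hypothetical numbers; the actual fix was on the leaderboard side):

```python
# Illustration only (not the leaderboard's code): one NaN among the per-example
# mc2 scores poisons a plain mean, which can then surface as a 0/NaN aggregate,
# while a NaN-aware mean still returns a sensible value.
import numpy as np

per_example_mc2 = np.array([0.41, 0.33, np.nan, 0.38])

naive_mean = per_example_mc2.mean()            # -> nan
nan_aware_mean = np.nanmean(per_example_mc2)   # -> mean over the valid entries only

print(naive_mean, nan_aware_mean)
```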
I think @SaylorTwift relaunched your models. I'm going to close the discussion for now, but feel free to reopen if needed.