Difference in HF evaluation and local evaluation

#424
by sham-zafar - opened

Hi, I tried to evaluate the model "ceadar-ie/FinanceConnect-13B" by submitting it to Hugging Face, but I got poor results and 0 for TruthfulQA. When I ran the evaluation locally using the EleutherAI Language Model Evaluation Harness, I got much better results (which is what I expected).

For example:
MMLU score on HF: 23.12
MMLU from my local run: 0.4859 ± 0.1189

Is there any particular reason for this?

Open LLM Leaderboard org

Hi, thank you for your interest in the leaderboard.

Did you use the same evaluation parameters as us, following the reproducibility steps in the About section?
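
(For reference, a minimal sketch of reproducing a single leaderboard task locally, assuming lm-evaluation-harness v0.4+ and its Python API; the authoritative task list, few-shot counts, and harness commit are the ones given in the About section.)

```python
# Sketch: reproduce one leaderboard task locally with the EleutherAI harness
# Python API (assumes lm-eval >= 0.4; task names and settings should be taken
# from the leaderboard's About section).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=ceadar-ie/FinanceConnect-13B",
    tasks=["truthfulqa_mc2"],  # TruthfulQA is run 0-shot on the leaderboard
    num_fewshot=0,
    batch_size=1,
)
print(results["results"])
```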

Hi, thanks for your reply. No, I was using the default parameters earlier, but after you mentioned it I evaluated with those parameters as well.

It turns out that our model performs best with load_in_8bit=True because of the way it has been fine-tuned.

Is there a way to set this parameter when evaluating for the leaderboard?

If not, is it possible to remove the current evaluation from the leaderboard? I'll upload new weights and then submit the model again. Also, thinking there was an issue in the evaluation, I resubmitted the model and it is on the waiting list; can you remove that as well for now, so other models can be processed in the meantime?

Thanks a lot again for your help! Keep up the good work at HuggingFace :)

Open LLM Leaderboard org

Hi! You can just select your model with the 8bit precision and it will run with load_in_8bit=True using bitsandbytes :)
Can you point me to the request files of your model so I can remove the wrong submission?
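
(For context, a sketch of what selecting "precision: 8bit" roughly corresponds to on the model-loading side, using the standard transformers + bitsandbytes path; this is not the leaderboard's exact backend code.)

```python
# Sketch: loading a checkpoint in 8-bit with bitsandbytes via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ceadar-ie/FinanceConnect-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes; weights quantized to int8
    device_map="auto",
)
```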

Thanks for the quick reply.

Perfect! In that case, the second eval that I resubmitted with the precision set to 8bit should work fine now.

Just the first evaluation, which is currently available on the leaderboard, is misleading. I'm assuming it will be overwritten after the second eval, right?

https://huggingface.co/datasets/open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B

Model: https://huggingface.co/ceadar-ie/FinanceConnect-13B

Thanks.

Open LLM Leaderboard org

It won't be overwritten, as we usually keep everything for transparency. Your model would appear with both precisions in the leaderboard, but if you want, I can remove it instead.

Yes, please remove the current float16 version on the leaderboard. It was unintended, as this was my first time working with the leaderboard.

Open LLM Leaderboard org

It's been done, please wait for up to an hour for results to propagate to the leaderboard!
Thank you for your interest :)

clefourrier changed discussion status to closed

Hi,
So the 8-bit evaluation is now done and all benchmarks match my local evaluation. The only difference is TruthfulQA: it says 0 on the leaderboard. My evaluation results are attached below (all parameters are the same as in the About section of the leaderboard).

TruthfulQA: 0-shot, truthfulqa-mc (mc2) should be around 37. I'm not sure why it's 0 with the Hugging Face evaluation.

Also, the 16-bit model has not yet been deleted from the leaderboard; can you please have a look and delete that evaluation as well?

Thanks a lot for all the help!

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_gen | Yaml | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_mc1 | Yaml | none | 0 | acc | 0.2448 | ± | 0.0151 |
| - truthfulqa_mc2 | Yaml | none | 0 | acc | 0.3780 | ± | 0.0152 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
sham-zafar changed discussion status to open
Open LLM Leaderboard org
edited Dec 12, 2023

Hi, did you check for differences between the details of your model and your local generations? And what are you using to load your model in 8-bit?

Hi, there are no differences in the model details/parameters/loading/files between the leaderboard eval and what I'm evaluating locally.

I'm using the About tab of the leaderboard to replicate the benchmarks: in lm-eval-harness, I'm just setting load_in_8bit=True. I believe that pipeline is the same as the leaderboard's pipeline, right?

The following command gets TruthfulQA to around 37 when run locally:
python main.py --model=hf --model_args="pretrained=ceadar-ie/FinanceConnect-13B,load_in_8bit=True" --tasks=truthfulqa --num_fewshot=0 --batch_size=1

Leaderboard - all evals on the 8-bit model are perfect; the only issue is with TruthfulQA. If it can be re-run, that would be great.
Leaderboard - can you please delete the 16-bit model from the leaderboard? That would be great.

If TruthfulQA in 8-bit cannot be re-run on the leaderboard, that's fine. I'll focus on the next model from now on.

Thanks and regards.

Open LLM Leaderboard org
edited Dec 13, 2023

Hi!

Setting load_in_8bit=True is the same as what we do :)

The details files (here) contain all the outputs of the model for all its inputs - did you compare them, and did you get the same predictions locally? Looking at these files will allow us to better investigate where such a difference could come from.
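
(A sketch of how those details can be pulled for comparison against local generations; the config and split names below are assumptions, and the dataset page lists the exact ones available for this model.)

```python
# Sketch: load the leaderboard's per-sample details for this model to diff
# against local harness outputs. Config/split names are assumptions; check
# the dataset card for the exact names.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B",
    "harness_truthfulqa_mc_0",  # assumed config name for the TruthfulQA run
    split="latest",             # assumed split pointing at the newest eval
)
print(details[0])  # inspect the prompt, per-choice log-likelihoods, metrics
```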

I won't delete the 16-bit model, as it is not under your username, unless @ceadar-ie confirms that such a deletion is requested.

Hi! I'm sorry if I have not communicated the issue properly. A screenshot from the leaderboard is attached.

  1. The top one is the 8-bit precision model eval - TruthfulQA is 0. That is the issue (I have no idea why it says 0 here; it should be around 37). The table I posted above in this thread shows the local eval results for the 8-bit model.
     The results file here has the correct values:
     {
       "all": {
         "mc1": 0.2484700122399021,
         "mc1_stderr": 0.015127427096520672,
         "mc2": 0.37682302005478885,
         "mc2_stderr": 0.015200964572751172
       },
       "harness|truthfulqa:mc|0": {
         "mc1": 0.2484700122399021,
         "mc1_stderr": 0.015127427096520672,
         "mc2": 0.37682302005478885,
         "mc2_stderr": 0.015200964572751172
       }
     }

  2. The bottom one is the 16-bit model eval - I'll comment below from the @ceadar-ie account to confirm the deletion of this model from the leaderboard.

[Screenshot attached: Screenshot 2023-12-13 at 9.50.02 PM.png]

Hi! We confirm the deletion of the 16-bit model from the leaderboard.

Thanks.

Open LLM Leaderboard org

I see!
The score reported was 0 because the computed mc2 score contained NaNs - I see that @SaylorTwift fixed it, so I'll let him handle the rest of the discussion (we'll probably need to relaunch your 8-bit model entirely, since the evaluation happened on two different commits).
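
(To illustrate why NaNs surface as a 0 score: TruthfulQA mc2 is averaged over all questions, and a single NaN per-question value makes a plain mean NaN. A toy example with made-up numbers:)

```python
# Toy illustration (hypothetical values): one NaN per-question mc2 score is
# enough to make the aggregated mean NaN, which then shows up as 0.
import numpy as np

per_question_mc2 = np.array([0.41, 0.35, np.nan, 0.38])
print(np.mean(per_question_mc2))     # nan
print(np.nanmean(per_question_mc2))  # 0.38 -- the mean ignoring the bad sample
```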

Open LLM Leaderboard org

I think @SaylorTwift relaunched your models. I'm going to close the discussion for now, but feel free to reopen it if needed.

clefourrier changed discussion status to closed
