Difference in HF evaluation and local evaluation

#424
by sham-zafar - opened

Hi, I tried to evaluate the model "ceadar-ie/FinanceConnect-13B" by submitting it to Hugging Face, but I got poor results and 0 for TruthfulQA. When I ran the evaluation locally using the EleutherAI Language Model Evaluation Harness, I got much better results (which is what I expected).

For example:
MMLU score on HF: 23.12
MMLU from my local run: 0.4859 ± 0.1189

Is there any particular reason for this?

Open LLM Leaderboard org

Hi, thank you for your interest in the leaderboard.

Did you use the same evaluation parameters as us, following the reproducibility steps in the About section?
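
(For reference, a minimal sketch of reproducing a single leaderboard task locally, assuming lm-evaluation-harness v0.4+ and its Python API; the authoritative task list, few-shot counts, and harness commit are the ones given in the About section.)

```python
# Sketch: reproduce one leaderboard task locally with the EleutherAI harness
# Python API (assumes lm-eval >= 0.4; task names and settings should be taken
# from the leaderboard's About section).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=ceadar-ie/FinanceConnect-13B",
    tasks=["truthfulqa_mc2"],  # TruthfulQA is run 0-shot on the leaderboard
    num_fewshot=0,
    batch_size=1,
)
print(results["results"])
```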

Hi, thanks for your reply. No, I was using the default parameters earlier, but after you mentioned it I evaluated with those parameters as well.

It turns out that our model performs best with load_in_8bit=True because of the way it has been fine-tuned.

Is there a way to set this parameter when evaluating for the leaderboard?

If not, is it possible to remove the current evaluation from the leaderboard? I'll upload new weights and then submit the model again. Also, thinking there was an issue in the evaluation, I resubmitted the model and it is on the waiting list; can you remove that as well for now, so other models can be processed in the meantime?

Thanks a lot again for your help! Keep up the good work at HuggingFace :)

Open LLM Leaderboard org

Hi! You can just select your model with the 8bit precision and it will run with load_in_8bit=True using bitsandbytes :)
Can you point me to the request files of your model so I can remove the wrong submission?
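
(For context, a sketch of what selecting "precision: 8bit" roughly corresponds to on the model-loading side, using the standard transformers + bitsandbytes path; this is not the leaderboard's exact backend code.)

```python
# Sketch: loading a checkpoint in 8-bit with bitsandbytes via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ceadar-ie/FinanceConnect-13B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # requires bitsandbytes; weights quantized to int8
    device_map="auto",
)
```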

Thanks for the quick reply.

Perfect! In that case, the second eval that I resubmitted with the precision set to 8bit should work fine now.

Just the first evaluation, which is currently available on the leaderboard, is misleading. I'm assuming it will be overwritten after the second eval, right?

https://huggingface.co/datasets/open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B

Model: https://huggingface.co/ceadar-ie/FinanceConnect-13B

Thanks.

Open LLM Leaderboard org

It won't be overwritten, as we usually keep everything for transparency. Your model would appear with both precisions in the leaderboard, but if you want, I can remove it instead.

Yes, please remove the current float16 version on the leaderboard. It was unintended, as this was my first time working with the leaderboard.

Open LLM Leaderboard org

It's been done, please wait for up to an hour for results to propagate to the leaderboard!
Thank you for your interest :)

clefourrier changed discussion status to closed

Hi,
So the 8-bit evaluation is now done and all benchmarks match my local evaluation. The only difference is TruthfulQA: it says 0 on the leaderboard. My evaluation results are attached below (all parameters are the same as in the About section of the leaderboard).

TruthfulQA: 0-shot, truthfulqa-mc (mc2) should be around 37. I'm not sure why it's 0 with the Hugging Face evaluation.

Also, the 16-bit model has not yet been deleted from the leaderboard; can you please have a look and delete that evaluation as well?

Thanks a lot for all the help!

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_gen | Yaml | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
| - truthfulqa_mc1 | Yaml | none | 0 | acc | 0.2448 | ± | 0.0151 |
| - truthfulqa_mc2 | Yaml | none | 0 | acc | 0.3780 | ± | 0.0152 |

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| truthfulqa | N/A | none | 0 | acc | 0.3114 | ± | 0.0013 |
| | | none | 0 | bleu_max | 18.6447 | ± | 0.6151 |
| | | none | 0 | bleu_acc | 0.3415 | ± | 0.0166 |
| | | none | 0 | bleu_diff | -3.5435 | ± | 0.5246 |
| | | none | 0 | rouge1_max | 46.9436 | ± | 0.7822 |
| | | none | 0 | rouge1_acc | 0.3329 | ± | 0.0165 |
| | | none | 0 | rouge1_diff | -6.8122 | ± | 0.6952 |
| | | none | 0 | rouge2_max | 31.6469 | ± | 0.8847 |
| | | none | 0 | rouge2_acc | 0.3146 | ± | 0.0163 |
| | | none | 0 | rouge2_diff | -7.3906 | ± | 0.8228 |
| | | none | 0 | rougeL_max | 43.2247 | ± | 0.7968 |
| | | none | 0 | rougeL_acc | 0.3341 | ± | 0.0165 |
| | | none | 0 | rougeL_diff | -7.0703 | ± | 0.6971 |
sham-zafar changed discussion status to open
Open LLM Leaderboard org
edited Dec 12, 2023

Hi, did you check for differences between the details of your model and your local generations? And what are you using to load your model in 8-bit?

Hi, there are no differences in the model details/parameters/loading/files between the leaderboard eval and what I'm evaluating locally.

I'm using the About tab of the leaderboard to replicate the benchmarks: in lm-eval-harness, I'm just setting load_in_8bit=True. I believe that pipeline is the same as the leaderboard's pipeline, right?

The following command gets TruthfulQA to around 37 when run locally:
python main.py --model=hf --model_args="pretrained=ceadar-ie/FinanceConnect-13B,load_in_8bit=True" --tasks=truthfulqa --num_fewshot=0 --batch_size=1

Leaderboard - all evals on the 8-bit model are perfect; the only issue is with TruthfulQA. If it can be re-run, that would be great.
Leaderboard - can you please delete the 16-bit model from the leaderboard? That would be great.

If TruthfulQA in 8-bit cannot be re-run on the leaderboard, that's fine. I'll focus on the next model from now on.

Thanks and regards.

Open LLM Leaderboard org
edited Dec 13, 2023

Hi!

Setting load_in_8bit=True is the same as what we do :)

The details files (here) contain all the outputs of the model for all its inputs - did you compare them, and did you get the same predictions locally? Looking at these files will allow us to better investigate where such a difference could come from.
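
(A sketch of how those details can be pulled for comparison against local generations; the config and split names below are assumptions, and the dataset page lists the exact ones available for this model.)

```python
# Sketch: load the leaderboard's per-sample details for this model to diff
# against local harness outputs. Config/split names are assumptions; check
# the dataset card for the exact names.
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_ceadar-ie__FinanceConnect-13B",
    "harness_truthfulqa_mc_0",  # assumed config name for the TruthfulQA run
    split="latest",             # assumed split pointing at the newest eval
)
print(details[0])  # inspect the prompt, per-choice log-likelihoods, metrics
```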

I won't delete the 16-bit model, as it is not under your username, unless @ceadar-ie confirms that such a deletion is requested.

Hi! I'm sorry if I have not communicated the issue properly. A screenshot from the leaderboard is attached.

  1. The top one is the 8-bit precision model eval - TruthfulQA is 0. That is the issue (I have no idea why it says 0 here; it should be around 37). The table I posted above in this thread shows the local eval results for the 8-bit model.
     The results file here has the correct values:
     {
       "all": {
         "mc1": 0.2484700122399021,
         "mc1_stderr": 0.015127427096520672,
         "mc2": 0.37682302005478885,
         "mc2_stderr": 0.015200964572751172
       },
       "harness|truthfulqa:mc|0": {
         "mc1": 0.2484700122399021,
         "mc1_stderr": 0.015127427096520672,
         "mc2": 0.37682302005478885,
         "mc2_stderr": 0.015200964572751172
       }
     }

  2. The bottom one is the 16-bit model eval - I'll comment below from the @ceadar-ie account to confirm the deletion of this model from the leaderboard.

[Screenshot attached: Screenshot 2023-12-13 at 9.50.02 PM.png]

Hi! We confirm the deletion of the 16-bit model from the leaderboard.

Thanks.

Open LLM Leaderboard org

I see!
The score reported was 0 because the computed mc2 score contained NaNs - I see that @SaylorTwift fixed it, so I'll let him handle the rest of the discussion (we'll probably need to relaunch your 8-bit model entirely, since the evaluation happened on two different commits).
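
(To illustrate why NaNs surface as a 0 score: TruthfulQA mc2 is averaged over all questions, and a single NaN per-question value makes a plain mean NaN. A toy example with made-up numbers:)

```python
# Toy illustration (hypothetical values): one NaN per-question mc2 score is
# enough to make the aggregated mean NaN, which then shows up as 0.
import numpy as np

per_question_mc2 = np.array([0.41, 0.35, np.nan, 0.38])
print(np.mean(per_question_mc2))     # nan
print(np.nanmean(per_question_mc2))  # 0.38 -- the mean ignoring the bad sample
```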

Open LLM Leaderboard org

I think @SaylorTwift relaunched your models. I'm going to close the discussion for now, but feel free to reopen it if needed.

clefourrier changed discussion status to closed
