Update on Leaderboard Results - Bug in Macro F1 Score Calculation

#11
by eduagarcia - opened

The team at @maritaca-ai helped me identify a bug affecting some models on tasks where the F1 score is the evaluation metric. The issue occurs when a model generates an invalid response and the F1 score is then computed with a macro average: the code incorrectly included the placeholder "[invalid]" as an additional class in the averaging, which significantly lowered the scores of models that produced any invalid responses on those tasks.

The evaluation code has been updated to exclude the "[invalid]" tag, and the leaderboard results have been revised to reflect the correct values. As a result, some models may have changed in rank and overall average score.
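To illustrate the effect of the bug, here is a minimal sketch (not the actual leaderboard code) of a macro F1 computed two ways: averaging over every label that appears in the predictions, including the "[invalid]" placeholder, versus averaging only over the real gold classes. The helper function and example labels are assumptions for illustration only.

```python
from collections import Counter

def macro_f1(references, predictions, classes):
    """Macro F1 averaged only over the given set of classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if pred == ref:
            tp[ref] += 1
        else:
            fp[pred] += 1
            fn[ref] += 1
    f1s = []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical two-class task with one unparseable model output
refs  = ["pos", "neg", "pos", "neg"]
preds = ["pos", "neg", "[invalid]", "neg"]

# Buggy behaviour: "[invalid]" counted as an extra class with F1 = 0
buggy_classes = sorted(set(refs) | set(preds))
# Fixed behaviour: average only over the real gold classes
fixed_classes = sorted(set(refs))

print(round(macro_f1(refs, preds, buggy_classes), 4))  # 0.5556
print(round(macro_f1(refs, preds, fixed_classes), 4))  # 0.8333
```

Because the placeholder class always scores an F1 of exactly zero, a single invalid response is enough to pull the macro average down for the whole task, which is why the corrected scores below are uniformly higher.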

Below is the list of affected models; models not on this list are not affected.

Model Name	Precision	Old score -> New score
152334H/miqu-1-70b-sf	float16	71.51->73.50
abhishek/autotrain-llama3-orpo-v2	bfloat16	10.62->13.68
ai-forever/mGPT-13B	float16	9.61->12.19
allenai/tulu-2-dpo-70b	bfloat16	72.13->74.13
allknowingroger/MultiverseEx26-7B-slerp	bfloat16	67.70->69.52
argilla/CapybaraHermes-2.5-Mistral-7B	float16	65.08->66.67
automerger/YamshadowExperiment28-7B	bfloat16	67.79->69.62
argilla/notux-8x7b-v1	bfloat16	67.69->73.10
axolotl-ai-co/romulus-mistral-nemo-12b-simpo	bfloat16	69.32->71.97
baichuan-inc/Baichuan2-13B-Chat	bfloat16	43.66->46.84
BAAI/Infinity-Instruct-3M-0625-Mistral-7B	bfloat16	69.01->70.91
BAAI/Infinity-Instruct-3M-0613-Mistral-7B	float16	68.25->70.10
bardsai/jaskier-7b-dpo-v5.6	bfloat16	67.58->69.41
berkeley-nest/Starling-LM-7B-alpha	bfloat16	67.90->69.59
bardsai/jaskier-7b-dpo-v5.6	float16	67.76->69.65
chujiezheng/Llama-3-Instruct-8B-SimPO-ExPO	bfloat16	65.82->67.71
cognitivecomputations/dolphin-2.9.3-mistral-7B-32k	bfloat16	65.03->66.84
cognitivecomputations/openchat-3.5-0106-laser	bfloat16	68.35->70.18
cognitivecomputations/WestLake-7B-v2-laser	bfloat16	67.00->68.80
cognitivecomputations/laserxtral	bfloat16	67.52->69.32
Columbia-NLP/LION-LLaMA-3-8b-odpo-v1.0	bfloat16	58.43->68.93
Danielbrdz/Barcenas-14b-Phi-3-medium-ORPO	float16	70.17->72.03
CohereForAI/c4ai-command-r-v01	float16	66.49->68.28
Danielbrdz/Barcenas-Llama3-8b-ORPO	float16	68.25->70.10
CultriX/NeuralMona_MoE-4x7B	bfloat16	67.32->69.11
DeepMount00/Llama-3-8b-Ita	bfloat16	68.78->70.65
dominguesm/mambarim-110m	float16	14.16->18.01
eduagarcia/gemma-7b-it_no_chat_template	bfloat16	55.17->57.28
dzakwan/dzakwan-MoE-4x7b-Beta	float16	53.52->55.83
eldogbbhed/Peagle-9b	float16	51.89->53.35
EleutherAI/pythia-14m	float16	18.90->22.62
EleutherAI/pythia-70m-deduped	float16	19.37->25.59
EleutherAI/pythia-70m	float16	22.73->23.18
failspy/Meta-Llama-3-8B-Instruct-abliterated-v3	bfloat16	68.82->70.65
failspy/Phi-3-medium-4k-instruct-abliterated-v3	bfloat16	68.92->70.66
freewheelin/free-solar-evo-v0.1	float16	43.61->51.79
freewheelin/free-solar-evo-v0.11	float16	44.30->52.72
freewheelin/free-solar-evo-v0.13	float16	46.17->55.48
FuseAI/FuseChat-7B-VaRM	bfloat16	67.49->69.12
ghost-x/ghost-8b-beta	bfloat16	61.66->63.66
ghost-x/ghost-8b-beta-1608	bfloat16	60.98->62.97
google/mt5-base	bfloat16	8.87->10.16
google/mt5-small	bfloat16	0.81->0.81
GritLM/GritLM-7B-KTO	bfloat16	65.04->66.71
google/mt5-base	float16	8.89->10.25
grimjim/Llama-3-Instruct-8B-SPPO-Iter3-SimPO-merge	bfloat16	68.04->69.80
GritLM/GritLM-7B	bfloat16	65.84->67.52
HuggingFaceH4/zephyr-7b-beta	bfloat16	62.77->64.47
HuggingFaceTB/SmolLM-1.7B-Instruct	bfloat16	17.65->23.10
hkust-nlp/deita-7b-v1.0	bfloat16	64.48->66.32
HuggingFaceTB/SmolLM-135M-Instruct	bfloat16	13.00->16.02
HuggingFaceTB/SmolLM-360M-Instruct	bfloat16	17.69->21.05
ibivibiv/llama-3-nectar-dpo-8B	bfloat16	68.37->70.19
Intel/neural-chat-7b-v3-3	float16	65.34->67.07
ibivibiv/multimaster-7b-v6	bfloat16	67.35->69.20
Intel/neural-chat-7b-v3-1	float16	67.27->69.17
internlm/internlm2-chat-20b	float16	64.59->67.58
internlm/internlm2_5-1_8b	bfloat16	36.04->37.67
internlm/internlm2-chat-20b-sft	float16	59.35->64.76
internlm/internlm2_5-20b-chat	bfloat16	49.84->56.94
invalid-coder/Sakura-SOLAR-Instruct-CarbonVillain-en-10.7B-v2-slerp	float16	69.40->71.37
jeonsworld/CarbonVillain-en-10.7B-v4	bfloat16	69.35->71.31
JJhooww/Mistral_Relora_Step2k	float16	64.42->66.34
jsfs11/MixtureofMerges-MoE-4x7b-v5	bfloat16	67.71->69.53
JJhooww/Mistral_Relora_Step2k	bfloat16	64.22->66.13
jpacifico/Chocolatine-14B-Instruct-4k-DPO	float16	69.85->71.69
jsfs11/MixtureofMerges-MoE-4x7b-v4	bfloat16	67.63->69.44
Kquant03/CognitiveFusion2-4x7B-BF16	bfloat16	67.64->69.47
kekmodel/StopCarbon-10.7B-v5	float16	69.62->71.61
Kukedlc/NeuralExperiment-7b-MagicCoder-v7.5	float16	67.37->69.21
Kukedlc/NeuralSynthesis-7B-v0.1	bfloat16	67.76->69.57
Kukedlc/NeuralSynthesis-7b-v0.4-slerp	bfloat16	67.71->69.54
LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct	bfloat16	61.84->64.41
Kukedlc/NeuralSynthesis-7B-v0.3	bfloat16	67.69->69.52
liminerity/M7-7b	bfloat16	67.72->69.55
lrds-code/boana-7b-instruct	bfloat16	44.57->46.07
M4-ai/tau-0.5B	float16	26.38->27.61
lucianosb/boto-27B	bfloat16	27.47->35.78
lrds-code/samba-1.1B	bfloat16	16.89->20.51
M4-ai/tau-1.8B	bfloat16	31.82->36.40
Magpie-Align/Llama-3-8B-Magpie-Align-v0.3	bfloat16	51.26->63.60
matheusrdgsf/cesar-ptbr	GPTQ	59.22->64.04
maywell/Synatra-7B-v0.3-RP	float16	57.67->60.98
MaziyarPanahi/Llama-3-8B-Instruct-v0.8	bfloat16	68.85->70.72
MaziyarPanahi/Llama-3-8B-Instruct-v0.10	bfloat16	68.77->70.63
MaziyarPanahi/Mistral-7B-Instruct-v0.3	bfloat16	66.30->68.06
MaziyarPanahi/Calme-4x7B-MoE-v0.2	bfloat16	49.00->50.77
MaziyarPanahi/Mistral-7B-Instruct-Aya-101	bfloat16	64.63->66.49
MaziyarPanahi/Calme-4x7B-MoE-v0.1	bfloat16	49.12->50.94
MaziyarPanahi/Llama-3-8B-Instruct-v0.9	bfloat16	68.86->70.71
MaziyarPanahi/Topxtral-4x7B-v0.1	bfloat16	67.48->69.28
meraGPT/mera-mix-4x7B	bfloat16	67.72->69.52
meta-llama/Llama-2-7b-chat-hf	bfloat16	42.36->52.20
microsoft/phi-1_5	float16	28.41->29.64
microsoft/Phi-3-medium-4k-instruct	bfloat16	70.42->72.26
mistralai/Mixtral-8x7B-Instruct-v0.1	bfloat16	69.71->73.14
mistralai/Mistral-7B-Instruct-v0.2	bfloat16	64.81->66.68
mistralai/Mistral-7B-Instruct-v0.3	bfloat16	66.30->68.06
mlabonne/AlphaMonarch-7B	float16	50.16->53.62
mlabonne/Beyonder-4x7B-v3	float16	53.47->55.79
mlabonne/Monarch-7B	bfloat16	67.01->68.80
mlabonne/Llama-3-8B-Instruct-abliterated-dpomix	float16	68.53->70.35
MulaBR/Mula-4x160-v0.1	float16	26.24->26.66
mlabonne/NeuralMonarch-7B	float16	50.30->53.78
MulaBR/Mula-8x160-v0.1	float16	25.72->27.65
Nexusflow/Starling-LM-7B-beta	bfloat16	69.03->70.90
nicholasKluge/TeenyTinyLlama-160m	bfloat16	28.20->28.62
MTSAIR/multi_verse_model	bfloat16	48.69->53.95
NLPark/AnFeng_v3_Avocet	bfloat16	16.63->23.20
NousResearch/Nous-Hermes-2-Mistral-7B-DPO	bfloat16	61.73->66.75
NOVA-vision-language/GlorIA-1.3B	float16	4.10->5.44
OliveiraJLT/Sagui-7B-Instruct-v0.1	bfloat16	39.87->41.56
openchat/openchat-3.5-0106	bfloat16	68.69->70.55
openai-community/openai-gpt	float16	1.58->1.96
paulml/OGNO-7B	bfloat16	67.63->69.45
princeton-nlp/Llama-3-Instruct-8B-SimPO	bfloat16	66.43->68.31
princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2	bfloat16	55.06->68.41
Qwen/Qwen-72B-Chat	bfloat16	30.80->33.70
Qwen/Qwen1.5-0.5B	bfloat16	25.74->28.75
Qwen/Qwen-1_8B-Chat	bfloat16	37.65->39.27
Qwen/Qwen1.5-1.8B	bfloat16	30.14->32.66
Qwen/Qwen1.5-110B-Chat	bfloat16	72.74->74.67
Qwen/Qwen-1_8B-Chat	float16	36.70->38.32
Qwen/Qwen1.5-32B	bfloat16	62.88->64.32
Qwen/Qwen1.5-110B-Chat	4bit	72.51->74.41
Qwen/Qwen2-0.5B	bfloat16	27.58->30.14
Ramikan-BR/tinyllama-coder-py-4bit-v10	float16	27.68->29.62
recogna-nlp/bode-7b-alpaca-pt-br	float16	53.21->54.82
recogna-nlp/mistralbode_7b_qlora_ultraalpaca	float16	63.57->65.35
rhaymison/Mistral-8x7b-portuguese-luana	float16	66.05->71.33
rhaymison/gemma-portuguese-tom-cat-2b-it	float16	31.76->36.70
rhaymison/gemma-portuguese-2b-it	bfloat16	4.23->6.35
rhaymison/Mistral-portuguese-luana-7b-Mathematics	float16	63.60->65.41
rishiraj/CatPPT-base	float16	67.92->69.65
RLHFlow/LLaMA3-iterative-DPO-final	bfloat16	61.53->68.95
rishiraj/CatPPT	bfloat16	68.06->69.80
RubielLabarta/LogoS-7Bx2-MoE-13B-v0.2	bfloat16	67.55->69.37
royallab/ZephRP-m7b	bfloat16	63.25->64.98
rombodawg/Everyone-Coder-4x7b-Base	float16	64.12->65.78
saltlux/luxia-21.4b-alignment-v1.2	bfloat16	66.25->68.09
shadowml/BeagSake-7B	bfloat16	52.74->56.79
rhaymison/Mistral-portuguese-luana-7b-Mathematics	bfloat16	63.45->65.26
SeaLLMs/SeaLLM-7B-v2	bfloat16	66.49->68.15
saltlux/luxia-21.4b-alignment-v1.0	bfloat16	67.27->69.10
ssmits/Falcon2-5.5B-Portuguese	bfloat16	0.56->0.73
ssmits/Falcon2-5.5B-multilingual	bfloat16	0.56->0.73
state-spaces/mamba-1.4b-hf	float16	27.72->29.62
saltlux/luxia-21.4b-alignment-v1.0	float16	67.23->69.06
THUDM/chatglm3-6b	float16	50.44->56.00
teknium/OpenHermes-2-Mistral-7B	bfloat16	63.76->65.51
teknium/OpenHermes-2.5-Mistral-7B	bfloat16	64.84->66.47
TheBloke/zephyr-7B-beta-GPTQ	GPTQ	59.22->64.04
UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3	bfloat16	67.37->69.14
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2	bfloat16	58.67->65.47
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3	bfloat16	55.11->65.24
UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2	bfloat16	67.62->69.40
TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T	float16	32.28->34.30
upstage/SOLAR-10.7B-Instruct-v1.0	float16	69.47->71.44
VAGOsolutions/SauerkrautLM-Nemo-12b-Instruct	bfloat16	71.63->73.64
VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct	bfloat16	68.78->70.65
uygarkurt/llama-3-merged-linear	float16	68.73->70.59
vicgalle/CarbonBeagle-11B-truthy	float16	70.46->72.42
vicgalle/ConfigurableSOLAR-10.7B	float16	68.93->70.92
Walmart-the-bag/Misted-v2-7B	float16	66.00->67.89
vicgalle/ConfigurableBeagle-11B	float16	70.57->72.54
Walmart-the-bag/Quintellect-10.7B	float16	65.28->67.13
vicgalle/CarbonBeagle-11B	float16	69.64->71.57
Weni/WeniGPT-2.4.1-Zephyr-7B-3-epochs-GPT-QA-1.0.1_DP_DPO	float16	61.64->63.40
Weni/ZeroShot-3.4.22-Mistral-7b-DPO-1.0.0	float16	63.11->64.84
Weni/ZeroShot-3.3.34-Mistral-7b-Multilanguage-3.3.0-merged	float16	63.05->64.77
Weni/WeniGPT-Mistral-7B-instructBase	float16	39.55->44.21
Weni/WeniGPT-Mistral-7B-instructBase-4bit	float16	42.14->47.44
yunconglong/Truthful_DPO_TomGrc_FusionNet_7Bx2_MoE_13B	bfloat16	66.99->68.84
yunconglong/DARE_TIES_13B	bfloat16	66.88->68.73
xverse/XVERSE-65B	bfloat16	53.71->55.45
yunconglong/MoE_13B_DPO	bfloat16	66.95->68.79
xverse/XVERSE-13B	bfloat16	52.59->54.40
zhengr/MixTAO-7Bx2-MoE-v8.1	bfloat16	67.44->69.26

Details on which scores of each model were affected by the change can be seen in this commit: https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_requests/commit/25143f35bbad78968196e31313b68744896d6d1c
