Contamination tests... not looking good.
#1, opened by fblgit
Can you share the dataset used to get this 30% increase in the MATH score, please?
I ask because the contamination test (https://arxiv.org/abs/2404.18824) shows:
5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667
vs
5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667
For comparison, this is what the same contamination test looks like when the MATH score is legit:
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
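To put the numbers above side by side, here is a small sketch that computes each model's difference from Qwen2.5-7B-Instruct. Treating Qwen2.5-7B-Instruct as the baseline is my assumption, the model names are just read off the result filenames, and `deltas_vs_baseline` is a hypothetical helper, not part of benbench.

```python
# 5-gram scores copied from the result files listed above.
scores = {
    "Homer-v0.5": {"test": 38.77333333333333, "train": 47.16666666666667},
    "Qwen2.5-7B-Instruct": {"test": 37.52666666666667, "train": 46.36666666666667},
    "UNA-cybertron-v4-qw7B-MGS": {"test": 37.42666666666667, "train": 46.053333333333335},
}


def deltas_vs_baseline(scores, baseline):
    """Return score differences (model minus baseline) for every other model."""
    base = scores[baseline]
    return {
        model: {split: value - base[split] for split, value in splits.items()}
        for model, splits in scores.items()
        if model != baseline
    }


# Assumption: Qwen2.5-7B-Instruct is the reference point for both fine-tunes.
deltas = deltas_vs_baseline(scores, "Qwen2.5-7B-Instruct")
for model, d in deltas.items():
    print(f"{model}: test {d['test']:+.2f}, train {d['train']:+.2f}")
# Homer-v0.5: test +1.25, train +0.80
# UNA-cybertron-v4-qw7B-MGS: test -0.10, train -0.31
```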
I looked into the generated samples, and it looks like re-paraphrased DPO MATH data contaminated with test samples. That would explain a 30% improvement on MATH hard while mathematical performance decreases on all the other tests...
You can run the contamination test on your own:
https://github.com/GAIR-NLP/benbench
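For anyone wondering what the 5gram numbers measure, here is a rough sketch of the n-gram accuracy idea as I understand it from the paper: give the model a prefix of a benchmark sample, let it predict the next n tokens, and count exact matches. This is not benbench's code. `ngram_accuracy`, the `predict_next` callable, and the evenly spaced start positions are all assumptions for illustration; use the repo for real measurements.

```python
from typing import Callable, Iterable, Sequence


def ngram_accuracy(
    samples: Iterable[Sequence[str]],
    predict_next: Callable[[Sequence[str], int], Sequence[str]],
    n: int = 5,
    k: int = 5,
) -> float:
    """Percentage of probed positions where the predicted n-gram matches exactly.

    samples:      tokenized benchmark items (problem plus solution as one sequence).
    predict_next: hypothetical stand-in for the model; takes (prefix, n) and
                  returns its next n tokens.
    k:            number of start positions probed per sample (assumed evenly spaced).
    """
    hits = 0
    total = 0
    for tokens in samples:
        if len(tokens) <= n:
            continue  # too short to leave a prefix and a full n-gram
        last_start = len(tokens) - n
        # Evenly spaced starts in [1, last_start]; the set drops duplicates.
        starts = sorted({1 + (i * (last_start - 1)) // max(k - 1, 1) for i in range(k)})
        for s in starts:
            total += 1
            if list(predict_next(tokens[:s], n)) == list(tokens[s:s + n]):
                hits += 1
    return 100.0 * hits / total if total else 0.0


# Toy check: a "model" that memorized the sample vs. one that knows nothing.
sample = "the quick brown fox jumps over the lazy dog again and again".split()
memorized = lambda prefix, n: sample[len(prefix):len(prefix) + n]
clueless = lambda prefix, n: ["?"] * n
print(ngram_accuracy([sample], memorized), ngram_accuracy([sample], clueless))
```

A model that has seen the test set verbatim scores high here, so the interesting signal is how a fine-tune's score moves relative to its base model on the same files.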