Contamination tests... not looking good.

#1
by fblgit - opened

Can you share the dataset used for this 30% increase in MATH score, please?
I ask because this is what the contamination test (https://arxiv.org/abs/2404.18824) shows:

5gram-Homer-v0.5-orgn-MATH-test.jsonl: 38.77333333333333
5gram-Homer-v0.5-orgn-MATH-train.jsonl: 47.16666666666667

vs

5gram-Qwen2.5-7B-Instruct-orgn-MATH-test.jsonl: 37.52666666666667
5gram-Qwen2.5-7B-Instruct-orgn-MATH-train.jsonl: 46.36666666666667

For comparison, this is what the same contamination test looks like when the MATH score is legit:

5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-test.jsonl: 37.42666666666667
5gram-UNA-cybertron-v4-qw7B-MGS-orgn-MATH-train.jsonl: 46.053333333333335
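
To make the comparison easier to read, here is a small sketch that subtracts the Qwen2.5-7B-Instruct scores from the other two models. The numbers are copied from the results above; treating Qwen2.5-7B-Instruct as the reference point and rounding to two decimals are my own choices for illustration, and `deltas_vs_baseline` is just a hypothetical helper name.

```python
# 5-gram scores exactly as reported in this thread.
scores = {
    "Homer-v0.5": {"test": 38.77333333333333, "train": 47.16666666666667},
    "Qwen2.5-7B-Instruct": {"test": 37.52666666666667, "train": 46.36666666666667},
    "UNA-cybertron-v4-qw7B-MGS": {"test": 37.42666666666667, "train": 46.053333333333335},
}

# Assumption: Qwen2.5-7B-Instruct is used as the reference model.
BASELINE = "Qwen2.5-7B-Instruct"


def deltas_vs_baseline(all_scores, baseline):
    """Return each model's score minus the baseline score, per split."""
    base = all_scores[baseline]
    return {
        model: {split: round(vals[split] - base[split], 2) for split in vals}
        for model, vals in all_scores.items()
        if model != baseline
    }


deltas = deltas_vs_baseline(scores, BASELINE)
for model, d in deltas.items():
    print(f"{model}: test {d['test']:+.2f}, train {d['train']:+.2f}")
```

A positive delta means the model reproduces more MATH 5-grams than the reference model on that split, a negative one means fewer.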

I looked into the generated samples, and it looks like re-paraphrased DPO MATH data with contaminated test items. A 30% improvement on MATH hard, while performance on all the other math tests goes down...

You can run the contamination test yourself:
https://github.com/GAIR-NLP/benbench
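
For anyone unfamiliar with the metric, this is a rough sketch of the n-gram accuracy idea as I understand it: give the model a prefix of a benchmark sample, let it predict the next n tokens, and count exact matches. It is not the benbench code. The `ngram_accuracy` function, the `predict_next` callable, the evenly spaced start positions, and the pre-tokenized input are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def ngram_accuracy(
    samples: Sequence[Sequence[int]],
    predict_next: Callable[[Sequence[int], int], Sequence[int]],
    n: int = 5,
    k: int = 5,
) -> float:
    """Percentage of n-grams the model reproduces exactly.

    samples: benchmark items, already tokenized (assumption).
    predict_next(prefix, n): hypothetical hook returning the model's
        next n tokens for the given prefix.
    k: number of start positions tried per sample (assumption).
    """
    hits = 0
    total = 0
    for tokens in samples:
        if len(tokens) <= n:
            continue  # too short to hold a prefix plus one n-gram
        last_start = len(tokens) - n
        # Up to k evenly spaced start positions, each with a non-empty prefix.
        starts = sorted(
            {1 + i * (last_start - 1) // max(k - 1, 1) for i in range(k)}
        )
        for s in starts:
            predicted = list(predict_next(tokens[:s], n))
            hits += predicted == list(tokens[s:s + n])
            total += 1
    return 100.0 * hits / total if total else 0.0
```

A model that has memorized a sample scores high on it, which is why a jump on the test split relative to the base model is suspicious.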
