Data Contamination Assessment
Data contamination refers to the phenomenon where data intended for downstream testing appears in the training data of large language models (LLMs). This artificially inflates performance metrics on downstream tasks (such as summarization, natural language inference, and text classification), so the metrics no longer reflect the model's true generalization capability.
Since the source of data contamination lies in the training data used by LLMs, the most direct way to detect it is to match test data against training data and report the extent of overlap between the two. The classic GPT-3 paper reported such an analysis in Table C.1.
However, today's open-source community often publishes only model parameters, not training datasets. In such cases, determining whether data contamination is present, and to what extent, remains an unsolved problem. OpenCompass offers two possible solutions.
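When the training data is available, the overlap check can be sketched as follows. This is a minimal illustration, assuming word-level n-grams and exact matching; the function names, the choice of n, and the sample texts are hypothetical and are not taken from the GPT-3 paper or from OpenCompass.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_rate(test_samples: Iterable[str], train_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of test samples that share at least one n-gram with the training data."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= word_ngrams(doc, n)

    samples = list(test_samples)
    if not samples:
        return 0.0
    flagged = sum(1 for sample in samples if word_ngrams(sample, n) & train_ngrams)
    return flagged / len(samples)


# Hypothetical data: the first test sample shares an 8-gram with the training document.
train_docs = ["the committee met on tuesday to review the annual budget proposal for the city library"]
test_samples = [
    "the committee met on tuesday to review the annual report instead",
    "a farmer plants three rows of corn and two rows of beans",
]
print(overlap_rate(test_samples, train_docs, n=8))  # 0.5
```

A set of n-grams keeps the lookup cheap for small corpora; at pre-training scale a more compact index would likely be needed.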
Contamination Data Annotation Based on Self-Built Co-Distribution Data
Following the method described in Section 5.2 of the Skywork paper, we directly use the mock_gsm8k_test dataset uploaded to HuggingFace by Skywork.
In this method, the authors used GPT-4 to synthesize data in the style of the original GSM8K, and then calculated the perplexity on the GSM8K training set (train), the GSM8K test set (test), and the GSM8K reference set (ref). Since the reference set was newly generated, the authors considered it clean, that is, not part of the training set of any model. They posited:
- If the test set's perplexity is significantly lower than the reference set's, the test set might have appeared in the model's training phase;
- If the training set's perplexity is significantly lower than the test set's, the training set might have been overfitted by the model.
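The two heuristics above can be sketched as a small helper. The source does not define "significantly lower", so the `margin` threshold here is a hypothetical assumption, as are the function and key names; the input values are taken from the example output in this section.

```python
def flag_contamination(train_ppl: float, test_ppl: float, ref_ppl: float,
                       margin: float = 0.1) -> dict:
    """Apply the two perplexity heuristics with a hypothetical margin.

    `margin` is an assumed stand-in for "significantly lower"; it is not
    a value given by the Skywork authors or by OpenCompass.
    """
    return {
        # Test set much easier than freshly generated reference data.
        "test_possibly_seen": ref_ppl - test_ppl > margin,
        # Training set much easier than the test set.
        "train_possibly_overfit": test_ppl - train_ppl > margin,
    }


# Values for qwen-7b-hf and internlm-7b-hf from the example output.
print(flag_contamination(train_ppl=0.78, test_ppl=1.33, ref_ppl=1.2))
print(flag_contamination(train_ppl=1.5, test_ppl=1.56, ref_ppl=1.55))
```

With this margin, qwen-7b-hf is flagged only for possible overfitting on the training set, and internlm-7b-hf is not flagged at all. A different margin would change these outcomes, so the flags are a prompt for closer inspection rather than a verdict.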
The following configuration file can be referenced:
from mmengine.config import read_base

with read_base():
    from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets  # includes training, test, and reference sets
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model  # model under review
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model

datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
An example output is as follows:
dataset          version    metric       mode     internlm-7b-hf    qwen-7b-hf    yi-6b-hf    chatglm3-6b-base-hf    qwen-14b-hf    baichuan2-13b-base-hf    internlm-20b-hf    aquila2-34b-hf    ...
---------------  ---------  -----------  -------  ----------------  ------------  ----------  ---------------------  -------------  -----------------------  -----------------  ----------------  ...
gsm8k-train-ppl  0b8e46     average_ppl  unknown  1.5               0.78          1.37        1.16                   0.5            0.76                     1.41               0.78              ...
gsm8k-test-ppl   0b8e46     average_ppl  unknown  1.56              1.33          1.42        1.3                    1.15           1.13                     1.52               1.16              ...
gsm8k-ref-ppl    f729ba     average_ppl  unknown  1.55              1.2           1.43        1.35                   1.27           1.19                     1.47               1.35              ...
Currently, this solution only supports the GSM8K dataset. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contamination Data Annotation Based on Classic Pre-trained Sets
Thanks to Contamination_Detector and @liyucheng09 for providing this method.
In this method, the authors look up samples from test datasets (such as C-Eval, ARC, and HellaSwag) in the Common Crawl database and with the Bing search engine, then mark each test sample as clean / question contaminated / both question and answer contaminated.
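The three-way labeling can be illustrated with a simplified sketch. It assumes that contamination is decided by an exact substring match of the question and answer in retrieved page text; this criterion, the function name, and the sample pages are hypothetical, and the actual Contamination_Detector tool may use a different matching rule. The label strings follow the column names used in the OpenCompass output.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def label_sample(question: str, answer: str, retrieved_pages: List[str]) -> str:
    """Label one test sample from hypothetical search results (assumed matching rule)."""
    q, a = _normalize(question), _normalize(answer)
    label = "clean"
    for page in retrieved_pages:
        p = _normalize(page)
        if q and q in p:
            # The question appears online; check whether the answer appears with it.
            if a and a in p:
                return "input-and-label contaminated"
            label = "input contaminated"
    return label


pages = ["Quiz: What is the capital of France? Options follow."]
print(label_sample("What is the capital of France?", "Paris", pages))  # input contaminated
```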
During testing, OpenCompass reports the accuracy or perplexity of ceval on the three subsets defined by these labels. In the output, the latter two labels appear as "input contaminated" and "input-and-label contaminated". Accuracy generally increases in the order: clean, question contaminated, both question and answer contaminated. The authors believe:
- If performance on the three subsets is relatively close, the model is only lightly contaminated on that test set; otherwise, the contamination is heavy.
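A minimal sketch of this comparison follows: it computes accuracy per label subset and the gap between the best and worst subset. The function names and the toy per-sample results are hypothetical, and the source gives no threshold for what counts as "relatively close", so none is applied here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_label(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy (in percent) for each contamination label."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for label, correct in results:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}


def accuracy_spread(accuracy: Dict[str, float]) -> float:
    """Gap between the highest and lowest subset accuracy."""
    return max(accuracy.values()) - min(accuracy.values())


# Hypothetical per-sample results: (contamination label, was the prediction correct?)
results = [
    ("clean", True), ("clean", False),
    ("input contaminated", True),
    ("input-and-label contaminated", True), ("input-and-label contaminated", True),
]
print(accuracy_by_label(results))

# Overall ceval accuracies for yi-6b-hf from the example output in this section.
yi_6b = {"clean": 67.32, "input contaminated": 71.01, "input-and-label contaminated": 81.17}
print(round(accuracy_spread(yi_6b), 2))  # 13.85
```

A larger spread suggests heavier contamination under the authors' reading, but small subsets can make the per-label accuracies noisy.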
The following configuration file can be used as a reference:
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_clean_ppl import ceval_datasets  # ceval dataset with contamination tags
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model  # model under review
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
    from .summarizers.contamination import ceval_summarizer as summarizer  # output formatting

datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
An example output is as follows:
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
Currently, this solution only supports C-Eval, MMLU, HellaSwag, and ARC. Contamination_Detector also includes CSQA and WinoGrande, but these have not yet been implemented in OpenCompass. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}