Update contamination_report.csv
## What are you reporting:
- [ ] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [x] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
**Evaluation dataset(s)**: Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide a link to a paper, GitHub or dataset-card.
`gsm8k`
`hendrycks/competition_math`
**Contaminated model(s)**: Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. `allenai/OLMo-7B`).
Contaminated models in `gsm8k`:
`Qwen/Qwen-1_8B`, `Qwen/Qwen-14B`
Contaminated models in `hendrycks/competition_math`:
`BAAI/Aquila2-34B`, `BAAI/Aquila2-7B`, `Qwen/Qwen-1_8B`, `Qwen/Qwen-7B`, `Qwen/Qwen-14B`, `THUDM/chatglm3-6b`, `internlm/internlm2-7b`, `internlm/internlm2-20b`
**Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace hub please write the path (e.g. `CohereForAI/aya_dataset`)
None
**Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.
The train and test splits are contaminated; the specific percentages are reported in the updated CSV file.
> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
## Briefly describe your method to detect data contamination
- [ ] Data-based approach
- [x] Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
N-gram Accuracy. Specifically, we concatenate the problem and its solution into a single text, then evenly select k (e.g., 5) starting points within this sequence. The content before each selected point serves as a prompt from which the model predicts the subsequent n-gram (e.g., a 5-gram). If the model correctly predicts all of the n-grams, this indicates that it encountered the data during training; a code sketch follows the figure below. Refer to the paper "Benchmarking Benchmark Leakage in Large Language Models" (https://arxiv.org/pdf/2404.18824).
![image.png](https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/5faGVtQG3khjLv38IpOdm.png)
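For concreteness, here is a minimal sketch of the n-gram accuracy check in Python, assuming a HuggingFace causal LM. The helper below is our own simplification for illustration, not the paper's reference implementation; the model name, `k`, `n`, and the sample text are placeholder choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def ngram_accuracy(model, tokenizer, text, k=5, n=5):
    """Fraction of k evenly spaced n-grams the model reproduces exactly
    under greedy decoding (a simplified sketch of the metric from
    arXiv:2404.18824)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    # k evenly spaced starting points, leaving room for n tokens each.
    starts = torch.linspace(1, len(ids) - n, k).long().tolist()
    hits = 0
    for s in starts:
        prompt = ids[:s].unsqueeze(0)  # everything before the checkpoint
        with torch.no_grad():
            out = model.generate(prompt, max_new_tokens=n, do_sample=False)
        # generate() returns prompt + continuation; compare the new tokens.
        hits += int(torch.equal(out[0, s:s + n], ids[s:s + n]))
    return hits / k

# Placeholder model and example; any benchmark item (problem + solution) works.
name = "Qwen/Qwen-1_8B"
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
text = "Question: ... Answer: ..."  # concatenated problem and solution
print(ngram_accuracy(model, tok, text))
```

A score near 1.0 across many benchmark items, while paraphrased versions of the same items score far lower, is the kind of gap the paper treats as evidence of leakage.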
#### Data-based approaches
Data-based approaches identify evidence of data contamination in a pre-training corpus by directly examining the dataset for instances of the evaluation data. This method involves algorithmically searching through a large pre-training dataset to find occurrences of the evaluation data. You should provide evidence of data contamination in the form: "dataset X appears in line N of corpus Y," "dataset X appears N times in corpus Y," or "N examples from dataset X appear in corpus Y."
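As a concrete illustration (not a prescribed pipeline), a data-based scan can be as simple as a normalized substring search over a corpus; the JSONL layout and the `text` field below are assumptions.

```python
import json
import re

def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so formatting noise
    # does not hide an exact overlap.
    return re.sub(r"\s+", " ", s.lower()).strip()

def count_contaminated(eval_examples, corpus_path):
    """Return how many evaluation examples occur verbatim in a
    JSONL corpus whose documents carry a 'text' field (assumed)."""
    targets = [normalize(ex) for ex in eval_examples]
    found = set()
    with open(corpus_path) as f:
        for line in f:
            doc = normalize(json.loads(line)["text"])
            for i, t in enumerate(targets):
                if i not in found and t in doc:
                    found.add(i)
    return len(found)

# Supports claims of the form "N examples from dataset X appear in corpus Y":
# count_contaminated(test_questions, "corpus_shard_000.jsonl")
```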
#### Model-based approaches
Model-based approaches, on the other hand, utilize heuristic algorithms to infer the presence of data contamination in a pre-trained model. These methods do not directly analyze the data but instead assess the model's behavior to predict data contamination. Examples include prompting the model to reproduce elements of an evaluation dataset to demonstrate memorization (e.g., https://hitz-zentroa.github.io/lm-contamination/blog/) or using perplexity measures to estimate data contamination. You should provide evidence of data contamination in the form of evaluation results of the algorithm from research papers, screenshots of model outputs that demonstrate memorization of a pre-training dataset, or any other form of evaluation that substantiates the method's effectiveness in detecting data contamination. You can provide a confidence score in your predictions.
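Likewise, for the perplexity-style heuristics mentioned above, a minimal hedged sketch: unusually low perplexity on benchmark text, relative to paraphrases of the same content, is one common memorization signal. This is a generic probe, not the method used in this report.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Per-token perplexity of `text` under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return math.exp(loss.item())

# A benchmark item scoring much lower than its paraphrase suggests the
# original phrasing was seen during training:
# perplexity(model, tok, original) << perplexity(model, tok, paraphrase)
```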
## Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: `https://arxiv.org/pdf/2404.18824`
Citation:
```bibtex
@article{xu2024benchmarking,
  title={Benchmarking Benchmark Leakage in Large Language Models},
  author={Xu, Ruijie and Wang, Zengzhi and Fan, Run-Ze and Liu, Pengfei},
  year={2024},
  journal={arXiv preprint arXiv:2404.18824},
  url={https://arxiv.org/abs/2404.18824}
}
```
*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Ruijie Xu
- Institution: Shanghai Jiao Tong University
- Email: [email protected]
- Full name: Zengzhi Wang
- Institution: Shanghai Jiao Tong University
- Email: [email protected]
- Full name: Run-Ze Fan
- Institution: Shanghai Jiao Tong University
- Email: [email protected]
- Full name: Pengfei Liu
- Institution: Shanghai Jiao Tong University
- Email: [email protected]
> Note that all listed contributors are the authors of the reference paper (Benchmarking Benchmark Leakage in Large Language Models).
- `contamination_report.csv` +13 -0

```diff
@@ -168,6 +168,9 @@ gsm8k;;EleutherAI/llemma_7b;;model;;;0.15;data-based;https://openreview.net/foru
 gsm8k;;EleutherAI/proof-pile-2;;corpus;;;0.15;data-based;https://openreview.net/forum?id=4WnqRR915j;23
 gsm8k;;GPT-4;;model;100.0;;1.0;data-based;https://arxiv.org/abs/2303.08774;11
 gsm8k;;GPT-4;;model;79.00;;;model-based;https://arxiv.org/abs/2311.06233;8
+gsm8k;;Qwen/Qwen-1_8B;;model;12.8;;0.075;model-based;https://arxiv.org/abs/2404.18824;27
+gsm8k;;Qwen/Qwen-14B;;model;0.5;;;model-based;https://arxiv.org/abs/2404.18824;27
+
 
 head_qa;en;EleutherAI/pile;;corpus;;;5.11;data-based;https://arxiv.org/abs/2310.20707;2
 head_qa;en;allenai/c4;;corpus;;;5.22;data-based;https://arxiv.org/abs/2310.20707;2
@@ -182,6 +185,16 @@ health_fact;;togethercomputer/RedPajama-Data-V2;;corpus;;;18.7;data-based;https:
 hendrycks/competition_math;;EleutherAI/llemma_34b;;model;;;7.72;data-based;https://openreview.net/forum?id=4WnqRR915j;23
 hendrycks/competition_math;;EleutherAI/llemma_7b;;model;;;7.72;data-based;https://openreview.net/forum?id=4WnqRR915j;23
 hendrycks/competition_math;;EleutherAI/proof-pile-2;;corpus;;;7.72;data-based;https://openreview.net/forum?id=4WnqRR915j;23
+hendrycks/competition_math;;BAAI/Aquila2-34B;;model;3.366;;1.166;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;BAAI/Aquila2-7B;;model;1;;0.133;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;Qwen/Qwen-1_8B;;model;4.533;;1.70;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;Qwen/Qwen-7B;;model;1.266;;0.766;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;Qwen/Qwen-14B;;model;1.766;;1.6;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;THUDM/chatglm3-6b;;model;0.70;;0.4;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;internlm/internlm2-7b;;model;3.033;;0.433;model-based;https://arxiv.org/pdf/2404.18824;27
+hendrycks/competition_math;;internlm/internlm2-20b;;model;4.733;;0.666;model-based;https://arxiv.org/pdf/2404.18824;27
+
+
 
 hlgd;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
 hlgd;;allenai/c4;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
```
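For reference, the semicolon-separated rows above can be parsed as follows; the column names are our reading of the report's layout (dataset, subset, source, version, source type, train/dev/test contamination, approach, reference, PR) and are not an official header.

```python
import csv

# Assumed column order, inferred from the rows shown in the diff above.
FIELDS = ["dataset", "subset", "source", "version", "source_type",
          "train_pct", "dev_pct", "test_pct", "approach", "reference", "pr"]

with open("contamination_report.csv", newline="") as f:
    reader = csv.DictReader(f, fieldnames=FIELDS, delimiter=";")
    for row in reader:
        # If the file carries a real header row, skip it first.
        if row["dataset"] == "gsm8k" and row["source_type"] == "model":
            print(row["source"], row["train_pct"], row["test_pct"])
```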