update contamination.csv
What are you reporting:
- CNN/DailyMail dataset checked against the allenai c4 dataset (no contamination found)
Evaluation dataset(s): I used the CNN/DailyMail dataset. Path to dataset is 'cnn_dailymail'.
Contaminated model(s): Not Applicable
Contaminated corpora: I have used allenai c4 dataset. Path to dataset is 'allenai/c4'.
Contaminated split(s): Test split, contamination level 0%
You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
Briefly describe your method to detect data contamination
Data-based approaches
I used a data-based approach to check whether instances of the evaluation dataset appear in the training corpus. First, I preprocessed both datasets in the same way and built an index over the training data. I then ran an exact-match search for each evaluation instance against that index and recorded any matches. After calculating and reporting the percentage of contaminated instances, I optionally checked for partial matches using n-gram overlap to identify near-duplicates.
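The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for this report: the function names (`normalize`, `build_index`, `exact_contamination`, `ngram_overlap`), the normalization rules, the SHA-1 hashing, and the n-gram size of 8 are all assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Assumed preprocessing: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def _digest(text: str) -> str:
    """Hash the normalized text so the index stores short digests, not full documents."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def build_index(corpus_docs):
    """Build a set of digests over the training corpus for constant-time lookup."""
    return {_digest(doc) for doc in corpus_docs}


def exact_contamination(eval_docs, index):
    """Percentage of evaluation documents whose normalized text is in the index."""
    eval_docs = list(eval_docs)
    if not eval_docs:
        return 0.0
    hits = sum(_digest(doc) in index for doc in eval_docs)
    return 100.0 * hits / len(eval_docs)


def ngrams(text: str, n: int = 8):
    """Set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(eval_doc: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the evaluation document's n-grams that also occur in the corpus document."""
    eval_grams = ngrams(eval_doc, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(corpus_doc, n)) / len(eval_grams)
```

For a corpus the size of c4, the in-memory set would likely need to be replaced by a streaming pass or an on-disk index; the sketch only shows the matching logic.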
Citation
URL: https://arxiv.org/abs/2310.20707
Citation: @misc{elazar2024whats, title={What's In My Big Data?}, author={Yanai Elazar and Akshita Bhagia and Ian Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hanna Hajishirzi and Noah A. Smith and Jesse Dodge}, year={2024}, eprint={2310.20707}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Suryansh Sharma
- Institution: Indian Institute of Technology Kharagpur
- Email: [email protected]