SuperGLUE/RealNews Contamination based on "Noise-Robust De-Duplication at Scale"
#15
by
emilys
- opened
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s): superglue
Contaminated corpora: allenai/c4
- we only look at the `realnewslike` variant
Contaminated split(s):
| Subset | Contamination |
|---|---|
| super_glue (boolq) | 0.6% |
| super_glue (cb) | 0.0% |
| super_glue (copa) | 0.0% |
| super_glue (multirc) | 1.2% |
| super_glue (record) | 7.3% |
| super_glue (rte) | 1.1% |
| super_glue (wic) | 0.0% |
| super_glue (wsc) | 0.0% |
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
We contrastively train a bi-encoder on noisy duplicates. This neural approach recovers many duplicates that rule-based methods such as hashing miss.
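The intuition can be illustrated with a toy sketch (this is not the paper's trained bi-encoder; the embeddings below are hypothetical stand-ins): exact-match hashing misses duplicates corrupted by noise such as OCR errors, while thresholded cosine similarity between dense embeddings can still flag the pair.

```python
import hashlib
import math

def hash_duplicate(a: str, b: str) -> bool:
    """Rule-based check: exact match after light normalization."""
    norm = lambda s: " ".join(s.lower().split())
    return (hashlib.sha1(norm(a).encode()).hexdigest()
            == hashlib.sha1(norm(b).encode()).hexdigest())

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

# Toy vectors standing in for bi-encoder embeddings (illustrative only).
emb = {
    "orig":  [0.90, 0.10, 0.20],
    "noisy": [0.88, 0.12, 0.21],  # near-identical: a noisy duplicate
    "other": [0.10, 0.90, 0.30],  # unrelated text
}

a = "The quick brown fox jumps over the lazy dog."
b = "The quiek brown fox jumps ovr the lazy dog."  # OCR-style noise

print(hash_duplicate(a, b))                        # hashing misses the pair
print(cosine(emb["orig"], emb["noisy"]) > 0.9)     # embeddings flag it
print(cosine(emb["orig"], emb["other"]) > 0.9)     # unrelated pair rejected
```

In the paper's setting, the embeddings come from a bi-encoder trained contrastively on known noisy duplicates, and pairs above the similarity threshold are marked as duplicates at corpus scale.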
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URLs: https://openreview.net/forum?id=bAz2DBS35i, https://arxiv.org/abs/2210.04261
Citation:
@inproceedings{silcock-etal-2023-noise,
title = "Noise-Robust De-Duplication at Scale",
author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
booktitle = "International Conference on Learning Representations (ICLR)",
year = "2023",
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full names: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
- Institution: Harvard University
- Email: [email protected], [email protected], [email protected]
OSainz changed pull request status to merged