Code contamination in HumanEval and MBPP
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Contaminated Evaluation Dataset(s):
- openai_humaneval
- mbpp
Contaminated Corpora:
- EleutherAI/pile
- bigcode/the-stack
Approach:
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination:
An example in the test data (i.e., from MBPP or HumanEval) is marked as contaminated if the aggregated similarity score is 100, i.e., a perfect match exists at the surface or semantic level. Levenshtein similarity is used to measure surface-level similarity between programs, and the Dolos toolkit, a source-code plagiarism detection tool built for educational settings, is used to measure semantic similarity between programs. A minimal illustrative sketch of the surface-level check is given below.
Note: False positives do exist even at a 100% match, either because the example is too simple and obvious, or because a program is flagged as similar to the gold program despite actually being quite different.
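
The snippet below is a hypothetical sketch of the surface-level part of this check only (the semantic check via Dolos is not reproduced here). The function names (`levenshtein_similarity`, `is_surface_contaminated`) and the exact normalization are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: flag a test program as surface-level contaminated when its
# normalized Levenshtein similarity to some corpus program reaches 100.

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity score in [0, 100]; 100 means an exact surface-level match."""
    if not a and not b:
        return 100.0
    distance = levenshtein_distance(a, b)
    return 100.0 * (1 - distance / max(len(a), len(b)))

def is_surface_contaminated(test_program: str, corpus_programs: list[str]) -> bool:
    """A test example counts as contaminated if any corpus program matches it perfectly."""
    return any(levenshtein_similarity(test_program, p) == 100.0 for p in corpus_programs)
```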
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination? Yes
url: https://arxiv.org/abs/2403.04811
@article{riddell2024quantifying,
  title={Quantifying contamination in evaluating code generation capabilities of language models},
  author={Riddell, Martin and Ni, Ansong and Cohan, Arman},
  journal={arXiv preprint arXiv:2403.04811},
  year={2024}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
Full name: Ameya Prabhu
Institution: Tübingen AI Center, University of Tübingen
Email: [email protected]
Thank you for your contribution!
Best,
Oscar