Code contamination in HumanEval and MBPP
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Contaminated Evaluation Dataset(s):
- openai_humaneval
- mbpp
Contaminated Corpora:
- EleutherAI/pile
- bigcode/the-stack
Approach:
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination:
An example in the test data (i.e., from MBPP or HumanEval) is marked as contaminated if the aggregated similarity score is 100, i.e., a perfect match exists at the surface or semantic level. Levenshtein similarity is used to measure surface-level similarity between programs, and the Dolos toolkit, a source-code plagiarism detection tool built for educational settings, is used to measure semantic similarity between programs. A minimal illustrative sketch of the surface-level check is given below.
Note: False positives do exist even at a 100% match, either because the example is too simple and obvious, or because a program is flagged as similar to the gold program despite actually being quite different.
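
The snippet below is a hypothetical sketch of the surface-level part of this check only (the semantic check via Dolos is not reproduced here). The function names (`levenshtein_similarity`, `is_surface_contaminated`) and the exact normalization are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: flag a test program as surface-level contaminated when its
# normalized Levenshtein similarity to some corpus program reaches 100.

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity score in [0, 100]; 100 means an exact surface-level match."""
    if not a and not b:
        return 100.0
    distance = levenshtein_distance(a, b)
    return 100.0 * (1 - distance / max(len(a), len(b)))

def is_surface_contaminated(test_program: str, corpus_programs: list[str]) -> bool:
    """A test example counts as contaminated if any corpus program matches it perfectly."""
    return any(levenshtein_similarity(test_program, p) == 100.0 for p in corpus_programs)
```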
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination? Yes
url: https://arxiv.org/abs/2403.04811
@article{riddell2024quantifying,
  title={Quantifying contamination in evaluating code generation capabilities of language models},
  author={Riddell, Martin and Ni, Ansong and Cohan, Arman},
  journal={arXiv preprint arXiv:2403.04811},
  year={2024}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
Full name: Ameya Prabhu
Institution: Tübingen AI Center, University of Tübingen
Email: [email protected]
Thank you for your contribution!
Best,
Oscar