
Added Contamination Evidence from GPT4 Tech Report using String matching on GPT-4

#11

by AmeyaPrabhu - opened Apr 24

base: refs/heads/main

←

from: refs/pr/11


+189

-20

AmeyaPrabhu

Apr 24

•

edited Apr 25

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Contaminated Evaluation Dataset(s):

openai_humaneval
ucinlp/drop
cais/mmlu
gsm8k
ibragim-bad/arc_challenge
winogrande
BIG-bench was found to be badly contaminated, but no numbers are given, so 100% is assumed.
The GSM-8K and MATH training sets were contaminated, but no numbers are given, so 100% is assumed.

Contaminated model(s): GPT-4

Approach:

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination:

The OpenAI tech report measures cross-contamination between the evaluation datasets and the pre-training data using substring matching. Both evaluation and training data are processed by removing all spaces and symbols, keeping only characters (including numbers). For each evaluation example, they randomly select three substrings of 50 characters (or use the entire example if it is shorter than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples.
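The substring-match check described above can be sketched roughly as follows. This is only an illustration of the description, not OpenAI's actual code: the function names (`normalize`, `sample_substrings`, `find_contaminated`), the use of `str.isalnum` to decide which characters to keep, the fixed seed, and the decision not to lowercase are all assumptions made for the example.

```python
import random


def normalize(text: str) -> str:
    """Remove spaces and symbols, keeping only letters and digits.

    Assumption: "characters (including numbers)" is read as str.isalnum();
    case is left untouched because the description does not mention it.
    """
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(example: str, rng: random.Random,
                      k: int = 3, length: int = 50) -> list[str]:
    """Pick k random substrings of `length` characters from a normalized example.

    If the example is shorter than `length`, the whole example is used.
    """
    if len(example) < length:
        return [example]
    starts = [rng.randrange(len(example) - length + 1) for _ in range(k)]
    return [example[s:s + length] for s in starts]


def find_contaminated(eval_examples: list[str], training_docs: list[str],
                      seed: int = 0) -> list[int]:
    """Return indices of evaluation examples that match any training document."""
    rng = random.Random(seed)
    # Normalize the training side once, since it is reused for every example.
    processed_docs = [normalize(doc) for doc in training_docs]
    contaminated = []
    for i, example in enumerate(eval_examples):
        probe = normalize(example)
        if not probe:
            # An empty string is a substring of everything; skip it.
            continue
        substrings = sample_substrings(probe, rng)
        if any(sub in doc for doc in processed_docs for sub in substrings):
            contaminated.append(i)
    return contaminated
```

Because only three substrings are sampled per example, a partially overlapping example may or may not be flagged depending on the random draw, so a check like this gives an estimate and not an exact count. A linear scan over training documents is also only practical for toy inputs.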

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination? Yes

url: https://arxiv.org/abs/2303.08774

  title={Gpt-4 technical report},
  author={Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others},
  journal={arXiv preprint arXiv:2303.08774},
  year={2023}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Ameya Prabhu
Institution: Tübingen AI Center, University of Tübingen
Email: [email protected]

Update contamination_report.csv (501d7b66)

Update contamination_report.csv (ea371e52)

Update contamination_report.csv (5461dcb2)

AmeyaPrabhu

Apr 24

Correction: this should be filed under Contaminated Evaluation Dataset(s), not Contaminated Corpora.

Somehow missed this! Really sorry, I will iron out these minor issues in the next commits!

AmeyaPrabhu changed pull request title from Update contamination_report.csv to Added Contamination Evidence from GPT4 Tech Report using String matching on GPT-4 Apr 24

OSainz

Workshop on Data Contamination org Apr 25

Hi @AmeyaPrabhu !

I see that a few more datasets are reported in the paper, do you plan to add those too?

Best,
Oscar

AmeyaPrabhu

Apr 25

Hi Oscar,

Yes, I should add them. I recently changed my threshold from reporting only major contamination to reporting all of it. One question: should I add the non-academic benchmarks too (Tables 9 and 10 in the paper)?

OSainz

Workshop on Data Contamination org Apr 25

Are those exams available for other teams to perform comparative evaluations?

AmeyaPrabhu

Apr 25

•

edited Apr 25

Yes, most of the data sources are documented in the GPT-4 tech report, and the questions themselves are publicly available (or come from commercial textbooks). However, the evaluation methodology is unclear, as they used third-party contractors to grade them. Would it be worth the effort to add these? Otherwise, I can just add the academic benchmarks for now.

For reference: the Claude 3 models are compared with GPT-4 on a subset of these non-academic benchmarks. However, Claude 3 does not provide any evidence on contamination with the training set, which is sad.

OSainz

Workshop on Data Contamination org Apr 25

I think we can skip them for now. As you mention, the methodology is unclear.

Update contamination_report.csv (47ead683)

AmeyaPrabhu

Apr 25

Added the remaining academic benchmarks and updated the report to include them. Should be ready to merge!

Merge branch 'main' of https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report into pr/11 (ec0bb5d4)

OSainz

Workshop on Data Contamination org Apr 29

Hi @AmeyaPrabhu !

Thank you for your contribution. Merging to main.

Best,
Oscar

OSainz changed pull request status to merged Apr 29
