add flores contamination in xP3
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.
facebook/flores
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
All models trained on bigscience/xP3:
- bigscience/bloomz
- bigscience/bloomz-560m
- bigscience/bloomz-1b1
- bigscience/bloomz-1b7
- bigscience/bloomz-3b
- bigscience/bloomz-7b1
- bigscience/mt0-small
- bigscience/mt0-base
- bigscience/mt0-large
- bigscience/mt0-xl
- bigscience/mt0-xxl
Contaminated corpora: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. CohereForAI/aya_dataset).
bigscience/xP3
Contaminated split(s): If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.
From the xP3 paper it is unclear which split is used (`dev`, `devtest`, or both). I manually checked the dataset: the data files containing FLORES all have length 997, indicating that the `dev` set, which has the same length, is the one used.
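The length-based check described above can be sketched as follows. The split sizes come from the FLORES-200 release (997 `dev` sentences, 1012 `devtest` sentences); the helper name is mine, and in practice the example counts would be read from the downloaded xP3 data files.

```python
# FLORES-200 public split sizes: a file derived from FLORES can be matched
# to a split by its example count alone, since the two sizes differ.
FLORES_SPLIT_SIZES = {"dev": 997, "devtest": 1012}


def identify_flores_split(num_examples):
    """Return the FLORES split whose size matches the observed count, or None."""
    for split, size in FLORES_SPLIT_SIZES.items():
        if num_examples == size:
            return split
    return None


# The xP3 data files containing FLORES all have 997 examples,
# which matches only the `dev` split.
print(identify_flores_split(997))   # dev
print(identify_flores_split(1012))  # devtest
```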
You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
https://arxiv.org/pdf/2304.04675 points out that "BLOOMZ is instruction-tuned with XP3 dataset (Scao et al., 2022), which includes FLORES-200 dataset." This is mentioned in the xP3 paper as well, although very little detail is provided.
Manual inspection clearly shows that the `dev` set of size 997 is included in the dataset. Even though the `devtest` split is not included, it is still undesirable to train models on the `dev` split.
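A minimal sketch of the data-based approach: normalize sentences from the evaluation split and the training corpus, then measure exact overlap. The sentence lists here are illustrative placeholders, not the actual FLORES or xP3 data, and the function names are mine.

```python
def normalize(sentence):
    """Lowercase and collapse whitespace so trivial edits don't hide overlap."""
    return " ".join(sentence.lower().split())


def contamination_rate(eval_sentences, corpus_sentences):
    """Fraction of evaluation sentences that appear verbatim in the corpus."""
    corpus_set = {normalize(s) for s in corpus_sentences}
    hits = sum(1 for s in eval_sentences if normalize(s) in corpus_set)
    return hits / len(eval_sentences)


# Placeholder data: one of the two "evaluation" sentences occurs
# (modulo casing) in the "corpus", so the rate is 0.5.
eval_dev = ["The cat sat on the mat.", "A new day begins."]
corpus = ["the cat sat on the mat.", "Unrelated training text."]
print(contamination_rate(eval_dev, corpus))  # 0.5
```

For the report above, running this kind of check on the real FLORES `dev` split against the xP3 FLORES files would yield 100% contamination of `dev`.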
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: https://aclanthology.org/2023.acl-long.891.pdf
Citation:
@inproceedings{muennighoff-etal-2023-crosslingual,
title = "Crosslingual Generalization through Multitask Finetuning",
author = "Muennighoff, Niklas and
Wang, Thomas and
Sutawika, Lintang and
Roberts, Adam and
Biderman, Stella and
Le Scao, Teven and
Bari, M Saiful and
Shen, Sheng and
Yong, Zheng Xin and
Schoelkopf, Hailey and
Tang, Xiangru and
Radev, Dragomir and
Aji, Alham Fikri and
Almubarak, Khalid and
Albanie, Samuel and
Alyafeai, Zaid and
Webson, Albert and
Raff, Edward and
Raffel, Colin",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.891",
doi = "10.18653/v1/2023.acl-long.891",
pages = "15991--16111",
    abstract = "Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at \url{https://github.com/bigscience-workshop/xmtf}.",
}
Additionally I include the paper that points out the contamination issue:
URL: https://arxiv.org/pdf/2304.04675
Citation:
@article{zhu2023multilingual,
title={Multilingual machine translation with large language models: Empirical results and analysis},
author={Zhu, Wenhao and Liu, Hongyi and Dong, Qingxiu and Xu, Jingjing and Huang, Shujian and Kong, Lingpeng and Chen, Jiajun and Li, Lei},
journal={arXiv preprint arXiv:2304.04675},
year={2023}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: David Stap
- Institution: University of Amsterdam
- Email: [email protected]
Hi @davidstap !
Thanks! I am not very familiar with the FLORES dataset, but it looks like (based on the paper) there are 2 versions: FLORES-101 and FLORES-200. What seems to be contaminated is FLORES-101, right? What I understand from "BLOOMZ is instruction-tuned with XP3 dataset (Scao et al., 2022), which includes FLORES-200 dataset" is that the training part of FLORES-200 was used for the instruction tuning, and unfortunately, it contained some (or all) examples from the development split of FLORES-101.
Am I missing something?
Oscar
Hi @OSainz , thanks for your reply!
Some clarifications:
- FLORES-101 is a subset of FLORES-200. (FLORES-200 includes an additional 99 languages.)
- FLORES-200 is contaminated: muennighoff-etal-2023-crosslingual mention FLORES-200 in their paper.
- FLORES-200 is not a training dataset, but is meant as a high-quality machine translation evaluation dataset. It has two public splits (`dev` and `devtest`, 997 and 1012 sentences, respectively) and a secret, non-public `test` set. In practice, a lot of MT papers report scores on the `devtest` portion, and some use `dev` as validation data.
- The `facebook/flores` dataset is FLORES-200.
Does that clear up the confusion?
David
Hi @davidstap , thank you for your explanation.
I see, then we should report 100% contamination of the `dev` set for both the corpus and the models. Can you add this information to the table?
Additionally, you should also add the PR number (20).
Oscar
Hi @davidstap !
Thank you again for your contribution. I made minor changes to be consistent with the rest of the entries. Now I am merging to main :)
Best,
Oscar