lbourdois 
posted an update Feb 1

I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.

It's the exchanges I've had with you that have led me to question the quality of the data 🤗

On which desk in the Paris office should I leave a post-it note asking for the creation of the bot?

Pretty cool stuff! Maybe you should do a leaderboard of major datasets and their leakage score

A little glossary would be nice, I'm not even sure what NER is or what a "leak" means.

For NER (Named Entity Recognition), you can consult https://huggingface.co/tasks/token-classification.
A leak is when data of the train split is found in the test split, biasing the results and benchmarks.
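
To make the definition concrete, here is a minimal sketch of how such a leak could be spotted. It assumes a leak means an exact match after light normalization (lowercasing and collapsing whitespace); the function names `find_leaks` and `leakage_rate` and the toy sentences are hypothetical, and this is not necessarily how the figures in the post were computed.

```python
def _normalize(text):
    # Assumption: lowercase and collapse whitespace so trivial variants still match.
    return " ".join(text.lower().split())


def find_leaks(train_texts, test_texts):
    """Return the test examples that also appear in the train split."""
    train_set = {_normalize(t) for t in train_texts}
    return [t for t in test_texts if _normalize(t) in train_set]


def leakage_rate(train_texts, test_texts):
    """Fraction of test examples that are also present in the train split."""
    if not test_texts:
        return 0.0
    return len(find_leaks(train_texts, test_texts)) / len(test_texts)


# Toy data, for illustration only.
train = ["Paris is in France.", "Hugging Face is based in New York."]
test = ["paris is  in France.", "Berlin is in Germany."]

print(find_leaks(train, test))    # ['paris is  in France.']
print(leakage_rate(train, test))  # 0.5
```

A model evaluated on this toy test split has already seen half of it during training, so its score would look better than it really is.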