Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 11 days ago • 94
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Paper • 2409.04599 • Published Sep 6 • 1
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published 25 days ago • 8
Toxic Commons Collection Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated 24 days ago • 2