Catherine Arnett

catherinearnett

https://catherinearnett.github.io/

AI & ML interests

multilingual NLP, tokenization

Recent Activity

upvoted an article 11 days ago

Releasing the largest multilingual open pretraining dataset

authored a paper 19 days ago

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

authored a paper 19 days ago

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

catherinearnett's activity

upvoted an article 11 days ago

Releasing the largest multilingual open pretraining dataset

Article • 11 days ago • 94

authored 2 papers 19 days ago

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Paper • 2409.04599 • Published Sep 6 • 1

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Paper • 2311.09194 • Published Nov 15, 2023

updated a model 20 days ago

PleIAs/OCRonos-Vintage-CT2

Updated 20 days ago • 5

New activity in PleIAs/ToxicCommons 21 days ago

Link to the annotation creation scrip private

#2 opened 23 days ago by davanstrien

updated a dataset 21 days ago

PleIAs/ToxicCommons

Viewer • Updated 21 days ago • 1.96M • 92 • 6

updated a model 21 days ago

PleIAs/celadon

Text Classification • Updated 21 days ago • 274 • 16

authored a paper 24 days ago

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published 25 days ago • 8

commented on a paper 24 days ago

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published 25 days ago • 8

updated a collection 25 days ago

Toxic Commons

Collection

Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated 24 days ago • 2

published an article about 2 months ago

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

Article • Sep 27 • 35

updated 8 models about 2 months ago

catherinearnett/B-GPT_pl_en_sequential

Text Generation • Updated Sep 26 • 514

catherinearnett/B-GPT_en_pl_sequential

Text Generation • Updated Sep 26 • 512

catherinearnett/B-GPT_pl_en_simultaneous

Text Generation • Updated Sep 26 • 508

catherinearnett/B-GPT_en_pl_simultaneous

Text Generation • Updated Sep 26 • 672

catherinearnett/B-GPT_el_en_sequential

Text Generation • Updated Sep 26 • 511

catherinearnett/B-GPT_en_el_sequential

Text Generation • Updated Sep 26 • 517

catherinearnett/B-GPT_el_en_simultaneous

Text Generation • Updated Sep 26 • 516

catherinearnett/B-GPT_en_el_simultaneous

Text Generation • Updated Sep 26 • 514

Articles

Releasing the largest multilingual open pretraining dataset

Detoxifying the Commons

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??
