From Babble to Words
The models, tokenizers and datasets used for our submission to BabyLM 2024, investigating the viability of training LLMs on phoneme streams.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON
Tokenizer trained on the BabyLM dataset, using character-based tokenization for phonemic text. Word boundaries are not removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON
Tokenizer trained on the BabyLM dataset, using BPE tokenization for phonemic text. Word boundaries are not removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT
Tokenizer trained on the BabyLM dataset, using character-based tokenization for orthographic text. Word boundaries are not removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT
Tokenizer trained on the BabyLM dataset, using BPE tokenization for orthographic text. Word boundaries are not removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON-SPACELESS
Tokenizer trained on the BabyLM dataset, using BPE tokenization for phonemic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT-SPACELESS
Tokenizer trained on the BabyLM dataset, using character-based tokenization for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT-SPACELESS
Tokenizer trained on the BabyLM dataset, using BPE tokenization for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON-SPACELESS
Tokenizer trained on the BabyLM dataset, using character-based tokenization for phonemic text. Word boundaries are removed.
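Each tokenizer can be pulled straight from the Hub for inspection. Below is a minimal sketch, assuming the repositories ship standard transformers tokenizer configs; the phoneme string is an illustrative IPA example, not a line from the corpus:

```python
from transformers import AutoTokenizer

# Character-level phonemic tokenizer (word boundaries kept).
tokenizer = AutoTokenizer.from_pretrained(
    "phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON"
)

# Illustrative IPA-style phoneme string.
text = "ðə kæt sæt ɒn ðə mæt"

ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```

Swapping in one of the -SPACELESS variants should tokenize the same string without word-boundary markers, which is the contrast this collection is built around.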
phonemetransformers/GPT2-85M-BPE-PHON
GPT-2 with 85M non-embedding parameters, trained using the BPE-PHON tokenizer.
phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the BPE-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the CHAR-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON
GPT-2 with 85M non-embedding parameters, trained using the CHAR-PHON tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the CHAR-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT
GPT-2 with 85M non-embedding parameters, trained using the CHAR-TXT tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained using the BPE-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT
GPT-2 with 85M non-embedding parameters, trained using the BPE-TXT tokenizer.
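The models load the same way. Below is a minimal sketch for scoring a phoneme string, assuming each model repo also bundles its matching tokenizer (otherwise, load the corresponding BABYLM-TOKENIZER-* repo); the exponential of the language-modeling loss gives per-token perplexity:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "phonemetransformers/GPT2-85M-BPE-PHON"  # any model from the list above
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)
model.eval()

# Score an illustrative phoneme string: the returned loss is the mean
# negative log-likelihood per token, so exp(loss) is the perplexity.
inputs = tokenizer("ðə kæt sæt ɒn ðə mæt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
print(f"per-token perplexity: {out.loss.exp().item():.2f}")
```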