From Babble to Words
The models, tokenizers, and datasets used in our submission to BabyLM 2024, investigating the viability of training LLMs on phoneme streams.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON
Character-based tokenizer trained on the BabyLM dataset for phonemic text. Word boundaries are retained.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON
BPE tokenizer trained on the BabyLM dataset for phonemic text. Word boundaries are retained.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT
Character-based tokenizer trained on the BabyLM dataset for orthographic text. Word boundaries are retained.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT
BPE tokenizer trained on the BabyLM dataset for orthographic text. Word boundaries are retained.
phonemetransformers/BABYLM-TOKENIZER-BPE-PHON-SPACELESS
BPE tokenizer trained on the BabyLM dataset for phonemic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT-SPACELESS
Character-based tokenizer trained on the BabyLM dataset for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-BPE-TXT-SPACELESS
BPE tokenizer trained on the BabyLM dataset for orthographic text. Word boundaries are removed.
phonemetransformers/BABYLM-TOKENIZER-CHAR-PHON-SPACELESS
Character-based tokenizer trained on the BabyLM dataset for phonemic text. Word boundaries are removed.
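As a rough usage sketch, the tokenizers can be loaded and compared directly. This assumes the repos are compatible with Hugging Face's AutoTokenizer; the sample sentence is illustrative, not drawn from the BabyLM corpus.

```python
# Sketch: compare segmentations from two of the tokenizers above.
# Assumes the repos load with AutoTokenizer; the sentence is an
# illustrative example, not taken from the BabyLM data.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("phonemetransformers/BABYLM-TOKENIZER-BPE-TXT")
char = AutoTokenizer.from_pretrained("phonemetransformers/BABYLM-TOKENIZER-CHAR-TXT")

sentence = "the baby babbles"
print(bpe.tokenize(sentence))   # subword pieces, word boundaries retained
print(char.tokenize(sentence))  # one token per character
```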
phonemetransformers/GPT2-85M-BPE-PHON
GPT-2 with 85M non-embedding parameters, trained with the BPE-PHON tokenizer.
phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained with the BPE-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained with the CHAR-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON
GPT-2 with 85M non-embedding parameters, trained with the CHAR-PHON tokenizer.
phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS
GPT-2 with 85M non-embedding parameters, trained with the CHAR-PHON-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-CHAR-TXT
GPT-2 with 85M non-embedding parameters, trained with the CHAR-TXT tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS
GPT-2 with 85M non-embedding parameters, trained with the BPE-TXT-SPACELESS tokenizer.
phonemetransformers/GPT2-85M-BPE-TXT
GPT-2 with 85M non-embedding parameters, trained with the BPE-TXT tokenizer.
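A minimal sketch of loading one of these checkpoints and sampling from it, assuming the model repos load with AutoModelForCausalLM and bundle their matching tokenizer (if not, load the corresponding BABYLM-TOKENIZER repo listed above):

```python
# Sketch: generate a continuation from one of the 85M GPT-2 checkpoints.
# Assumes the repo loads with AutoModelForCausalLM and ships its matching
# tokenizer; the prompt is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "phonemetransformers/GPT2-85M-BPE-TXT"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tok("the baby", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
```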