---
license: mit
language:
- en
tags:
- babylm
- tokenizer
datasets:
- nilq/babylm-100M
---
## Baby Tokenizer
A compact SentencePiece tokenizer for sample-efficient English language modeling that simply tokenizes plain natural language text.
### Usage
#### Transformers
```py
from transformers import AutoTokenizer
tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```
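For example, the loaded tokenizer can be called directly on text (the sentence below is purely illustrative):
```py
# Illustrative usage: encode a sentence and inspect the resulting tokens
ids = tokenizer_baby("The cat sat on the mat.")["input_ids"]
print(tokenizer_baby.convert_ids_to_tokens(ids))
print(tokenizer_baby.decode(ids))
```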
#### Tokenizers
```py
from tokenizers import Tokenizer
tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```
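With the raw `tokenizers` API, `encode` returns an `Encoding` object (again, the sentence is illustrative):
```py
# Illustrative usage: the Encoding object exposes tokens and ids
encoding = tokenizer_baby.encode("The cat sat on the mat.")
print(encoding.tokens)
print(encoding.ids)
```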
### Data
This tokenizer is derived from the BabyLM 100M dataset of mixed-domain data, consisting of the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- children's books (simple written language).
### Specifications
- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 100
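
As a rough illustration of how these specifications map onto a training call, the sketch below uses the Hugging Face `tokenizers` library with a BPE model; the model type, special tokens, and corpus file name `babylm_100M.txt` are assumptions for illustration, not the exact recipe used here:
```py
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Sketch only: the released tokenizer is SentencePiece-style; a BPE model is
# used here purely to show where the specifications above would be plugged in.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=20_000,   # Vocabulary size: 20k
    limit_alphabet=150,  # Alphabet limit: 150
    min_frequency=100,   # Minimum token frequency: 100
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)

# babylm_100M.txt is a hypothetical plain-text dump of the BabyLM 100M corpus.
tokenizer.train(["babylm_100M.txt"], trainer)
tokenizer.save("baby-tokenizer.json")
```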