---
license: mit
language:
- en
tags:
- babylm
- tokenizer
datasets:
- nilq/babylm-100M
---

## Baby Tokenizer

A compact SentencePiece tokenizer for sample-efficient English language modeling of plain natural language.

### Usage

#### Transformers

```py
from transformers import AutoTokenizer

tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
```
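
The loaded tokenizer behaves like any other `transformers` tokenizer. For example (the sample sentence is arbitrary, and the exact token split depends on the trained vocabulary):

```py
# Encode a sentence to ids, inspect the tokens, and decode back to text.
ids = tokenizer_baby("The cat sat on the mat.")["input_ids"]
print(tokenizer_baby.convert_ids_to_tokens(ids))
print(tokenizer_baby.decode(ids))
```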

#### Tokenizers

```py
from tokenizers import Tokenizer

tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
```
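
With the `tokenizers` API, encoding returns an `Encoding` object exposing the tokens and ids directly; again, the exact split depends on the learned vocabulary:

```py
# Encode a sentence and inspect the resulting tokens and ids.
encoding = tokenizer_baby.encode("The cat sat on the mat.")
print(encoding.tokens)
print(encoding.ids)
```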

### Data

This tokenizer is derived from the BabyLM 100M dataset, a mixed-domain corpus drawn from the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- children's books (simple written language)

### Specifications

- Vocabulary size: 20k
- Alphabet limit: 150
- Minimum token frequency: 100
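
These settings map directly onto the training parameters of the Hugging Face `tokenizers` library. The sketch below is illustrative only, not the original training script; the corpus path and the choice of `SentencePieceBPETokenizer` are assumptions:

```py
from tokenizers import SentencePieceBPETokenizer

# Illustrative sketch: train a SentencePiece-style BPE tokenizer
# with the specifications listed above.
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=["babylm_100M.txt"],  # hypothetical path to the training corpus
    vocab_size=20_000,          # vocabulary size: 20k
    limit_alphabet=150,         # alphabet limit: 150
    min_frequency=100,          # minimum token frequency: 100
)
tokenizer.save("baby-tokenizer.json")
```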