This model has not been trained on any Cantonese material.

It is simply the base model [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese) with its tokenizer and embeddings patched to include Cantonese characters.

I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify the missing Cantonese characters.

[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese)
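
For illustration, here is a minimal sketch of that idea, not the repo's actual code: scan some Cantonese text and collect every character the base tokenizer can only map to `[UNK]`. The corpus file name is a placeholder.

```python
from transformers import AutoTokenizer

# Sketch: find characters the base tokenizer cannot represent.
# "cantonese_corpus.txt" is a hypothetical plain-text Cantonese corpus.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-tiny-chinese")

missing = set()
with open("cantonese_corpus.txt", encoding="utf-8") as f:
    for line in f:
        for char in line.strip():
            # Characters absent from the vocabulary map to the [UNK] id
            if tokenizer.convert_tokens_to_ids(char) == tokenizer.unk_token_id:
                missing.add(char)

print(sorted(missing))  # candidates to pass to tokenizer.add_tokens()
```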

Once the missing characters have been identified, the Hugging Face library provides a very high-level API for adding them to the tokenizer and resizing the embeddings:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the base tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-tiny-chinese")
model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-tiny-chinese")

# Add the missing Cantonese characters and grow the embedding matrix to match
tokenizer.add_tokens(["your new tokens"])
model.resize_token_embeddings(len(tokenizer))

# Upload the patched tokenizer (and model) under your own repo name
tokenizer.push_to_hub("your model name")
model.push_to_hub("your model name")
```
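
Note that `resize_token_embeddings` only adds rows to the embedding matrix; the vectors for the newly added tokens are freshly (randomly) initialised, which is why this model still needs to be trained on Cantonese text before the new tokens carry any meaning.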