This model has not been trained on any Cantonese material.

It is simply a base model whose embeddings and tokenizer were patched with Cantonese characters. The original model is [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese).
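
Loading the patched tokenizer and model works with the usual `transformers` calls. A minimal sketch; the repo id below is a placeholder, substitute the id of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-username/gpt2-tiny-cantonese" is a hypothetical repo id used only
# for illustration; replace it with the id of this repository.
repo_id = "your-username/gpt2-tiny-cantonese"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```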

I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify the missing Cantonese characters.

[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese) of that repo is also available.
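
The gist of that step is to find every Cantonese character that the base vocabulary cannot represent. Below is a minimal sketch, not the linked repos' exact code; it assumes the base vocabulary is the `bert-base-chinese` one used by the original model:

```python
from transformers import BertTokenizerFast

# Base vocabulary of the original model (assumption: the ckiplab models
# reuse the bert-base-chinese tokenizer).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
vocab = tokenizer.get_vocab()

# Any character in a Cantonese corpus that is absent from the vocabulary
# would be tokenised as [UNK]; collect those characters.
corpus = "佢哋琴日喺屋企煮嘢食"  # stand-in for a real Cantonese corpus
missing = sorted({ch for ch in corpus if ch not in vocab})
print(missing)
```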

After identifying the missing characters, the Hugging Face `transformers` library provides a very high-level API for modifying the tokenizer and embeddings. Download a tokenizer and a model from the Hub, then:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base tokenizer and model ("your base model" is a placeholder)
tokenizer = AutoTokenizer.from_pretrained("your base model")
model = AutoModelForCausalLM.from_pretrained("your base model")

# Register the missing characters as new tokens
# (add_tokens accepts a string or a list of strings)
tokenizer.add_tokens("your new tokens")

# Grow the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

# Upload the patched tokenizer to the Hub
tokenizer.push_to_hub("your model name")
```
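
`model.push_to_hub("your model name")` uploads the resized model in the same way. Note that `resize_token_embeddings` only adds freshly initialised rows to the embedding matrix; nothing is trained, which is why this model has learned nothing about Cantonese yet.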