jed351 committed 690d18f (parent a0545b7): Create README.md

This model has not been trained on any Cantonese material.

It is simply a base model whose tokenizer and embeddings were patched with Cantonese characters. The original model is [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese).

I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify missing Cantonese characters.

[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese) of that repo is also available.
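
The character check itself is simple to sketch. The snippet below is my illustration, not the linked repo's actual code: the tiny `vocab` set stands in for the real tokenizer's vocabulary (`tokenizer.get_vocab()`).

```python
# Sketch of the missing-character check; `vocab` is a toy stand-in
# for the real tokenizer vocabulary (assumption, not the repo's code).
def find_missing(text, vocab):
    """Return the unique characters in `text` absent from `vocab`, sorted."""
    return sorted({ch for ch in text if ch not in vocab})

vocab = {"我", "你", "好", "食"}          # toy vocabulary
print(find_missing("佢哋食咗飯", vocab))  # Cantonese-specific characters are missing
```

Running this over a large Cantonese corpus yields the list of characters that need to be added to the tokenizer.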

After identifying the missing characters, the Hugging Face `transformers` library provides a very high-level API for modifying the tokenizer and embeddings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the base tokenizer and model referenced above
tokenizer = AutoTokenizer.from_pretrained("ckiplab/gpt2-tiny-chinese")
model = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-tiny-chinese")

# Register the missing characters as new tokens, then enlarge
# the embedding matrix to match the new vocabulary size
tokenizer.add_tokens(["your", "new", "tokens"])
model.resize_token_embeddings(len(tokenizer))

# Upload the patched tokenizer and model to the Hub
tokenizer.push_to_hub("your-model-name")
model.push_to_hub("your-model-name")
```
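
For intuition, `resize_token_embeddings` keeps the existing embedding rows and appends freshly initialised rows for the added tokens. A pure-Python sketch of that behaviour (my illustration, not the library's implementation):

```python
import random

def resize_embeddings(weights, new_size):
    """Sketch of resize_token_embeddings on the embedding matrix:
    existing rows are kept, new rows are randomly initialised."""
    dim = len(weights[0])
    resized = [row[:] for row in weights[:new_size]]
    while len(resized) < new_size:
        resized.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return resized

old = [[0.1] * 4 for _ in range(10)]  # 10 tokens, embedding dim 4
new = resize_embeddings(old, 12)      # 2 tokens added by the patch
print(len(new), len(new[0]))          # 12 4
```

This is why the patched model still needs training: the new rows carry no learned information about the Cantonese characters they represent.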