This model has not been trained on any Cantonese material.

It is simply a base model whose embeddings and tokenizer were patched with Cantonese characters. The original model is [gpt2-tiny-chinese](https://huggingface.co/ckiplab/gpt2-tiny-chinese).
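
Loading the patched tokenizer and model works with the usual `transformers` calls. A minimal sketch; the repo id below is a placeholder, substitute the id of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-username/gpt2-tiny-cantonese" is a hypothetical repo id used only
# for illustration; replace it with the id of this repository.
repo_id = "your-username/gpt2-tiny-cantonese"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```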

I used this [repo](https://github.com/ayaka14732/bert-tokenizer-cantonese) to identify the missing Cantonese characters.

[My forked and modified version](https://github.com/jedcheng/bert-tokenizer-cantonese) of that repo is also available.
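
The gist of that step is to find every Cantonese character that the base vocabulary cannot represent. Below is a minimal sketch, not the linked repos' exact code; it assumes the base vocabulary is the `bert-base-chinese` one used by the original model:

```python
from transformers import BertTokenizerFast

# Base vocabulary of the original model (assumption: the ckiplab models
# reuse the bert-base-chinese tokenizer).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
vocab = tokenizer.get_vocab()

# Any character in a Cantonese corpus that is absent from the vocabulary
# would be tokenised as [UNK]; collect those characters.
corpus = "佢哋琴日喺屋企煮嘢食"  # stand-in for a real Cantonese corpus
missing = sorted({ch for ch in corpus if ch not in vocab})
print(missing)
```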

After identifying the missing characters, the Hugging Face `transformers` library provides a very high-level API for modifying the tokenizer and embeddings. Download a tokenizer and a model from the Hub, then:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base tokenizer and model ("your base model" is a placeholder)
tokenizer = AutoTokenizer.from_pretrained("your base model")
model = AutoModelForCausalLM.from_pretrained("your base model")

# Register the missing characters as new tokens
# (add_tokens accepts a string or a list of strings)
tokenizer.add_tokens("your new tokens")

# Grow the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

# Upload the patched tokenizer to the Hub
tokenizer.push_to_hub("your model name")
```
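
`model.push_to_hub("your model name")` uploads the resized model in the same way. Note that `resize_token_embeddings` only adds freshly initialised rows to the embedding matrix; nothing is trained, which is why this model has learned nothing about Cantonese yet.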