how to add special tokens?
#21
by Moses25 - opened
I want to add tokens for other languages on top of this tokenizer. How can I do that?
This tokenizer is a tiktoken tokenizer, not a Hugging Face tokenizer. If you compare model.config.vocab_size with len(tokenizer), you'll see that the vocab sizes don't match. I'm still working out what to do about that, but it's also why tokenizer.add_special_tokens() doesn't work here: the implementation is tiktoken, so the functionality is different.
You can take a look at their tokenization_xgen.py file to see how to add tokens and such: https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenization_xgen.py
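The general pattern in tiktoken-style tokenizers (including the linked tokenization_xgen.py) is that special tokens live in a separate string-to-id dict, with ids allocated after the base BPE vocabulary. Here is a minimal, pure-Python sketch of that id-assignment scheme; the function name and numbers are illustrative, not the real tiktoken API:

```python
# Simplified sketch of how tiktoken-style tokenizers allocate special tokens.
# tiktoken keeps special tokens in a separate dict (token string -> id), with
# ids continuing *after* the base vocabulary. This helper is a hypothetical
# stand-in, not part of tiktoken or transformers.

def extend_special_tokens(base_vocab_size, existing_special, new_tokens):
    """Assign ids to new special tokens, continuing past the current vocab."""
    special = dict(existing_special)
    next_id = max([base_vocab_size - 1, *special.values()]) + 1
    for tok in new_tokens:
        if tok not in special:
            special[tok] = next_id
            next_id += 1
    return special

# Example: a base vocab of 50,257 BPE tokens plus one existing special token.
existing = {"<|endoftext|>": 50256}
special = extend_special_tokens(50257, existing, ["<lang_fr>", "<lang_de>"])
print(special)
# New ids start right after the current vocabulary:
# {'<|endoftext|>': 50256, '<lang_fr>': 50257, '<lang_de>': 50258}
```

With the real library, the equivalent step is constructing a new tiktoken.Encoding whose special_tokens dict merges the old special tokens with your new ones (tokenization_xgen.py does this internally). Note that after adding tokens this way, the model's embedding matrix still has the old size, so you would also need to resize it and update model.config.vocab_size to cover the new ids.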
Moses25 changed discussion status to closed