Mismatching tokenizer and LLM model
Hi there,
I'm using your model and trying to decode the LLM's output with the provided tokenizer. It seems that the LLM can output a token_id larger than tokenizer.vocab_size, which causes a tokenizer.decode error.
After checking the vocab_size in both the model and the tokenizer, the cause appears to be the difference in vocab_size between config.json (51200, from this line) and tokenizer_config.json (at most 50295, from this line). The differing vocab_size values seen while debugging also point to this issue.
How can I solve this? Can it be avoided by setting some arguments, or should I modify the config file?
Thank you for taking the time to read this. Looking forward to your advice!
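For reference, a minimal check that shows the discrepancy (a sketch; assumes the transformers library and the microsoft/phi-2 checkpoint):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

print(config.vocab_size)     # 51200, the embedding size from config.json
print(tokenizer.vocab_size)  # base vocabulary only, without added tokens
print(len(tokenizer))        # base vocabulary plus added tokens, still below 51200

# Any generated ID in [len(tokenizer), 51200) has no tokenizer entry,
# which is what makes tokenizer.decode fail here.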
This is curious, as it can't be explained by added tokens. The base CodeGenTokenizer has more than 51200 tokens. Perhaps the 51200 in the model config is outdated.
It's present in the Azure repo, for the latest v2, as well.
@wassname Thank you for the information. Do you mean there's a repo that points this issue out? Could you give me a link to it? Thank you very much!
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# Add placeholder tokens so the tokenizer covers every ID below the model's
# vocab_size of 51200; add_tokens returns 943, the number of tokens added.
tokenizer.add_tokens([f'<SPL_{i}>' for i in range(0, 943)])
Adding these new tokens to the tokenizer avoids the decode error.
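After the call above, len(tokenizer) should be at least 51200 (roughly 50295 existing tokens plus 943 new ones, under the numbers quoted in this thread), so every ID the model can emit now decodes, for example:

print(len(tokenizer))             # >= 51200 after adding the placeholder tokens
print(tokenizer.decode([51199]))  # returns one of the new <SPL_...> placeholders instead of raising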
Oh, there is the Hugging Face repo and the Azure one, but they both have the same discrepancy.
Could you please provide the script which is generating those identifiers?
We ended up setting 51200 as the vocabulary size just to accommodate any new tokens we might need in the future. You can follow @Deepakvictor's answer above and it should fix the issue.
As far as I know, no tokens from 50295 onwards should be generated, because those embeddings were not trained. However, depending on the generation parameters, they could still appear (with low probability).
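For completeness, and not something suggested above: if you would rather keep the tokenizer untouched, one sketch is to block the untrained IDs at generation time with the bad_words_ids argument of generate (the prompt here is just a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# IDs from len(tokenizer) up to config.vocab_size have no tokenizer entry and
# untrained embeddings, so forbid them during generation.
untrained_ids = [[i] for i in range(len(tokenizer), model.config.vocab_size)]

inputs = tokenizer("def hello_world():", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, bad_words_ids=untrained_ids)
print(tokenizer.decode(output[0]))

This keeps the original tokenizer and config as they are and simply removes the untrained ID range from sampling.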