configs do not match tokenizer vocab size
#5
by carson-together · opened
I think there is a mismatch between the tokenizer's vocab size and the `vocab_size` in config.json.
When loading the tokenizer:

```python
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Yi-34B")
```

we encounter this warning:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are converting a LlamaTokenizer to a LlamaTokenizerFast, but wrong indexes were founds when adding the `added_tokens` from the `slow` tokenizer to the `fast`. The following tokens had unexpected id :
expected id: 64000, found: 1, token: `<|startoftext|>`,
expected id: 64001, found: 2, token: `<|endoftext|>`,
. You should try using `from_slow`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
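For what it's worth, here is a minimal way to follow the warning's suggestion and check where the added special tokens actually land (a sketch; `from_slow=True` just forces the fast tokenizer to be rebuilt from the slow SentencePiece one):

```python
from transformers import AutoTokenizer

# Rebuild the fast tokenizer from the slow one, as the warning suggests.
tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Nous-Hermes-2-Yi-34B", from_slow=True
)

# Show the ids actually assigned to the added special tokens.
print(tokenizer.get_added_vocab())  # expected: {'<|startoftext|>': 64000, '<|endoftext|>': 64001}
print(len(tokenizer))               # 64002
```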
The tokenizer vocab size is 64002:

```python
len(tokenizer)
# 64002
```
but config.json reports the vocab size is 64000:

```json
  "use_cache": false,
  "vocab_size": 64000
}
```
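A quick way to see the mismatch side by side without loading the full model (a sketch, assuming nothing beyond a current transformers install):

```python
from transformers import AutoConfig, AutoTokenizer

repo = "NousResearch/Nous-Hermes-2-Yi-34B"
tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

print(len(tokenizer))     # 64002: base vocab plus the two added special tokens
print(config.vocab_size)  # 64000: the embedding size declared in config.json
```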
Yeah.
Mergekit gives this warning:

```
WARNING:root:Token '<|startoftext|>' present in /home/alpha/Models/Raw/NousResearch_Nous-Hermes-2-Yi-34B tokenizer but >= vocab_size
WARNING:root:Token '<|endoftext|>' present in /home/alpha/Models/Raw/NousResearch_Nous-Hermes-2-Yi-34B tokenizer but >= vocab_size
```
Correcting the vocab size to 64002 doesn't seem to work either.
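Editing config.json alone presumably isn't enough because the saved embedding and lm_head tensors still have 64000 rows. If the goal is a checkpoint whose config, weights, and tokenizer all agree, one option (a sketch, not tested against this repo; the output directory name is made up) is to resize the embeddings and save a corrected copy:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "NousResearch/Nous-Hermes-2-Yi-34B"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

# Grow the input embeddings and lm_head from 64000 to len(tokenizer) == 64002 rows.
# This also updates model.config.vocab_size, so config.json stays in sync on save.
model.resize_token_embeddings(len(tokenizer))

# Hypothetical output path. Note the new rows are untrained, so the
# <|startoftext|>/<|endoftext|> embeddings would still need fine-tuning.
model.save_pretrained("Nous-Hermes-2-Yi-34B-resized")
tokenizer.save_pretrained("Nous-Hermes-2-Yi-34B-resized")
```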