Tokenizer seemingly missing FIM tokens
according to the mistral reference implementation the FIM tokens should be [PREFIX]
, [MIDDLE]
, and [SUFFIX]
as can be seen here https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/base.py#L20-L22
These are apparently not in the converted tokenizer
Ok, apparently this is, intended and matches up with the official implementation, I had naively assumed that these would be separate tokens like the instruct tokens
Ok, apparently this is, intended and matches up with the official implementation, I had naively assumed that these would be separate tokens like the instruct tokens
Are you saying that the tokens are default [PREFIX], [MIDDLE], and [SUFFIX], so they don't have to show up in tokenizer config?
The mistral community has converted it too and it doesn't appear on their tokenizer config:
https://huggingface.co/mistral-community/Codestral-22B-v0.1/blob/main/tokenizer_config.json
i've updated the convert script to fix the tokenizer and re-did the HF model https://huggingface.co/legraphista/Codestral-22B-v0.1-hf-FIM-fix
Thanks! I just updated the model with the fix :)