Wrong `eos_token` in `special_tokens_map.json`?
Was looking at the tokenizer configs and it seems that the `special_tokens_map.json` file has:
"eos_token": "<|eot_id|>"
This corresponds to token 128009. However, the 8b base, 70b base & instruct, and 405b base & instruct all have:
"eos_token": "<|end_of_text|>"
which corresponds to token 128001. Is this incorrectly set for the 8b instruct model?
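For reference, here's a minimal sketch of how I checked this (assuming the Hugging Face `transformers` library; the repo id is my assumption, substitute whichever checkpoint you're inspecting):

```python
from transformers import AutoTokenizer

# Assumed repo id; swap in the model you want to inspect.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# eos_token is read from special_tokens_map.json / tokenizer_config.json.
print(tok.eos_token)                                 # e.g. "<|eot_id|>"
print(tok.convert_tokens_to_ids(tok.eos_token))      # e.g. 128009
print(tok.convert_tokens_to_ids("<|end_of_text|>"))  # 128001
```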
Thanks!
In `config.json`:

```json
"eos_token_id": [
    128001,
    128008,
    128009
]
```
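A list here means generation can stop on any of these ids, not just the one named in `special_tokens_map.json`. A rough sketch of how that plays out at generation time (assumed repo id; `generate()` accepts a list for `eos_token_id`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("Hello", return_tensors="pt")
# Decoding halts on whichever of the three ids is produced first.
# (128008 is <|eom_id|> in the Llama 3.1 tokenizer, if I recall correctly.)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=[128001, 128008, 128009],
)
```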
I think it's reasonable for different models (base, instruct, chat) to have different eos tokens. The likely reason is that `<|end_of_text|>` is set as the end-of-document marker during pre-training, and that token is retained across this series of models. Additionally, during continual training, other special tokens can be added as eos tokens; for example, in conversations, a new special token can be introduced to mark the end of an utterance. A sketch of what that looks like in practice follows below.
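To illustrate that last point, here's a hedged sketch (the base-model repo id is an assumption, and `<|eot_id|>` is already in the Llama 3.1 vocab, so no embedding resize is needed) of re-pointing `eos_token` at a different special token for a fine-tuned variant:

```python
from transformers import AutoTokenizer

# Assumed base-model repo id.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Register an existing special token as the eos token; this mapping is what
# ends up in special_tokens_map.json when the tokenizer is saved.
tok.add_special_tokens({"eos_token": "<|eot_id|>"})
tok.save_pretrained("./llama-3.1-8b-chat-tokenizer")
```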
Ok, seems like they fixed the discrepancy. The Instruct models now all have an eos token corresponding to 128009, and the base models all have 128001.