Wrong `eos_token` in `special_tokens_map.json`?

#22
by saakethn - opened

Was looking at the tokenizer configs and it seems that the `special_tokens_map.json` file has:

```json
"eos_token": "<|eot_id|>"
```

This corresponds to token 128009. However, the 8B base, 70B base & instruct, and 405B base & instruct all have:

```json
"eos_token": "<|end_of_text|>"
```

which corresponds to token 128001. Is this incorrectly set for the 8B instruct model?

Thanks!
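For reference, here's a quick way to check the mapping yourself. This is a minimal sketch; the `meta-llama/Meta-Llama-3.1-8B-Instruct` repo id is my assumption for the model this thread is about, and gated access may apply:

```python
from transformers import AutoTokenizer

# Assumed repo id for the 8B instruct model discussed above.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Map the special tokens back to their IDs to confirm the numbers above.
print(tok.convert_tokens_to_ids("<|eot_id|>"))       # 128009
print(tok.convert_tokens_to_ids("<|end_of_text|>"))  # 128001

# What the tokenizer currently reports as its eos token.
print(tok.eos_token, tok.eos_token_id)
```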

In `config.json`:

```json
"eos_token_id": [
  128001,
  128008,
  128009
]
```
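Since `eos_token_id` here is a list, `generate` will stop at whichever of these IDs is produced first. A minimal sketch of what that looks like (repo id assumed as above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Hello, how are you?", return_tensors="pt")
# A list of eos ids means generation halts on the first one emitted,
# whether that's <|end_of_text|>, <|eom_id|>, or <|eot_id|>.
out = model.generate(**inputs, max_new_tokens=32,
                     eos_token_id=[128001, 128008, 128009])
print(tok.decode(out[0], skip_special_tokens=False))
```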

I think it's reasonable for different models (base, instruct, chat) to have different eos tokens. The likely reason is that `<|end_of_text|>` is set as the end-of-document marker during pre-training, and that token is retained across this series of models. During continued training, additional special tokens can be designated as the eos_token; in conversations, for example, a dedicated token such as `<|eot_id|>` can mark the end of an utterance.
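To make that concrete: the instruct chat template appends `<|eot_id|>` at the end of every turn, which is why it makes a better stop token for chat than the pre-training end-of-document marker. A small sketch (repo id assumed, as before):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")  # assumed repo id
msgs = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
# Render the conversation without tokenizing to inspect the special tokens.
print(tok.apply_chat_template(msgs, tokenize=False))
# Each turn ends with <|eot_id|>; <|end_of_text|> only terminates a whole
# document during pre-training.
```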

OK, it seems like they fixed the discrepancy. The Instruct models now all have an eos token corresponding to 128009 (`<|eot_id|>`), and the base models all have 128001 (`<|end_of_text|>`).
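If anyone wants to double-check on their end, comparing the two tokenizers side by side takes a few lines (repo ids are my assumption based on the models named in this thread):

```python
from transformers import AutoTokenizer

# Assumed repo ids for the 8B pair discussed here.
for model_id in ("meta-llama/Meta-Llama-3.1-8B",
                 "meta-llama/Meta-Llama-3.1-8B-Instruct"):
    tok = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", tok.eos_token, tok.eos_token_id)
# Expected after the fix: <|end_of_text|> (128001) for base,
# <|eot_id|> (128009) for instruct.
```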

saakethn changed discussion status to closed
