Wrong `eos_token` in `special_tokens_map.json`?
Was looking at the tokenizer configs and it seems that the `special_tokens_map.json` file has:
"eos_token": "<|eot_id|>"
This corresponds to token 128009. However, the 8b base, 70b base & instruct, and 405b base & instruct all have:
"eos_token": "<|end_of_text|>"
which corresponds to token 128001. Is this incorrectly set for the 8b instruct model?
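For reference, here's a minimal sketch of how I checked this (assuming the Hugging Face `transformers` library; the repo id is my assumption, substitute whichever checkpoint you're inspecting):

```python
from transformers import AutoTokenizer

# Assumed repo id; swap in the model you want to inspect.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# eos_token is read from special_tokens_map.json / tokenizer_config.json.
print(tok.eos_token)                                 # e.g. "<|eot_id|>"
print(tok.convert_tokens_to_ids(tok.eos_token))      # e.g. 128009
print(tok.convert_tokens_to_ids("<|end_of_text|>"))  # 128001
```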
Thanks!
In `config.json`:

```json
"eos_token_id": [
    128001,
    128008,
    128009
]
```
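A list here means generation can stop on any of these ids, not just the one named in `special_tokens_map.json`. A rough sketch of how that plays out at generation time (assumed repo id; `generate()` accepts a list for `eos_token_id`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tok("Hello", return_tensors="pt")
# Decoding halts on whichever of the three ids is produced first.
# (128008 is <|eom_id|> in the Llama 3.1 tokenizer, if I recall correctly.)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=[128001, 128008, 128009],
)
```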
I think it's reasonable for different models (base, instruct, chat) to have different eos tokens. The likely reason is that `<|end_of_text|>` is set as the end-of-document marker during pre-training, and that token is retained across this series of models. Additionally, during continual training, other special tokens can be added as eos tokens; for example, in conversations, a new special token can be introduced to mark the end of an utterance. A sketch of what that looks like in practice follows below.
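To illustrate that last point, here's a hedged sketch (the base-model repo id is an assumption, and `<|eot_id|>` is already in the Llama 3.1 vocab, so no embedding resize is needed) of re-pointing `eos_token` at a different special token for a fine-tuned variant:

```python
from transformers import AutoTokenizer

# Assumed base-model repo id.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# Register an existing special token as the eos token; this mapping is what
# ends up in special_tokens_map.json when the tokenizer is saved.
tok.add_special_tokens({"eos_token": "<|eot_id|>"})
tok.save_pretrained("./llama-3.1-8b-chat-tokenizer")
```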
Ok, seems like they fixed the discrepancy. The Instruct models now all have an eos token corresponding to 128009, and the base models all have 128001.