microsoft/Phi-3-mini-4k-instruct · Update tokenizer

Apr 25

It seems eos_token_id should be <|end|> (32007) instead of <|endoftext|> (32000).
Context: https://twitter.com/altryne/status/1783567596467491109?t=k5HHVmTCGDt4-TkXF8KyNw&s=19

Update tokenizer_config.json (0984ffcb)

Trist4x

Apr 26 (edited Apr 26)

Hey, thanks for proposing this fix :)
I tried to follow the Twitter fix in the meantime, but passing eos_token_id=32007 to the generate method still resulted in endless generation on my side at eval time. The same thing happens after overriding eos_token_id to 32007 when initializing the model and tokenizer! I'm also using unk_token for padding, as advised in the repo's fine-tuning example.
Could this be related to the chat template's behaviour?
Really curious to get more feedback on this from other users.
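
For anyone following along, here is a toy sketch of why generation can run on when the token the model actually emits at the end of a turn is not in the stop set. This is not the library's implementation; `truncate_at_stop` and the token stream are made up for illustration, and only the two IDs (32000 and 32007) come from this thread.

```python
from typing import Iterable, List, Set

END_OF_TEXT_ID = 32000  # <|endoftext|>, as quoted in this thread
END_ID = 32007          # <|end|>, as quoted in this thread


def truncate_at_stop(
    token_ids: Iterable[int], stop_ids: Set[int], max_new_tokens: int
) -> List[int]:
    """Keep tokens up to and including the first stop ID, or until the budget runs out."""
    kept: List[int] = []
    for tok in token_ids:
        if len(kept) >= max_new_tokens:
            break  # budget exhausted without seeing a stop ID
        kept.append(tok)
        if tok in stop_ids:
            break  # a configured stop ID was produced
    return kept


# Hypothetical stream: the model closes its turn with <|end|>, then keeps sampling.
stream = [11, 12, 13, END_ID, 21, 22, 23, 24]

# Stop set contains only <|endoftext|>: <|end|> is ignored and the budget is used up.
only_endoftext = truncate_at_stop(stream, {END_OF_TEXT_ID}, max_new_tokens=6)

# Stop set also contains <|end|>: output ends right after the turn closes.
with_end = truncate_at_stop(stream, {END_OF_TEXT_ID, END_ID}, max_new_tokens=6)
```

If the stop set that ends up being applied at generation time does not include 32007, the first case is what you would observe, regardless of what the tokenizer reports.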

gugarosa

Microsoft org Apr 26

Please ensure that you are using the configuration defined in generation_config.json.
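
A quick way to check which stop IDs a generation config declares is to read the JSON directly. The sketch below assumes the file has an `eos_token_id` field that may be either a single integer or a list; the helper name `eos_ids_from_config` and the example JSON string are hypothetical and are not the actual contents of this repo's generation_config.json.

```python
import json
from typing import List


def eos_ids_from_config(config_text: str) -> List[int]:
    """Return eos_token_id from a generation-config JSON string as a list of ints."""
    config = json.loads(config_text)
    eos = config.get("eos_token_id")
    if eos is None:
        return []          # field not set
    if isinstance(eos, int):
        return [eos]       # single ID
    return [int(x) for x in eos]  # list of IDs


# Hypothetical file contents, for illustration only.
example = '{"eos_token_id": [32000, 32007], "pad_token_id": 32000}'

ids = eos_ids_from_config(example)
covers_end = 32007 in ids  # does the config stop on <|end|>?
```

To use it on a real file, pass the text of a locally downloaded generation_config.json instead of the example string.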

gugarosa changed pull request status to closed Apr 26