Commit f319d91 breaks eos_token_id

#9
by tomer - opened

Hi,
After the last commit, eos_token_id == vocab_size. Valid token ids run from 0 to vocab_size - 1, so the EOS id is one past the last row and indexing the embedding matrix with it goes out of range.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True,
                                         revision="f319d912c0c73ea3682094202b209ac8cb5d4cba")
print(tokenizer.eos_token, tokenizer.eos_token_id, tokenizer.vocab_size)

prints:

<|endoftext|> 51200 51200

Yeah. I have the same issue, quite frustrating.

My workaround is to manually change eos_token_id to 50256, which is the same EOS token.
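
A minimal sketch of that workaround, in case it helps others. The helper name `safe_eos_token_id` is hypothetical (not part of transformers), and it assumes the model's embedding matrix has exactly `vocab_size` rows, as the original post implies. The fallback id 50256 is taken from this thread, not verified independently.

```python
def safe_eos_token_id(eos_token_id: int, vocab_size: int, fallback_id: int = 50256) -> int:
    """Return an EOS id that is a valid row index for an embedding with vocab_size rows.

    Hypothetical helper: falls back to fallback_id (50256, the value suggested
    in this thread) when the reported eos_token_id is out of range.
    """
    # Valid ids are 0 .. vocab_size - 1, so eos_token_id == vocab_size is out of range.
    if 0 <= eos_token_id < vocab_size:
        return eos_token_id
    # The fallback itself must also be a valid index.
    if not 0 <= fallback_id < vocab_size:
        raise ValueError(f"fallback_id {fallback_id} is outside [0, {vocab_size})")
    return fallback_id


# Values reported in the original post for revision f319d91.
eos_id = safe_eos_token_id(eos_token_id=51200, vocab_size=51200)
print(eos_id)
```

With a loaded tokenizer, the call would look like `safe_eos_token_id(tokenizer.eos_token_id, tokenizer.vocab_size)`, and the result can be passed explicitly at generation time, e.g. `model.generate(**inputs, eos_token_id=eos_id)`, rather than relying on the tokenizer's default.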

I still have this issue.

@allanjie check out Deci/DeciCoder-6B, it patches this bug and uses the same tokenizer as Salesforce/codegen25-7b-multi_P
