Tokenizer is incorrectly tokenizing '<|im_start|>' and '<|im_end|>' as strings
#5 · opened by Light4Bear
'<|im_start|>' and '<|im_end|>' are not marked as special tokens, so they are tokenized as ordinary strings instead of each mapping to a single special token. Round-tripping the decoded marker should return the single id, but instead it splits into six pieces:
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("models/jondurbin_bagel-34b-v0.4")
>>> tokenizer.decode([6])
'<|im_start|>'
>>> tokenizer.encode(tokenizer.decode([6]))
[59666, 59705, 622, 59593, 5858, 46826]
So was 1/4 of the training done on the wrong tokenization?
I do find the responses when using ChatML are worse than with Alpaca.