tokenize ChatML format

#2
opened by Bakanayatsu

Hi, I'm encountering a common issue where the model ends its turns with wrong variations of <|im_end|>. Because the marker is technically multiple tokens instead of one, there's a higher chance of the model emitting a malformed end-of-turn token.

<|im_end|>
Tokens: [523, 28766, 321, 28730, 416, 28766, 28767]

That's 7 tokens instead of 1, which is honestly a terrible mistake, since the model sometimes outputs wrong end-of-turn strings like: <|user|>, <||, <|\n<user>
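
(For anyone who wants to reproduce this, a minimal sketch using the transformers tokenizer; the repo id below is a placeholder, not necessarily this exact model:)

```python
# Minimal reproduction of the tokenization check; "org/model" is a
# placeholder repo id, not necessarily this exact model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model")

ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)
print(len(ids), ids)
# Without <|im_end|> registered as a special token, this prints something
# like: 7 [523, 28766, 321, 28730, 416, 28766, 28767]
```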

Thanks for raising the issue! The next model we train will have that fixed (whether as sparsetral-v3 or another architecture). I have a SPIN iteration 1 training that will be uploaded soon, but it has the same tokenization issue. (The model should still be trained to output the end token it was originally trained on for its responses, though. Do you have any examples of what causes the model to output broken formatting like that?)
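
(A rough sketch of the usual fix for a future training run: register the ChatML markers as dedicated special tokens and resize the embeddings before fine-tuning. The repo id is again a placeholder:)

```python
# Sketch of the usual pre-training fix: register the ChatML markers as
# special tokens so each maps to a single id, then resize the embeddings.
# "org/model" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/model")
model = AutoModelForCausalLM.from_pretrained("org/model")

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
if num_added:
    model.resize_token_embeddings(len(tokenizer))

# <|im_end|> now encodes to exactly one id, so the model can learn to
# emit a clean end-of-turn token during fine-tuning.
print(tokenizer.encode("<|im_end|>", add_special_tokens=False))
```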

Do you have any examples of what causes the model to output broken formatting like that?

Simply chatting turn by turn normally produces it, at 0.7 temperature. Other than that, the model is pretty amazing. It comprehends much more than the NeuralHermes I've tested, and even more than 2-bit Smaug 34B.
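
(As an inference-side workaround until a retrained model is available, a custom stopping criterion can cut generation as soon as the decoded continuation contains <|im_end|>, even though it spans several tokens. A minimal sketch, with a placeholder repo id:)

```python
# Inference-side workaround: stop generation once the decoded continuation
# contains the multi-token <|im_end|> marker. "org/model" is a placeholder.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

tokenizer = AutoTokenizer.from_pretrained("org/model")
model = AutoModelForCausalLM.from_pretrained("org/model")

class StopOnString(StoppingCriteria):
    def __init__(self, tokenizer, stop_string, prompt_len):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len  # only scan newly generated tokens

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return self.stop_string in new_text

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
criteria = StoppingCriteriaList(
    [StopOnString(tokenizer, "<|im_end|>", inputs["input_ids"].shape[1])]
)
output = model.generate(**inputs, max_new_tokens=256, stopping_criteria=criteria)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```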
