tokenize ChatML format

by Bakanayatsu - opened Feb 18

Hi, I'm encountering a common issue where the model ends its turn with wrong variations of <|im_end|>. The marker is tokenized as multiple tokens instead of one, so the model has to get every piece right in sequence and has a higher chance of outputting wrong end-of-turn token(s).

<|im_end|>
Tokens: [523, 28766, 321, 28730, 416, 28766, 28767]
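To make the difference concrete, here is a toy sketch (not the real tokenizer) of a greedy longest-match tokenizer. The piece boundaries in `TOY_VOCAB` are my assumption about how the seven IDs above map onto the string, and the ID 32000 for a registered special token is purely hypothetical.

```python
# Toy vocabulary. ASSUMPTION: these piece boundaries are a guess chosen to
# mirror the seven IDs listed above; this is not the model's real vocab.
TOY_VOCAB = {"<": 523, "|": 28766, "im": 321, "_": 28730, "end": 416, ">": 28767}


def tokenize(text, vocab, special_tokens=None):
    """Greedy longest-match tokenizer.

    Registered special tokens are matched first and kept as one atomic ID;
    everything else is split into the longest vocabulary pieces available.
    """
    special_tokens = special_tokens or {}
    ids = []
    i = 0
    while i < len(text):
        # 1) Special tokens take priority over ordinary pieces.
        for token, token_id in special_tokens.items():
            if text.startswith(token, i):
                ids.append(token_id)
                i += len(token)
                break
        else:
            # 2) Otherwise take the longest ordinary piece that matches here.
            for length in range(len(text) - i, 0, -1):
                piece = text[i:i + length]
                if piece in vocab:
                    ids.append(vocab[piece])
                    i += length
                    break
            else:
                raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids


# Marker not registered: it falls apart into seven ordinary pieces.
split_ids = tokenize("<|im_end|>", TOY_VOCAB)

# Marker registered as a special token (hypothetical ID 32000): one token.
atomic_ids = tokenize("<|im_end|>", TOY_VOCAB, {"<|im_end|>": 32000})

print(len(split_ids), len(atomic_ids))
```

With the marker registered, ending a turn is a single prediction; without it, the model has to predict all seven pieces in order.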

That is 7 tokens instead of 1, which is honestly a terrible mistake, since the model sometimes outputs wrong tokens like: <|user|>, <||, <|\n<user>
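Until the marker is a single token, one possible workaround on the inference side is to stop on the decoded text rather than on a token ID. This is only a minimal sketch: `truncate_at_stop` and `STOP_STRINGS` are hypothetical names, and the malformed variants in the list are just the ones I listed above, so other broken endings would still slip through.

```python
def truncate_at_stop(text, stop_strings):
    """Cut `text` at the earliest occurrence of any stop string.

    Returns the text unchanged if none of the stop strings occur.
    """
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# The intended marker plus the malformed variants seen so far (assumed list).
STOP_STRINGS = ["<|im_end|>", "<|user|>", "<||", "<|\n<user>"]

print(truncate_at_stop("Sure, here you go.<|user|>\nnext question", STOP_STRINGS))
```

In a streaming setup the same check would run on the accumulated output after each new token, at the cost of occasionally showing a partial marker before it is trimmed.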

francislabounty

SERP AI org Feb 18

edited Feb 18

Thanks for raising the issue! The next model we train will have that fixed (as in sparsetral-v3 or another architecture). We have a spin iteration 1 model in training that will be uploaded soon, but it has the same tokenization issue. (The model should be trained to output the end token (the one the model was normally trained on) for its responses, though. Do you have any examples of what causes the model to output broken formatting like that?)

Bakanayatsu

Feb 19

edited Feb 19

> Do you have any examples of what causes the model to output broken formatting like that?

Simply chatting turn by turn normally produces it, at 0.7 temperature. Other than that, the model is pretty amazing. It comprehends much more than the neuralhermes I've tested, and even more than a 2-bit Smaug 34b.
