Question about added tokens

#7
by xzuyn - opened

You used ChatML, which uses <|im_start|> and <|im_end|>, but you only added <|im_end|> as a token.

Did you train with <|im_start|> left un-tokenized, or did you use <|start_header_id|>/another special token and forget to rename it after training?
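
(For reference, adding both markers as special tokens before training would look roughly like this; a minimal sketch with a placeholder base-model path, not the author's actual setup:)

from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register both ChatML markers so each encodes to a single ID
# instead of being split into several sub-word tokens.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
if num_added > 0:
    # Grow embed_tokens / lm_head to cover the new IDs.
    model.resize_token_embeddings(len(tokenizer))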

I've seen people train Llama-3 with ChatML without even adding <|im_end|>, but I did, since I saw improvements. This is the template used to train the model, and the one recommended to follow:

<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
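
For anyone wiring this into transformers, here is a minimal sketch of expressing that template as a tokenizer chat template (the Jinja string is my rendering of the format above, not copied from this repo's tokenizer_config.json):

tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}\n<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)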

This is also the safest way to set the EOS/terminator token IDs:

# Assumes `tokenizer` is this model's loaded tokenizer, with <|im_end|> added.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # safer to have this too
]
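
To actually use that list at inference time, something like this works (a sketch assuming `model` and `tokenizer` are already loaded and `messages` is a ChatML-style list of dicts; recent transformers versions accept a list for eos_token_id):

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,  # stop on whichever terminator appears first
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))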

Works great with ChatML.

What's your reasoning behind adding only <|im_end|> and not <|im_start|> as well? If you didn't find some quality benefit to it, you just end up wasting a lot of tokens every turn, and you're already paying the extra memory needed to train lm_head and embed_tokens (assuming you used LoRA/QLoRA, since it probably isn't different for FFT), so you may as well add <|im_start|> at the same time.

https://belladoreai.github.io/llama3-tokenizer-js/example-demo/build/
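
You can verify the same thing locally; a quick sketch (the count depends on whether the marker was registered):

ids = tokenizer.encode("<|im_start|>", add_special_tokens=False)
print(len(ids), ids)  # several sub-word tokens if un-registered, exactly one ID if added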

> What's your reasoning behind adding only <|im_end|> and not <|im_start|> as well? If you didn't find some quality benefit to it, you just end up wasting a lot of tokens every turn, and you're already paying the extra memory needed to train lm_head and embed_tokens (assuming you used LoRA/QLoRA, since it probably isn't different for FFT), so you may as well add <|im_start|> at the same time.

All of these points are true. I initially added both tokens; the score didn't change much, but it was hard to get the model to stop where it should. I guess it's one of those things where the Instruct model's template was just not easy to change, or maybe there was/is a bug in Axolotl's Llama-3 tokenization.

After a couple of tries, this unusual addition was the easiest way to use ChatML. I will investigate more to see whether, by now, it is possible to add those extra tokens and have the model generate text without any issue. (The extra tokens are an issue for me, especially in an 8K context, but generation and quality have passed my tests.)
