Model doesn't seem to tokenize new lines in chat template?
I noticed, when using transformers to apply the chat template, that the tokenized output prints without any newlines, replacing them with spaces instead. Any idea why?
This only seems to happen with the built-in tags: <|system|>\n gets tokenized as just '<|system|>'.
If I typo it and make it <|systemA|>\n, it tokenizes properly as '<|systemA|>\n'.
Actually, I just noticed why: <|user|>, <|end|>, <|system|>, and <|assistant|> all have rstrip = true in tokenizer_config.json. If I take that out, the newlines are kept properly. Which is correct? The chat template seems to imply there should be newlines, as do the examples on your model card.
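In case it helps anyone reproduce this, here is a small sketch that inspects those flags through transformers. The repo id is my assumption (the mini-4k instruct checkpoint); swap in whichever Phi-3 checkpoint you are using.

```python
from transformers import AutoTokenizer

# Repo id is an assumption; point this at the Phi-3 checkpoint you actually use.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Each special token is stored as an AddedToken that carries lstrip/rstrip flags.
for token_id, added in tok.added_tokens_decoder.items():
    if added.content in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
        print(f"{added.content!r} (id {token_id}): rstrip={added.rstrip}")

# rstrip=True means whitespace (including "\n") directly after the token is
# stripped at encode time, which matches the behaviour described above:
print(tok.tokenize("<|system|>\n"))   # special token: trailing newline absorbed
print(tok.tokenize("<|systemA|>\n"))  # not a special token: newline survives
```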
I agree that rstrip=true causes many odd issues with tokenization and directly conflicts with the chat_template/examples. Would love to see one or the other changed for consistency!
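Until one or the other is changed, a possible local workaround is to automate what was described above (removing or disabling rstrip by hand). This is only a sketch: it assumes the added_tokens_decoder layout that recent transformers versions write to tokenizer_config.json, and the local path is hypothetical.

```python
import json
from pathlib import Path

# Hypothetical path to a local copy of the checkpoint; adjust to your setup.
cfg_path = Path("./Phi-3-mini-4k-instruct/tokenizer_config.json")
cfg = json.loads(cfg_path.read_text())

# Turn rstrip off for the chat-template control tokens only.
for entry in cfg.get("added_tokens_decoder", {}).values():
    if entry.get("content") in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
        entry["rstrip"] = False

cfg_path.write_text(json.dumps(cfg, indent=2))
# Then load the tokenizer from the edited local directory, e.g.
# AutoTokenizer.from_pretrained("./Phi-3-mini-4k-instruct"), and the newlines
# from the chat template should survive, per the observation above.
```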
I have a similar issue.
When I feed a text block that contains newlines into the Phi-3 tokenizer, the newlines are removed after decoding. Here is an example of the text I am working with:
Input text to the tokenizer:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
After tokenizer.decode, I got this:
<|system|>
You are a helpful assistant.<|end|><|user|>
How to explain Internet for a medieval knight?<|end|><|assistant|>
Can you help me with this issue? And will it affect the performance of the model if I proceed with it as-is?
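For reference, here is a minimal sketch of the round trip described above. The repo id is my assumption; the comment only restates the rstrip behaviour discussed earlier in this thread.

```python
from transformers import AutoTokenizer

# Repo id is an assumption; point this at the Phi-3 checkpoint you are using.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

text = (
    "<|system|>\nYou are a helpful assistant.<|end|>\n"
    "<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n"
    "<|assistant|>\n"
)

ids = tok(text, add_special_tokens=False)["input_ids"]
print(tok.decode(ids))
# Compare the decoded text with `text`: newlines following the control tokens
# can be dropped, consistent with the rstrip=True flags discussed above.
```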