Model doesn't seem to tokenize new lines in chat template?
I noticed, when using transformers to apply the chat template, that the tokenized output prints without any newlines, replacing them with spaces instead. Any idea why?
This only seems to happen with the built-in tags: <|system|>\n gets tokenized as just '<|system|>'.
If I typo it and make it <|systemA|>\n, it tokenizes properly as '<|systemA|>\n'.
Actually, I just noticed why: <|user|>, <|end|>, <|system|>, and <|assistant|> all have rstrip = true in tokenizer_config.json. If I take that out, the newlines are kept properly. Which is correct? The chat template seems to imply there should be newlines, as do the examples on your model card.
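In case it helps anyone reproduce this, here is a small sketch that inspects those flags through transformers. The repo id is my assumption (the mini-4k instruct checkpoint); swap in whichever Phi-3 checkpoint you are using.

```python
from transformers import AutoTokenizer

# Repo id is an assumption; point this at the Phi-3 checkpoint you actually use.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Each special token is stored as an AddedToken that carries lstrip/rstrip flags.
for token_id, added in tok.added_tokens_decoder.items():
    if added.content in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
        print(f"{added.content!r} (id {token_id}): rstrip={added.rstrip}")

# rstrip=True means whitespace (including "\n") directly after the token is
# stripped at encode time, which matches the behaviour described above:
print(tok.tokenize("<|system|>\n"))   # special token: trailing newline absorbed
print(tok.tokenize("<|systemA|>\n"))  # not a special token: newline survives
```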
I agree that rstrip=true causes many odd issues with tokenization and directly conflicts with the chat_template/examples. Would love to see one or the other changed for consistency!
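Until one or the other is changed, a possible local workaround is to automate what was described above (removing or disabling rstrip by hand). This is only a sketch: it assumes the added_tokens_decoder layout that recent transformers versions write to tokenizer_config.json, and the local path is hypothetical.

```python
import json
from pathlib import Path

# Hypothetical path to a local copy of the checkpoint; adjust to your setup.
cfg_path = Path("./Phi-3-mini-4k-instruct/tokenizer_config.json")
cfg = json.loads(cfg_path.read_text())

# Turn rstrip off for the chat-template control tokens only.
for entry in cfg.get("added_tokens_decoder", {}).values():
    if entry.get("content") in ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>"):
        entry["rstrip"] = False

cfg_path.write_text(json.dumps(cfg, indent=2))
# Then load the tokenizer from the edited local directory, e.g.
# AutoTokenizer.from_pretrained("./Phi-3-mini-4k-instruct"), and the newlines
# from the chat template should survive, per the observation above.
```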
I have a similar issue.
When I feed a text block that contains newlines into the Phi-3 tokenizer, the newlines are removed after decoding. Here is an example of the text I am working with:
Input text to the tokenizer:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
After tokenizer.decode, I got this:
<|system|>
You are a helpful assistant.<|end|><|user|>
How to explain Internet for a medieval knight?<|end|><|assistant|>
Can you help me with this issue? And will it affect the performance of the model if I proceed with it as-is?
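For reference, here is a minimal sketch of the round trip described above. The repo id is my assumption; the comment only restates the rstrip behaviour discussed earlier in this thread.

```python
from transformers import AutoTokenizer

# Repo id is an assumption; point this at the Phi-3 checkpoint you are using.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

text = (
    "<|system|>\nYou are a helpful assistant.<|end|>\n"
    "<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n"
    "<|assistant|>\n"
)

ids = tok(text, add_special_tokens=False)["input_ids"]
print(tok.decode(ids))
# Compare the decoded text with `text`: newlines following the control tokens
# can be dropped, consistent with the rstrip=True flags discussed above.
```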