Align tokenizer with mistral-common

#225
by Rocketknight1

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# A short multi-turn conversation to round-trip through both tokenizers
chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

# Render and tokenize the chat with the Hugging Face chat template
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Encode the same chat with mistral-common
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Token ids should match exactly; the rendered text differs only in
# SentencePiece markers (▁ for spaces, <0x0A> for newlines), which we normalize
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
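
If the two tokenizers are aligned, both comparisons should print True.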

Hey @Rocketknight1
Thank you for the quick fix!
It passes our assertion tests, except for the system prompt, which was a problem before this issue anyway.
The mistral-common library adds the system prompt, followed by \n\n, to the start of the first user message if one is provided.
Is there a reason why you are not adding it in the chat template in the same manner?
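
For illustration, here is a minimal sketch of that behaviour; the rendered string in the final comment is an assumption based on the v1 [INST] format, not copied from mistral-common's source:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tok = MistralTokenizer.v1()
encoded = tok.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            {"role": "system", "content": "You are a helpful bot"},
            {"role": "user", "content": "Hello"},
        ]
    )
)
print(encoded.text)
# Expected: roughly "<s>▁[INST]▁You▁are▁a▁helpful▁bot<0x0A><0x0A>Hello▁[/INST]"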

Hi @ChristianPalaArtificialy, no reason! We didn't know that was the preferred behaviour, because mistral-common didn't exist when those templates were written. If you'd like, I can amend the templates to do that.

To be clear, this means that you want something like this, correct? (obviously without destructively editing the input chat)

if messages[0]['role'] == 'system':
    # Fold the system prompt into the first user message, on a copy so the
    # caller's chat isn't mutated
    first_user = {**messages[1], 'content': messages[0]['content'] + '\n\n' + messages[1]['content']}
    messages = [first_user, *messages[2:]]
for message in messages:
    # Render the messages as normal here
    ...
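
With the chat from the test script above, that would render the first user turn as "You are a helpful bot\n\nHello".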

Also, is this the behaviour you want for models using both the v3 and v1 tokenizers?

Hi @Rocketknight1 ,
True; before the mistral-common library was published, we also had to speculate about how the system message was added during fine-tuning.
There is no change in that part of the code between v3 and v1, so yes, that would be our preferred behavior!

@ChristianPalaArtificialy I've updated the template for the v1 models! I'll work on the v3 models in a sec. Please let me know if it passes all your tests!

v3 models also updated
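
With the updated templates, the token comparison from the script above should hold even with the system message included. A quick re-check, reusing the objects defined there:

# Re-run the round-trip check now that the system prompt is folded in
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)
mistral_tokens = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
).tokens
assert hf_tokens == mistral_tokens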

Hey @Rocketknight1
\n\n is encoded as the hexadecimal byte tokens <0x0A><0x0A> by the mistral-common library, so I'm now getting failures on the text comparisons between the two methods (but that's just an artifact of how mistral-common returns the text).

The assertions on the token comparisons are all good on our end.

All PRs should be open now - I think everything's ready to merge, but let me know if there's anything else you need to do first.


Ok!

patrickvonplaten changed pull request status to merged
