Align tokenizer with mistral-common

#51
by Rocketknight1 - opened
No description provided.

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# A short multi-turn conversation to tokenize with both libraries.
chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# The v3 tokenizer from mistral-common, and the HF tokenizer from this PR's revision.
mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", revision="pr/51")

# Render the chat template with the HF tokenizer, both as text and as token ids.
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Tokenize the same conversation with mistral-common.
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both checks should print True: the token ids should match exactly, and the
# rendered texts should match once SentencePiece markers are normalized.
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
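
If the token comparison prints False, it helps to see where the two sequences diverge. A minimal debugging sketch (not part of the original script) that decodes both token lists with the HF tokenizer, since the two tokenizers share the same vocabulary:

# Optional: locate and display the first mismatch between the two token sequences.
if hf_tokens != mistral_tokens:
    print("HF decode     :", hf_tokenizer.decode(hf_tokens))
    print("Mistral decode:", hf_tokenizer.decode(mistral_tokens))
    for i, (a, b) in enumerate(zip(hf_tokens, mistral_tokens)):
        if a != b:
            print(f"First divergence at position {i}: {a} vs {b}")
            break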

What about function calling here? Is function calling supported by the HF tokenizer?

Hi @patrickvonplaten, we're actually working on separate PRs for that! You can see an example here.

The plan is to first merge the PRs that align the tokenizers with mistral-common, as they're the most critical. After that, I'll rebase that PR (which is currently only a draft) and open PRs for the other Mistral models that support tool use as well.

To see an example of tool use via HF chat templates in action, check this guide. We're making PRs with updated templates for the major classes of tool-use models (e.g. NousHermes, Command-R, Mistral/Mixtral), so hopefully users will be able to use the same tool-calling code for any of those models.
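
For context, here's a rough sketch of what tool use through a chat template looks like in recent transformers versions, via the tools argument of apply_chat_template. The weather tool below is invented for illustration, and the Mistral-7B-Instruct-v0.3 template will only render tools once the separate tool-use PR mentioned above is merged:

from transformers import AutoTokenizer

# A made-up tool definition in the JSON-schema style used by chat templates.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"],
            },
        },
    }
]

chat = [{"role": "user", "content": "What's the weather in Paris?"}]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
# Recent transformers versions accept tools= and inject the definitions into
# the rendered prompt, provided the model's chat template supports them.
prompt = tokenizer.apply_chat_template(
    chat, tools=tools, tokenize=False, add_generation_prompt=True
)
print(prompt)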

Trying to test this PR but getting an OS error: pr/51 is not a valid branch name, tag name or commit id.

Hi @dolphin12, I'm not sure why - the code works here! Is it possible that you're accidentally loading from a local directory instead of the Hub?
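
If it helps to debug, one way to confirm that the PR revision exists on the Hub (and that you aren't resolving a local path) is to list the repo's refs with huggingface_hub; a quick sketch, assuming a reasonably recent huggingface_hub release:

from huggingface_hub import list_repo_refs

# Pull request revisions are exposed as refs named "refs/pr/<number>".
refs = list_repo_refs(
    "mistralai/Mistral-7B-Instruct-v0.3", include_pull_requests=True
)
print([pr.ref for pr in refs.pull_requests])  # "refs/pr/51" should appear here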

Mistral folks, can't you make sure the HF model tokenizer is updated, since only you have control over it? Please don't make it a pain for users to have to integrate your lib (mistral-common).

patrickvonplaten changed pull request status to merged