Align tokenizer with mistral-common
This PR aligns the Hugging Face tokenizer with the tokenization in mistral-common. You can test the alignment with the following script:
```python
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", revision="pr/51"
)

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encoded = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encoded.text
mistral_tokens = mistral_encoded.tokens

# Both checks should print True when the tokenizers are aligned
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
```
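If the token comparison prints `False`, it helps to know where the two sequences first diverge. A small helper like the following (purely illustrative, not part of either library) can pinpoint the offending position:

```python
def first_mismatch(a, b):
    """Return the index of the first position where two token lists differ,
    or None if they are identical. If one list is a prefix of the other,
    the mismatch index is the length of the shorter list."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Example: sequences agree up to index 2, then differ
print(first_mismatch([1, 2, 3], [1, 2, 4]))  # -> 2
```

You can then decode a window around that index with both tokenizers to see which token was produced differently.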
What about function calling here? Is function calling supported by the HF tokenizer?
Hi @patrickvonplaten, we're actually working on separate PRs for that! You can see an example here.
The plan is to first merge the PRs that align the tokenizers with mistral-common, as they're the most critical. After that, I'll rebase that PR (which is currently only a draft) and open PRs for the other Mistral models that support tool use as well.
To see an example of tool use via HF chat templates in action, check this guide. We're making PRs with updated templates for major classes of tool use models (e.g. NousHermes, Command-R, Mistral/Mixtral), so hopefully users will be able to use the same tool-calling code for any of those models.
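For context, tool definitions in the HF chat-template workflow are passed as JSON-schema-style dictionaries. The snippet below is a hedged sketch of that shape (the function name and parameters are made up for illustration; the exact fields accepted will depend on the template that each PR ships):

```python
# Illustrative tool definition in the JSON-schema convention used by
# HF chat templates; "get_current_weather" and its parameters are
# hypothetical examples, not part of any shipped template.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. Paris",
                    },
                },
                "required": ["location"],
            },
        },
    },
]
```

Once a tool-use template is in place, a list like this would be supplied alongside the chat messages so the template can render the tool schemas into the prompt.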
Trying to test this PR but getting an OSError: `pr/51` is not a valid branch name, tag name, or commit id.
Hi @dolphin12, I'm not sure why; the code works for me! Is it possible that you're accidentally loading from a local directory instead of the Hub?
Mistral folks, can't you make sure the HF model tokenizer is updated, since only you have control over it? Please don't make it a pain for users by forcing them to integrate your library (mistral-common).