Align tokenizer with mistral-common

#51
by Rocketknight1 - opened
No description provided.

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# A short multi-turn conversation to tokenize with both libraries.
chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# The v3 tokenizer from mistral-common, and the HF tokenizer from this PR's revision.
mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", revision="pr/51")

# Render the chat template with the HF tokenizer, both as text and as token ids.
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Tokenize the same conversation with mistral-common.
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both checks should print True: the token ids should match exactly, and the
# rendered texts should match once SentencePiece markers are normalized.
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
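
If the token comparison prints False, it helps to see where the two sequences diverge. A minimal debugging sketch (not part of the original script) that decodes both token lists with the HF tokenizer, since the two tokenizers share the same vocabulary:

# Optional: locate and display the first mismatch between the two token sequences.
if hf_tokens != mistral_tokens:
    print("HF decode     :", hf_tokenizer.decode(hf_tokens))
    print("Mistral decode:", hf_tokenizer.decode(mistral_tokens))
    for i, (a, b) in enumerate(zip(hf_tokens, mistral_tokens)):
        if a != b:
            print(f"First divergence at position {i}: {a} vs {b}")
            break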

What about function calling here? Is function calling supported by the HF tokenizer?

Hi @patrickvonplaten, we're actually working on separate PRs for that! You can see an example here.

The plan is to first merge the PRs that align the tokenizers with mistral-common, as they're the most critical. After that, I'll rebase that PR (which is currently only a draft) and open PRs for the other Mistral models that support tool use as well.

To see an example of tool use via HF chat templates in action, check this guide. We're making PRs with updated templates for the major classes of tool-use models (e.g. NousHermes, Command-R, Mistral/Mixtral), so hopefully users will be able to use the same tool-calling code for any of those models.
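
For context, here's a rough sketch of what tool use through a chat template looks like in recent transformers versions, via the tools argument of apply_chat_template. The weather tool below is invented for illustration, and the Mistral-7B-Instruct-v0.3 template will only render tools once the separate tool-use PR mentioned above is merged:

from transformers import AutoTokenizer

# A made-up tool definition in the JSON-schema style used by chat templates.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"],
            },
        },
    }
]

chat = [{"role": "user", "content": "What's the weather in Paris?"}]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
# Recent transformers versions accept tools= and inject the definitions into
# the rendered prompt, provided the model's chat template supports them.
prompt = tokenizer.apply_chat_template(
    chat, tools=tools, tokenize=False, add_generation_prompt=True
)
print(prompt)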

Trying to test this PR but getting an OS error: pr/51 is not a valid branch name, tag name or commit id.

Hi @dolphin12, I'm not sure why - the code works here! Is it possible that you're accidentally loading from a local directory instead of the Hub?
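
If it helps to debug, one way to confirm that the PR revision exists on the Hub (and that you aren't resolving a local path) is to list the repo's refs with huggingface_hub; a quick sketch, assuming a reasonably recent huggingface_hub release:

from huggingface_hub import list_repo_refs

# Pull request revisions are exposed as refs named "refs/pr/<number>".
refs = list_repo_refs(
    "mistralai/Mistral-7B-Instruct-v0.3", include_pull_requests=True
)
print([pr.ref for pr in refs.pull_requests])  # "refs/pr/51" should appear here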

Mistral folks, can't you make sure the HF model tokenizer is updated, since only you have control over it? Please don't make it a pain for users to have to integrate your lib (mistral-common).

patrickvonplaten changed pull request status to merged