Align tokenizer with mistral-common
#141
by Rocketknight1
No description provided.
This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# Multi-turn chat covering system, user and assistant messages
chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# mistral-common reference tokenizer and the Hub tokenizer at the revision under test
mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

# Encode the same chat with both tokenizers
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both comparisons should print True if the tokenizers are aligned
# (the mistral-common text uses SentencePiece markers, hence the replacements)
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
Thanks for the code snippet. Did you try it on a number of chat prompts to see if the two tokenizers' results are the same?
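For what it's worth, a minimal sketch of such a check, reusing the tokenizers from the snippet above; the chats below are made up purely for illustration and are not part of the PR:

# Hedged sketch: compare token IDs for a few additional illustrative chats
test_chats = [
    [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
        {"role": "user", "content": "And of Germany?"},
    ],
    [
        {"role": "system", "content": "You are a terse assistant"},
        {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
    ],
]
for test_chat in test_chats:
    hf_ids = hf_tokenizer.apply_chat_template(test_chat, tokenize=True)
    mistral_ids = mistral_tok.encode_chat_completion(
        ChatCompletionRequest(messages=test_chat)
    ).tokens
    print(hf_ids == mistral_ids)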
patrickvonplaten changed pull request status to merged
Does this code mean that MistralTokenizer and AutoTokenizer tokenize the text in exactly the same way? I'm asking because the encoded chat texts differ, but the tokens are the same for both tokenizers. In that case it seems unnecessary to use mistral_inference.generate for model inference, since inference is also based on the tokens. So I guess there is no need to use mistral_common and mistral_inference, right?
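Not an authoritative answer, but if the token IDs really do match, then in principle generation can go through transformers alone. A rough sketch, reusing chat and hf_tokenizer from the snippet above (the loading and generation parameters here are illustrative, not from the PR):

import torch
from transformers import AutoModelForCausalLM

# Load the model through transformers (dtype/device settings are just an example)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", torch_dtype=torch.float16, device_map="auto"
)

# Build the prompt with the aligned HF chat template and generate
input_ids = hf_tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(hf_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))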