HF-compatible tokenizer
Hey there! You might be interested in my transformers-compatible version of the grok-1 tokenizer, https://huggingface.co/Xenova/grok-1-tokenizer, which can be used as follows:
```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained('Xenova/grok-1-tokenizer')
inputs = tokenizer('hello world')
```
This should simplify the example code quite a bit.
Also, feel free to copy the tokenizer.json and tokenizer_config.json to your repo, so you can load the tokenizer from your own model ID.
Hey @Xenova! Thank you for your work on the transformers-compatible version of the tokenizer! We now link to your Hugging Face Hub repo and use your tokenizer in our example: https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/grok-1/inference_tp.py
This is lovely!
Hello, can this be set in tokenizer_config.json?
@philschmid maybe there could be a ChatML-compatible tokenizer for Grok?
Hi @Jonathan1909 ! Yes, they are identical tokenizers, and have been tested on the entire xnli dataset (all languages). The HF version matches the original sentencepiece version exactly.
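For anyone curious what such a parity test looks like: the real check would load both the HF tokenizer (`Xenova/grok-1-tokenizer`) and the original sentencepiece model and compare their outputs over a corpus like xnli. The sketch below uses hypothetical whitespace-splitting stand-ins instead of the real tokenizers, so it runs without any downloads; only the harness structure is the point.

```python
# Minimal sketch of a tokenizer parity check, as described above.
# The stand-in tokenizers below are hypothetical placeholders; a real test
# would load Xenova/grok-1-tokenizer and the original sentencepiece model.

def find_mismatches(tokenize_a, tokenize_b, texts):
    """Return the texts on which the two tokenizers disagree."""
    return [t for t in texts if tokenize_a(t) != tokenize_b(t)]

# Stand-ins: both split on whitespace, so they always agree.
tok_a = lambda s: s.split()
tok_b = lambda s: s.split()

texts = ["hello world", "grok tokenizer parity"]
assert find_mismatches(tok_a, tok_b, texts) == []
```

In the real test one would iterate over every xnli split and language, and any non-empty mismatch list would pinpoint exactly which inputs diverge.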
Thank you @Xenova! I've merged the PR and tested it. It works well!
Please close this issue.