HF-compatible tokenizer

#2
by Xenova (HF staff) · opened

Hey there! You might be interested in my transformers-compatible version of the grok-1 tokenizer, https://huggingface.co/Xenova/grok-1-tokenizer, which can be used as follows:

```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained('Xenova/grok-1-tokenizer')
inputs = tokenizer('hello world')
```

This should simplify the example code quite a bit.

Also, feel free to copy the tokenizer.json and tokenizer_config.json into your repo, so the tokenizer can be loaded directly from your model ID.

HPC-AI Technology org

Hey @Xenova ! Thank you for your work on the transformers-compatible version of the tokenizer! We now have a link to your HuggingFace Hub and use your tokenizer in our example: https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/grok-1/inference_tp.py

This is lovely!

Hello, can this be set in tokenizer_config.json?
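If the question is about having AutoTokenizer resolve the right class automatically, a minimal sketch of a tokenizer_config.json entry could look like the following (assuming the repo also ships the matching tokenizer.json; the model_max_length value is an assumption based on Grok-1's advertised 8192-token context):

```json
{
  "tokenizer_class": "LlamaTokenizerFast",
  "model_max_length": 8192
}
```

With that field in place, `AutoTokenizer.from_pretrained(...)` should pick up LlamaTokenizerFast without importing it explicitly.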

@philschmid maybe there could be a ChatML-compatible tokenizer for Grok? 😁 πŸ™
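For what it's worth, a ChatML-style setup could in principle be expressed as a chat_template entry in tokenizer_config.json, using the Jinja templating that transformers supports. A sketch, assuming the `<|im_start|>`/`<|im_end|>` markers are added as special tokens (the base Grok-1 vocabulary does not ship them, so this is purely illustrative):

```json
{
  "chat_template": "{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
}
```

Messages passed to `tokenizer.apply_chat_template(...)` would then be rendered in the ChatML turn format.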

HPC-AI Technology org
β€’
edited Mar 27

Hey @Xenova, would you like to take a look at this PR, which includes tokenizer-related file changes, opened by @ArthurZ? I'm not sure whether it's the same as your grok-1 tokenizer or whether you two worked on it together, so I just wanted to check with you.

Hi @Jonathan1909 ! Yes, they are identical tokenizers, and have been tested on the entire xnli dataset (all languages). The HF version matches the original sentencepiece version exactly.

HPC-AI Technology org

> Hi @Jonathan1909 ! Yes, they are identical tokenizers, and have been tested on the entire xnli dataset (all languages). The HF version matches the original sentencepiece version exactly.

Thank you @Xenova! I've merged the PR and tested it. It works well!

Please close this

Xenova changed discussion status to closed
