FremyCompany 
posted an update Apr 26
Today, April 26, is the Day of the Tatar Language! 🌟
To celebrate, we release our new language model, Tweety Tatar 🐣

https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1

The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.

We are also releasing a model that can be fine-tuned to translate English or Russian into Tatar, and that achieves performance similar to commercial offerings:

https://huggingface.co/Tweeties/tweety-tatar-hydra-base-7b-2024-v1

More details in our upcoming paper 👀
François REMY, Pieter Delobelle, Alfiya Khabibullina

Татар теле көне белән! (Happy Tatar Language Day!)

cc @IPSAN

Since you used a novel technique called trans-tokenization, did you have to pre-train the model again from scratch?

Not from scratch, as our technique preserves most of the model weights. But you do have to continue pre-training to get most of the benefits, yes. You can read more about it in our preview paper.

We are in the process of releasing a library that makes this easy to replicate, but we are not ready to share it yet.
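
To give a rough intuition for how a tokenizer can be swapped while most of the weights are kept, here is a minimal sketch. It is not the released library and not necessarily the exact method from the paper. It assumes that each new token's embedding is initialized as a weighted average of the embeddings of old tokens it has been mapped to, while every other layer of the model is left untouched. The function name `init_new_embeddings`, the toy vocabulary, and the mapping weights are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def init_new_embeddings(
    old_embeddings: Dict[str, Vector],
    token_mapping: Dict[str, Dict[str, float]],
) -> Dict[str, Vector]:
    """Build an embedding table for a new vocabulary from an old one.

    Hypothetical helper: each new token gets the weighted average of the
    embeddings of the old tokens it is mapped to. Only the embedding table
    changes; the rest of the model would be reused as-is.
    """
    dim = len(next(iter(old_embeddings.values())))
    new_embeddings: Dict[str, Vector] = {}
    for new_token, sources in token_mapping.items():
        total = sum(sources.values())  # normalize so the weights sum to 1
        vec = [0.0] * dim
        for old_token, weight in sources.items():
            old_vec = old_embeddings[old_token]
            for i in range(dim):
                vec[i] += (weight / total) * old_vec[i]
        new_embeddings[new_token] = vec
    return new_embeddings


# Toy 2-dimensional embeddings for three old (source) tokens.
old_embeddings = {"lang": [1.0, 0.0], "uage": [0.0, 1.0], "day": [2.0, 2.0]}

# Made-up mapping from new Tatar tokens to old tokens, with weights.
token_mapping = {
    "тел": {"lang": 1.0, "uage": 1.0},
    "көн": {"day": 1.0},
}

new_embeddings = init_new_embeddings(old_embeddings, token_mapping)
print(new_embeddings)
```

An initialization like this only gives the model a starting point for the new vocabulary, which is why continued pre-training is still needed to get most of the benefits.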