Model Card for Model ID
This is a tokenizer for the Azerbaijani translation of the TinyStories dataset.
Model Details
Model Description
This tokenizer was trained on a version of the TinyStories dataset translated into Azerbaijani. It is a byte-fallback BPE tokenizer, trained with parameters similar to those of the Mistral tokenizer. As in SentencePiece, the "▁" marker (U+2581, which looks like an underscore "_") marks the beginning of a word in the sub-word pieces.
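To illustrate the word-boundary marker convention described above, here is a minimal pure-Python sketch. It is not the code used to build this tokenizer; the function names `metaspace_pretokenize` and `metaspace_decode` are hypothetical and only mimic the general idea of replacing spaces with "▁" so that each word-level piece starts with the marker.

```python
MARKER = "\u2581"  # "▁", the SentencePiece-style word-boundary marker


def metaspace_pretokenize(text: str) -> list[str]:
    """Split text into pieces so that every piece starts with the marker.

    Illustrative only: real pre-tokenizers have more options than this.
    """
    # Prefix the text with the marker and turn every space into the marker.
    marked = MARKER + text.replace(" ", MARKER)
    pieces: list[str] = []
    for ch in marked:
        if ch == MARKER or not pieces:
            pieces.append(ch)   # a marker opens a new piece
        else:
            pieces[-1] += ch    # any other character extends the current piece
    return pieces


def metaspace_decode(pieces: list[str]) -> str:
    """Invert metaspace_pretokenize: markers become spaces, the prefix is dropped."""
    return "".join(pieces).replace(MARKER, " ")[1:]


pieces = metaspace_pretokenize("Salam dünya")
print(pieces)                    # ['▁Salam', '▁dünya']
print(metaspace_decode(pieces))  # Salam dünya
```

Because the space is carried inside the pieces themselves, decoding is a plain concatenation followed by a replacement, with no need to guess where spaces were.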
- Developed by: Javidan Aslanli
- Language(s) (NLP): Azerbaijani
- License: Apache License 2.0
Training Details
Training Data
The TinyStories dataset, translated into Azerbaijani.
Training
This is a byte-fallback BPE tokenizer, configured as follows:
- The normalizers are the same as those in Mistral's tokenizer.
- I applied the Metaspace pre-tokenizer before training BPE.
- For training, I used the byte-fallback trick; the other parameters are the same as Mistral's.
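The byte-fallback idea mentioned in the list above can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions, not the training code for this tokenizer: the toy vocabulary and the function names `encode_with_byte_fallback` and `decode_with_byte_fallback` are hypothetical, and the BPE merge step is left out. Any symbol missing from the vocabulary is written as its UTF-8 bytes, using tokens of the form `<0xXX>`, instead of being collapsed into an unknown token.

```python
import re

# Byte tokens look like <0x00> ... <0xFF> (two uppercase hex digits).
BYTE_TOKEN = re.compile(r"^<0x([0-9A-F]{2})>$")


def encode_with_byte_fallback(symbols, vocab):
    """Keep symbols found in vocab; spell out all others as UTF-8 byte tokens."""
    tokens = []
    for sym in symbols:
        if sym in vocab:
            tokens.append(sym)
        else:
            # Fallback: one token per UTF-8 byte, so nothing maps to <unk>.
            tokens.extend(f"<0x{b:02X}>" for b in sym.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Rebuild the text, turning byte tokens back into raw bytes."""
    buf = bytearray()
    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(tok.encode("utf-8"))
    return buf.decode("utf-8", errors="replace")


toy_vocab = {"a", "n"}  # hypothetical vocabulary that lacks "ə"
tokens = encode_with_byte_fallback(list("ana ə"), toy_vocab)
print(tokens)
print(decode_with_byte_fallback(tokens))  # ana ə
```

In this sketch the space and "ə" are both absent from the toy vocabulary, so they are emitted as byte tokens ("ə" becomes `<0xC9>` and `<0x99>`), and decoding still recovers the original text exactly.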