Kaz-RoBERTa (base-sized model)
Model description
Kaz-RoBERTa is a transformers model pretrained on a large corpus of Kazakh data in a self-supervised fashion. More precisely, it was pretrained with the masked language modeling (MLM) objective.
Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> pipe = pipeline('fill-mask', model='kz-transformers/kaz-roberta-conversational')
>>> pipe("Мәтел тура, ауыспалы, астарлы <mask> қолданылады")
# Out:
# [{'score': 0.8131822347640991,
#   'token': 18749,
#   'token_str': ' мағынада',
#   'sequence': 'Мәтел тура, ауыспалы, астарлы мағынада қолданылады'},
#  ...]
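You can also load the model and tokenizer directly and inspect the mask predictions yourself. Below is a minimal PyTorch sketch using standard transformers calls; nothing here is specific to this checkpoint beyond the model id:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
model = AutoModelForMaskedLM.from_pretrained('kz-transformers/kaz-roberta-conversational')

text = "Мәтел тура, ауыспалы, астарлы <mask> қолданылады"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position(s) of the <mask> token and take the highest-scoring prediction
mask_positions = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))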
Training data
The Kaz-RoBERTa model was pretrained on the combination of two datasets:
- MDBKD (Multi-Domain Bilingual Kazakh Dataset): a Kazakh-language dataset containing over 24,883,808 unique texts from multiple domains.
- Conversational data: preprocessed dialogs between the customer support team of Beeline KZ (Veon Group) and its clients.
Together, these datasets contain 25 GB of text.
Training procedure
Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 52,000. The inputs of the model are pieces of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked with <s> and the end of one with </s>.
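For example, encoding a short sentence with the released tokenizer shows these special tokens wrapped around the sequence (a quick illustration; the sentence itself is arbitrary):
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained('kz-transformers/kaz-roberta-conversational')
>>> ids = tok("Мәтел тура мағынада қолданылады").input_ids
>>> tok.convert_ids_to_tokens(ids)[0], tok.convert_ids_to_tokens(ids)[-1]
# ('<s>', '</s>')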
Pretraining
The model was trained on 2 V100 GPUs for 500K steps with a batch size of 128 and a sequence length of 512. The masking probability for the MLM objective was 15%, and the architecture uses num_attention_heads=12 and num_hidden_layers=6.
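The full training script is not part of this card, but a rough sketch of how such a setup maps onto the transformers API is shown below. The hyperparameters come from the description above; the output directory, per-device batch split, and position-embedding size are illustrative assumptions:
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,   # 512 tokens plus special positions (common RoBERTa convention; assumption)
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = RobertaForMaskedLM(config=config)

tokenizer = RobertaTokenizerFast.from_pretrained('kz-transformers/kaz-roberta-conversational')
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 15% MLM masking
)

training_args = TrainingArguments(
    output_dir="./kaz-roberta-mlm",       # illustrative path
    per_device_train_batch_size=64,       # 2 GPUs x 64 = effective batch size of 128 (assumption)
    max_steps=500_000,
)

# trainer = Trainer(model=model, args=training_args, data_collator=data_collator, train_dataset=...)
# trainer.train()  # train_dataset (the tokenized corpus) is omitted here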
Contributions
Thanks to @BeksultanSagyndyk and @SanzharMrz for adding this model. Point of contact: Sanzhar Murzakhmetov, Beksultan Sagyndyk.