BERTislav
Baseline fill-mask model based on ruBERT and fine-tuned on a 10M-word corpus of mixed Old Church Slavonic, (Later) Church Slavonic, Old East Slavic, Middle Russian, and Medieval Serbian texts.
Overview
- Model Name: BERTislav
- Task: Fill-mask
- Base Model: ai-forever/ruBert-base
- Languages: orv (Old East Slavic, Middle Russian), cu (Old Church Slavonic, Church Slavonic)
- Developed by: Nilo Pedrazzini
Input Format
A str-type input with [MASK]ed tokens.
Output Format
The predicted tokens for each [MASK], each with a confidence score.
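The input and output formats above can be illustrated with a minimal sketch that assumes the model is loaded through the Hugging Face transformers fill-mask pipeline. The repository ID is not stated in this card, so `MODEL_ID` is a placeholder to replace, and `format_predictions` and `predict_masked` are hypothetical helper names introduced only for this example. The sketch handles inputs with a single [MASK]; with several masks the pipeline returns one list of candidates per mask.

```python
from typing import Any

# Assumption: the Hub repository ID is not given in this card.
# Replace this placeholder with the actual ID before running predict_masked.
MODEL_ID = "<namespace>/BERTislav"


def format_predictions(results: list[dict[str, Any]]) -> list[tuple[str, float]]:
    """Reduce single-mask fill-mask output to (token, score) pairs, best first.

    Each item is expected to carry the keys the pipeline returns:
    "score", "token", "token_str" and "sequence".
    """
    pairs = [(r["token_str"].strip(), round(float(r["score"]), 4)) for r in results]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def predict_masked(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return the top_k candidate tokens for the single [MASK] in `text`."""
    # Imported lazily so format_predictions works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    return format_predictions(fill_mask(text, top_k=top_k))
```

Calling `predict_masked` with a normalized string that contains one [MASK] should return a list of (token, score) pairs ordered from most to least likely.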
Examples
Example 1:
COMING SOON
Uses
The model can be used as a baseline model for further finetuning to perform specific downstream tasks (e.g. linguistic annotation).
Bias, Risks, and Limitations
The model should only be considered a baseline and should not be evaluated on its own. Further testing is needed to establish whether it improves the performance of language models fine-tuned for specific tasks.
Training Details
The texts used as training data are from the following sources:
- Fundamental Digital Library Russian Literature & Folklore (FEB-web)
- Puškinskij Dom's Библиотека литературы Древней Руси
- Cyrillomethodiana
- Parts of the Bdinski Sbornik, as digitized in Obdurodon
- Tromsø Old Russian and Old Church Slavonic Treebank (TOROT)
NB: Texts were heavily normalized and anyone planning to use the model is advised to do the same for the best outcome. Use the provided normalization script, customizing it as needed.
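The provided normalization script is not reproduced in this card. The sketch below is purely illustrative and is not that script: it shows one common kind of normalization for historical Slavic text (stripping combining diacritics such as stress marks and titla, lowercasing, collapsing whitespace). The function name `normalize_text` and the specific steps are assumptions; the author's actual normalization may differ substantially.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative normalization; NOT the script distributed with the model."""
    # Compose first so letters with a precomposed form (e.g. й) survive intact.
    text = unicodedata.normalize("NFC", text)
    # Drop remaining non-spacing combining marks (stress accents, titlo, etc.).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(text.lower().split())
```

Whatever rules are used, the same function should be applied both to any fine-tuning data and to the strings passed to the model at inference time.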
Model Card Authors
Nilo Pedrazzini
Model Card Contact
How to use the model
COMING SOON