XLM-Roberta-base NER model for Slavic languages
The train / eval / test splits were concatenated from all languages in the order specified on the command line: sl, hr, sr, bs, mk, sq, cs, bg, pl, ru, sk, uk
We used the following hyper-parameters:
- 256 max-length for tokenizer
- PyTorch's AdamW algorithm with 2e-5 learning rate
- batch size of 20
- 40 epochs (preliminary runs showed best F1-scores between epochs 15 and 35)
- F1-score for best model selection and training progression.
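The settings above can be collected in a small configuration object, and the F1-based model selection can be sketched as picking the epoch with the highest validation F1. This is a minimal illustration, not the training script used for this model: the names `TrainConfig` and `select_best_epoch` and the example scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters listed above (field names are hypothetical)."""
    max_length: int = 256        # tokenizer max length
    learning_rate: float = 2e-5  # AdamW learning rate
    batch_size: int = 20
    epochs: int = 40


def select_best_epoch(f1_by_epoch: Sequence[float]) -> int:
    """Return the 1-based epoch with the highest F1 (earliest epoch on ties)."""
    if not f1_by_epoch:
        raise ValueError("need at least one F1 score")
    best_index = max(range(len(f1_by_epoch)), key=lambda i: f1_by_epoch[i])
    return best_index + 1


if __name__ == "__main__":
    config = TrainConfig()
    # Made-up validation scores for illustration only.
    scores = [0.81, 0.88, 0.91, 0.90]
    print(config, select_best_epoch(scores))
```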
Based on "Analysis of Transfer Learning for Named Entity Recognition in South-Slavic Languages" (Ivačič et al., BSNLP 2023).
Used NER Corpora
We used the following NER corpora:
@misc{11356/1747,
title = {Training corpus {SUK} 1.0},
author = {Arhar Holdt, {\v S}pela and Krek, Simon and Dobrovoljc, Kaja and Erjavec, Toma{\v z} and Gantar, Polona and {\v C}ibej, Jaka and Pori, Eva and Ter{\v c}on, Luka and Munda, Tina and {\v Z}itnik, Slavko and Robida, Nejc and Blagus, Neli and Mo{\v z}e, Sara and Ledinek, Nina and Holz, Nanika and Zupan, Katja and Kuzman, Taja and Kav{\v c}i{\v c}, Teja and {\v S}krjanec, Iza and Marko, Dafne and Jezer{\v s}ek, Lucija and Zajc, Anja},
url = {http://hdl.handle.net/11356/1747},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
issn = {2820-4042},
year = {2022}
}
BSNLP: 3rd Shared Task on SlavNER
We merged 2017+2021 train data with 2021 test data and made custom train / dev / test splits.
We also mapped EVT (event) and PRO (product) tags to MISC to align the corpus with others.
You can change the mappings by running a custom prepare-corpus step (see above).
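The tag mapping described here can be sketched as follows. This is an illustration only, assuming BIO-style labels such as `B-EVT` and `I-PRO`; the names `TAG_MAP`, `remap_tag` and `remap_tags` are hypothetical and are not taken from the project's prepare-corpus code.

```python
from typing import Dict, List

# Mapping described above: events and products are folded into MISC.
TAG_MAP: Dict[str, str] = {"EVT": "MISC", "PRO": "MISC"}


def remap_tag(tag: str, mapping: Dict[str, str] = TAG_MAP) -> str:
    """Remap the entity type of one BIO tag, keeping its B-/I- prefix."""
    if "-" not in tag:
        return tag  # "O" and other unprefixed tags pass through unchanged
    prefix, entity = tag.split("-", 1)
    return f"{prefix}-{mapping.get(entity, entity)}"


def remap_tags(tags: List[str], mapping: Dict[str, str] = TAG_MAP) -> List[str]:
    """Apply remap_tag to every tag of a sentence."""
    return [remap_tag(tag, mapping) for tag in tags]


if __name__ == "__main__":
    print(remap_tags(["B-EVT", "I-EVT", "O", "B-PER", "B-PRO"]))
```

Passing a different `mapping` dictionary corresponds to choosing a different alignment of tag sets.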
@misc{11356/1183,
title = {Training corpus hr500k 1.0},
author = {Ljube{\v s}i{\'c}, Nikola and Agi{\'c}, {\v Z}eljko and Klubi{\v c}ka, Filip and Batanovi{\'c}, Vuk and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1183},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
@misc{11356/1200,
title = {Training corpus {SETimes}.{SR} 1.0},
author = {Batanovi{\'c}, Vuk and Ljube{\v s}i{\'c}, Nikola and Samard{\v z}i{\'c}, Tanja and Erjavec, Toma{\v z}},
url = {http://hdl.handle.net/11356/1200},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2018}
}
- Massively Multilingual Transfer for NER, nicknamed WikiAnn
@inproceedings{rahimi-etal-2019-massively,
title = "Massively Multilingual Transfer for {NER}",
author = "Rahimi, Afshin and
Li, Yuan and
Cohn, Trevor",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1015",
pages = "151--164",
}
@Inbook{Strakova2016,
author="Strakov{\'a}, Jana and Straka, Milan and Haji{\v{c}}, Jan",
editor="Sojka, Petr and Hor{\'a}k, Ale{\v{s}} and Kope{\v{c}}ek, Ivan and Pala, Karel",
title="Neural Networks for Featureless Named Entity Recognition in Czech",
bookTitle="Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno , Czech Republic, September 12-16, 2016, Proceedings",
year="2016",
publisher="Springer International Publishing",
address="Cham",
pages="173--181",
isbn="978-3-319-45510-5",
doi="10.1007/978-3-319-45510-5_20",
url="http://dx.doi.org/10.1007/978-3-319-45510-5_20"
}
NER Evaluation
For evaluation, we use seqeval:
@misc{seqeval,
title={{seqeval}: A Python framework for sequence labeling evaluation},
url={https://github.com/chakki-works/seqeval},
note={Software available from https://github.com/chakki-works/seqeval},
author={Hiroki Nakayama},
year={2018},
}
which is based on:
@inproceedings{ramshaw-marcus-1995-text,
title = "Text Chunking using Transformation-Based Learning",
author = "Ramshaw, Lance and
Marcus, Mitch",
booktitle = "Third Workshop on Very Large Corpora",
year = "1995",
url = "https://www.aclweb.org/anthology/W95-0107",
}
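To make the entity-level metrics concrete, here is a minimal sketch of micro-averaged precision, recall and F1 over exact-match entity spans in BIO-tagged sentences. It is a simplified stand-in written for illustration, not seqeval's implementation; the function names `extract_spans` and `entity_prf` are hypothetical, and stray `I-` tags are leniently treated as span starts.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and entity == label
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if prefix in ("B", "I") and entity:
            start, label = i, entity  # open a new span
        else:
            start, label = None, None
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "B-MISC"]]
    print(entity_prf(gold, pred))
```

A predicted entity counts as correct only if both its boundaries and its type match the gold annotation, which is why entity-level F1 is lower than token-level accuracy.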
Evaluation results
- Accuracy (self-reported): 98.346
- F1-score (self-reported): 93.158
- Precision (self-reported): 92.700
- Recall (self-reported): 93.622
- LOC Precision (self-reported): 94.105
- LOC Recall (self-reported): 95.513
- LOC F1-score (self-reported): 94.804
- MISC Precision (self-reported): 85.196
- MISC Recall (self-reported): 85.545
- MISC F1-score (self-reported): 85.370