---
language:
- is
- da
- sv
- 'no'
- fo
widget:
- text: Fina lilla<mask>, jag vill inte bliva stur.
- text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn.
- text: Það er vorhret á<mask>, napur vindur sem hvín.
- text: Ja, Gud signi<mask>, mítt land.
- text: Alle dyrene i<mask> må være venner.
tags:
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
license: agpl-3.0
---

# ScandiBERT-no-faroese
This is a version of the ScandiBERT model trained without any Faroese data and with a different subword tokenizer.

The model was trained on the data shown in the table below, with a batch size of 8.8k, for 72 epochs on 24 V100 GPUs, taking about two weeks.
| Language  | Data                                    | Size   |
|-----------|-----------------------------------------|--------|
| Icelandic | See the IceBERT paper                   | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter)  | 4.7 GB |
| Norwegian | NCC corpus                              | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                 | 3.4 GB |
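
A minimal usage sketch with the Hugging Face `transformers` fill-mask pipeline. The model identifier below is an assumption; substitute the actual repository path for this card.

```python
from transformers import pipeline

# NOTE: the model ID is an assumption -- replace it with this card's actual repo path.
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT-no-faroese")

# One of the widget examples above (Icelandic); <mask> is the RoBERTa-style mask token.
for pred in fill_mask("Það er vorhret á<mask>, napur vindur sem hvín."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```

Each prediction is a dict with the filled-in token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).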
If you find this model useful, please cite:
```bibtex
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and
      Simonsen, Annika and
      Glavaš, Goran and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = "Link{\"o}ping University Electronic Press, Sweden",
}
```