---
license: cc-by-nc-4.0
language:
- gsw
- multilingual
widget:
- text: "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
---
This model is [**google/canine-s**](https://huggingface.co/google/canine-s) ([Clark et al., TACL 2022](https://aclanthology.org/2022.tacl-1.5/)), adapted to Swiss German through continued pre-training on Swiss German text data.
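
CANINE operates directly on characters rather than on subword tokens, so dialect spellings such as those in the widget sentence are never split by a fixed vocabulary. The following is a minimal, library-free sketch of that idea, not the model's actual tokenizer: each character is mapped to its Unicode code point. The helper names `encode_chars` and `decode_chars` are hypothetical, and the private-use code points chosen for the `[CLS]` and `[SEP]` markers are an assumption made for illustration.

```python
from typing import List

# Assumption: private-use code points standing in for the [CLS] and [SEP] markers.
CLS = 0xE000
SEP = 0xE001


def encode_chars(text: str) -> List[int]:
    """Map each character to its Unicode code point, wrapped in [CLS] ... [SEP]."""
    return [CLS] + [ord(ch) for ch in text] + [SEP]


def decode_chars(ids: List[int]) -> str:
    """Invert encode_chars, dropping the two special markers."""
    return "".join(chr(i) for i in ids if i not in (CLS, SEP))


example = "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
ids = encode_chars(example)

# One ID per character plus the two markers; the round trip is lossless.
print(len(example), len(ids))
print(decode_chars(ids) == example)
```

In practice, the tokenizer shipped with the model should be used instead of this sketch; the point here is only that the input sequence has one position per character.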
## Training Objective
We used the CANINE-S objective combined with the subword vocabulary of [SwissBERT](https://huggingface.co/ZurichNLP/swissbert).
## Training Data
For continued pre-training, we used the following two datasets of written Swiss German:
1. [SwissCrawl](https://icosys.ch/swisscrawl) ([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.
In addition, we trained the model on an equal amount of Standard German data. We used news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).
## License
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
## Citation
```bibtex
@inproceedings{vamvas-etal-2024-modular,
title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
booktitle={First Workshop on Modular and Open Multilingual NLP},
year={2024},
}
```