Update README.md
README.md
CHANGED
@@ -6,3 +6,28 @@ language:
 widget:
 - text: "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
 ---
+
+This is the [**google/canine-s**](https://huggingface.co/google/canine-s) model ([Clark et al., TACL 2022](https://aclanthology.org/2022.tacl-1.5/)), further trained on Swiss German text data via continued pre-training.
+
+## Training Objective
+We used the CANINE-S objective combined with the subword vocabulary of [SwissBERT](https://huggingface.co/ZurichNLP/swissbert).
+
+## Training Data
+For continued pre-training, we used the following two datasets of written Swiss German:
+1. [SwissCrawl](https://icosys.ch/swisscrawl) ([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
+2. A custom dataset of Swiss German tweets
+
+In addition, we trained the model on an equal amount of Standard German data. We used news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).
+
+## License
+Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
+
+## Citation
+```bibtex
+@inproceedings{vamvas-etal-2024-modular,
+title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
+author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
+booktitle={First Workshop on Modular and Open Multilingual NLP},
+year={2024},
+}
+```