---
license: cc-by-nc-4.0
language:
  - de
  - fr
  - it
  - rm
  - gsw
  - multilingual
inference: false
---

SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

<img src="https://vamvas.ch/assets/swissbert/swissbert-diagram.png" alt="SwissBERT is a transformer encoder with language adapters in each layer. There is an adapter for each national language of Switzerland. The other parameters in the model are shared among the four languages." width="450" style="max-width: 100%;">

SwissBERT is based on [X-MOD](https://huggingface.co/facebook/xmod-base), which has been pre-trained with language adapters in 81 languages.
For SwissBERT we trained adapters for the national languages of Switzerland – German, French, Italian, and Romansh Grischun.
In addition, we used a Switzerland-specific subword vocabulary.

The pre-training code and usage examples are available in the [SwissBERT repository](https://github.com/ZurichNLP/swissbert). We also release a version that was fine-tuned for named entity recognition (NER): https://huggingface.co/ZurichNLP/swissbert-ner

## Update 2024-01: Support for Swiss German
We added a Swiss German adapter to the model.

## Languages

SwissBERT contains the following language adapters:

| lang_id (Adapter index) | Language code | Language              |
|-------------------------|---------------|-----------------------|
| 0                       | `de_CH`       | Swiss Standard German |
| 1                       | `fr_CH`       | French                |
| 2                       | `it_CH`       | Italian               |
| 3                       | `rm_CH`       | Romansh Grischun      |
| 4                       | `gsw`         | Swiss German          |
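
The adapter indices in the table can also be passed per sample, which is useful when one batch mixes languages. The following is a minimal sketch, assuming the `lang_ids` forward argument that the Transformers documentation describes for X-MOD models; the helper names `lang_ids_for` and `encode_mixed_batch` are hypothetical and not part of SwissBERT or Transformers.

```python
# Adapter indices as listed in the table above.
LANG_IDS = {"de_CH": 0, "fr_CH": 1, "it_CH": 2, "rm_CH": 3, "gsw": 4}


def lang_ids_for(codes):
    """Translate language codes into adapter indices, one per sample."""
    return [LANG_IDS[code] for code in codes]


def encode_mixed_batch(texts, codes, model_name="ZurichNLP/swissbert"):
    """Encode a batch in which each text may use a different adapter.

    Hypothetical helper: needs torch and transformers, and downloads
    the model on first use.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    # One adapter index per sample, in the same order as the texts.
    lang_ids = torch.LongTensor(lang_ids_for(codes))
    with torch.no_grad():
        outputs = model(**inputs, lang_ids=lang_ids)
    return outputs.last_hidden_state
```

For example, `lang_ids_for(["gsw", "fr_CH"])` yields the indices for a batch whose first text is Swiss German and whose second text is French.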

## License
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Usage (masked language modeling)

```python
from transformers import pipeline

# Load SwissBERT for the fill-mask task; the weights are downloaded on first use.
fill_mask = pipeline("fill-mask", model="ZurichNLP/swissbert")
```

### German example
```python
# Activate the Swiss Standard German adapter for all subsequent calls.
fill_mask.model.set_default_language("de_CH")
fill_mask("Der schönste Kanton der Schweiz ist <mask>.")
```
Output:
```
[{'score': 0.1373230218887329,
  'token': 331,
  'token_str': 'Zürich',
  'sequence': 'Der schönste Kanton der Schweiz ist Zürich.'},
 {'score': 0.08464793860912323,
  'token': 5903,
  'token_str': 'Appenzell',
  'sequence': 'Der schönste Kanton der Schweiz ist Appenzell.'},
 {'score': 0.08250337839126587,
  'token': 10800,
  'token_str': 'Graubünden',
  'sequence': 'Der schönste Kanton der Schweiz ist Graubünden.'},
 ...]
```

### French example
```python
# Switch to the French adapter for all subsequent calls.
fill_mask.model.set_default_language("fr_CH")
fill_mask("Je m'appelle <mask> Federer.")
```
Output:
```
[{'score': 0.9943694472312927,
  'token': 1371,
  'token_str': 'Roger',
  'sequence': "Je m'appelle Roger Federer."},
 ...]
```

## Bias, Risks, and Limitations
- SwissBERT is mainly intended for tagging tokens in written text (e.g., named entity recognition, part-of-speech tagging), text classification, and the encoding of words, sentences, or documents into fixed-size embeddings. It is not designed for generating text.
- The model was adapted using written news articles and might perform worse on other domains or language varieties.
- While we have removed many author bylines, we did not anonymize the pre-training corpus. The model might have memorized information that has been described in the news but is no longer in the public interest.

## Training Details
- Training data: German, French, Italian and Romansh documents in the [Swissdox@LiRI](https://t.uzh.ch/1hI) database, until 2022.
- Training procedure: Masked language modeling

The Swiss German adapter was trained on the following two datasets of written Swiss German:
1. [SwissCrawl](https://icosys.ch/swisscrawl)&nbsp;([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.

## Environmental Impact
- Hardware type: RTX 2080 Ti
- Hours used: 10 epochs × 18 hours × 8 devices = 1440 hours
- Site: Zurich, Switzerland
- Energy source: 100% hydropower ([source](https://t.uzh.ch/1rU))
- Carbon efficiency: 0.0016 kg CO2e/kWh ([source](https://t.uzh.ch/1rU))
- Carbon emitted: 0.6 kg CO2e ([source](https://mlco2.github.io/impact#compute))
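
The figures above can be cross-checked with a short calculation. This is a rough sketch, not the calculator's actual method: it assumes each GPU draws a nominal 250 W for the whole run, a value that is not stated in this card, and the function names are illustrative only.

```python
# Figures stated in the list above.
EPOCHS = 10
HOURS_PER_EPOCH = 18
DEVICES = 8
CARBON_EFFICIENCY_KG_PER_KWH = 0.0016

# Assumption (not stated above): each GPU draws a nominal 250 W throughout.
ASSUMED_POWER_KW_PER_GPU = 0.25


def gpu_hours(epochs, hours_per_epoch, devices):
    """Total device hours across all epochs and GPUs."""
    return epochs * hours_per_epoch * devices


def emissions_kg(hours, power_kw, kg_per_kwh):
    """Energy (hours x kW) multiplied by the grid's carbon efficiency."""
    return hours * power_kw * kg_per_kwh


total_hours = gpu_hours(EPOCHS, HOURS_PER_EPOCH, DEVICES)
energy_kwh = total_hours * ASSUMED_POWER_KW_PER_GPU
carbon = emissions_kg(total_hours, ASSUMED_POWER_KW_PER_GPU,
                      CARBON_EFFICIENCY_KG_PER_KWH)

print(f"{total_hours} GPU hours, {energy_kwh:.0f} kWh, {carbon:.1f} kg CO2e")
# prints: 1440 GPU hours, 360 kWh, 0.6 kg CO2e
```

Under that power assumption, the result agrees with the reported 0.6 kg CO2e after rounding.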

## Citations
```bibtex
@inproceedings{vamvas-etal-2023-swissbert,
    title = "{S}wiss{BERT}: The Multilingual Language Model for {S}witzerland",
    author = {Vamvas, Jannis  and
      Gra{\"e}n, Johannes  and
      Sennrich, Rico},
    editor = {Ghorbel, Hatem  and
      Sokhn, Maria  and
      Cieliebak, Mark  and
      H{\"u}rlimann, Manuela  and
      de Salis, Emmanuel  and
      Guerne, Jonathan},
    booktitle = "Proceedings of the 8th edition of the Swiss Text Analytics Conference",
    month = jun,
    year = "2023",
    address = "Neuchatel, Switzerland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.swisstext-1.6",
    pages = "54--69",
}
```

Swiss German adapter:
```bibtex
@inproceedings{vamvas-etal-2024-modular,
      title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
      author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
      booktitle={First Workshop on Modular and Open Multilingual NLP},
      year={2024},
}
```