---
license: cc-by-nc-4.0
language:
- de
- fr
- it
- rm
- multilingual
inference: false
---

SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

SwissBERT is based on [X-MOD](https://huggingface.co/facebook/xmod-base), which has been pre-trained with language adapters in 81 languages. For SwissBERT, we trained adapters for the national languages of Switzerland: German, French, Italian, and Romansh Grischun. In addition, we use a Switzerland-specific subword vocabulary.

The pre-training code and usage examples are available [here](https://github.com/ZurichNLP/swissbert). We also release a version fine-tuned for named entity recognition (NER): https://huggingface.co/ZurichNLP/swissbert-ner
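
Since the official examples live in the linked repository, the following is only a minimal fill-mask sketch, assuming the standard `transformers` X-MOD interface, where `set_default_language` activates a language adapter; the example sentence and decoding logic are ours, not taken from the official examples.

```python
# Minimal fill-mask sketch, assuming the standard transformers X-MOD API;
# see the linked repository for the authors' own usage examples.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModelForMaskedLM.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # activate the Swiss Standard German adapter

# Example sentence (ours): "The capital of Switzerland is <mask>."
inputs = tokenizer("Die Hauptstadt der Schweiz ist <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most likely token at the masked position
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```

Loading through `AutoModelForMaskedLM` resolves to the X-MOD masked-LM architecture declared in the model's configuration.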

## Languages

SwissBERT contains the following language adapters:

| lang_id (adapter index) | Language code | Language              |
|-------------------------|---------------|-----------------------|
| 0                       | `de_CH`       | Swiss Standard German |
| 1                       | `fr_CH`       | French                |
| 2                       | `it_CH`       | Italian               |
| 3                       | `rm_CH`       | Romansh Grischun      |
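
As a sketch of how these adapters are selected (the language codes come from the table above; the call is the standard X-MOD `set_default_language`, and the French sentence is our own example):

```python
# Hedged sketch: select the adapter matching the input language before encoding.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")

model.set_default_language("fr_CH")  # lang_id 1: French
outputs = model(**tokenizer("Le Conseil fédéral se réunit à Berne.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```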

## License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Bias, Risks, and Limitations

- SwissBERT is mainly intended for tagging tokens in written text (e.g., named entity recognition, part-of-speech tagging), for text classification, and for encoding words, sentences, or documents into fixed-size embeddings (see the sketch after this list). It is not designed for generating text.
- The model was adapted on written news articles and might perform worse on other domains or language varieties.
- While we have removed many author bylines, we did not anonymize the pre-training corpus. The model might have memorized information that has been described in the news but is no longer in the public interest.
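
For the embedding use mentioned in the first point, here is a minimal sketch assuming attention-masked mean pooling over the final hidden states, one common recipe rather than a setup prescribed by the authors:

```python
# Sketch: fixed-size sentence embedding via attention-masked mean pooling.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("it_CH")  # Italian adapter

inputs = tokenizer("Una frase di esempio.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

mask = inputs.attention_mask.unsqueeze(-1)        # ignore padding positions
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (1, hidden_size)
print(embedding.shape)
```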

## Training Details

- Training data: German, French, Italian, and Romansh documents in the [Swissdox@LiRI](https://t.uzh.ch/1hI) database, up to 2022
- Training procedure: masked language modeling

## Environmental Impact

- Hardware type: RTX 2080 Ti (8 devices)
- Hours used: 10 epochs × 18 hours × 8 devices = 1,440 GPU-hours
- Site: Zurich, Switzerland
- Energy source: 100% hydropower ([source](https://t.uzh.ch/1rU))
- Carbon efficiency: 0.0016 kg CO2e/kWh ([source](https://t.uzh.ch/1rU))
- Carbon emitted: 0.6 kg CO2e ([source](https://mlco2.github.io/impact#compute)); see the check below
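
For reference, the reported figure can be reproduced roughly as follows; the 250 W per-GPU draw is our assumption (the TDP of an RTX 2080 Ti), not a number stated in this card:

```python
# Back-of-the-envelope check of the carbon estimate above.
gpu_hours = 10 * 18 * 8            # epochs x hours x devices = 1440 GPU-hours
power_kw = 0.250                   # assumed draw per RTX 2080 Ti (its TDP)
energy_kwh = gpu_hours * power_kw  # 360 kWh
co2_kg = energy_kwh * 0.0016       # carbon efficiency from the card
print(f"{co2_kg:.2f} kg CO2e")     # ~0.58, consistent with the reported 0.6 kg
```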