- zh
license: mit
---

# xmod-base

X-MOD is a multilingual masked language model trained on filtered CommonCrawl data containing 81 languages. It was introduced in the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) (Pfeiffer et al., NAACL 2022) and first released in [this repository](https://github.com/facebookresearch/fairseq/tree/main/examples/xmod).

Because it has been pre-trained with language-specific modular components (_language adapters_), X-MOD differs from previous multilingual models like [XLM-R](https://huggingface.co/xlm-roberta-base). During fine-tuning, the language adapters in each transformer layer are kept frozen.

# Usage

## Tokenizer
This model reuses the tokenizer of [XLM-R](https://huggingface.co/xlm-roberta-base), so you can load the tokenizer as follows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```

## Input Language
Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated:

```python
from transformers import XMODModel

model = XMODModel.from_pretrained("jvamvas/xmod-base")
model.set_default_language("en_XX")
```

A list of the language adapters in this model can be found at the bottom of this model card.
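
For illustration, here is a minimal end-to-end sketch that combines the tokenizer and model calls above to encode an English sentence (assuming PyTorch and the class names used elsewhere in this card):

```python
import torch
from transformers import AutoTokenizer, XMODModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = XMODModel.from_pretrained("jvamvas/xmod-base")

# Activate the English adapter before encoding English input.
model.set_default_language("en_XX")

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```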

## Fine-tuning
The paper recommends that the embedding layer and the language adapters be frozen during fine-tuning. A method for doing this is provided in the code:

```python
model.freeze_embeddings_and_language_adapters()
# Fine-tune the model ...
```
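
As a rough illustration of what "fine-tune the model" could look like, here is a sketch of a single training step with a hypothetical classification head on top of the frozen-adapter encoder (the task head, hyperparameters, and toy batch are illustrative assumptions, not part of the original recipe):

```python
import torch
from transformers import AutoTokenizer, XMODModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = XMODModel.from_pretrained("jvamvas/xmod-base")
model.set_default_language("en_XX")

# Freeze the embedding layer and all language adapters, as recommended in the paper.
model.freeze_embeddings_and_language_adapters()

# Hypothetical task head: a linear classifier over the first-token representation.
classifier = torch.nn.Linear(model.config.hidden_size, 2)

# Optimize only the trainable encoder parameters plus the task head.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad] + list(classifier.parameters()),
    lr=2e-5,
)

# One illustrative training step on a toy batch.
batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

logits = classifier(model(**batch).last_hidden_state[:, 0])
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```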

## Cross-lingual Transfer
After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:

```python
model.set_default_language("de_DE")
# Evaluate the model on German examples ...
```
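
Continuing the sketch above, evaluating on the target language only requires switching the active adapter (the German examples are again purely illustrative):

```python
model.set_default_language("de_DE")
model.eval()

batch = tokenizer(["toller Film", "schrecklicher Film"], return_tensors="pt", padding=True)
with torch.no_grad():
    predictions = classifier(model(**batch).last_hidden_state[:, 0]).argmax(dim=-1)
print(predictions)
```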

# Bias, Risks, and Limitations

Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because X-MOD has a similar architecture and has been trained on similar training data.

# Citation

**BibTeX:**

```bibtex
@inproceedings{pfeiffer-etal-2022-lifting,
    title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
    author = "Pfeiffer, Jonas and
      Goyal, Naman and
      Lin, Xi and
      Li, Xian and
      Cross, James and
      Riedel, Sebastian and
      Artetxe, Mikel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.255",
    doi = "10.18653/v1/2022.naacl-main.255",
    pages = "3479--3495"
}
```

# Languages

This model contains the following language adapters:

| Language code | Language |
|---------------|-----------------------|
| af_ZA | Afrikaans |
| am_ET | Amharic |
| ar_AR | Arabic |
| az_AZ | Azerbaijani |
| be_BY | Belarusian |
| bg_BG | Bulgarian |
| bn_IN | Bengali |
| ca_ES | Catalan |
| cs_CZ | Czech |
| cy_GB | Welsh |
| da_DK | Danish |
| de_DE | German |
| el_GR | Greek |
| en_XX | English |
| eo_EO | Esperanto |
| es_XX | Spanish |
| et_EE | Estonian |
| eu_ES | Basque |
| fa_IR | Persian |
| fi_FI | Finnish |
| fr_XX | French |
| ga_IE | Irish |
| gl_ES | Galician |
| gu_IN | Gujarati |
| ha_NG | Hausa |
| he_IL | Hebrew |
| hi_IN | Hindi |
| hr_HR | Croatian |
| hu_HU | Hungarian |
| hy_AM | Armenian |
| id_ID | Indonesian |
| is_IS | Icelandic |
| it_IT | Italian |
| ja_XX | Japanese |
| ka_GE | Georgian |
| kk_KZ | Kazakh |
| km_KH | Central Khmer |
| kn_IN | Kannada |
| ko_KR | Korean |
| ku_TR | Kurdish |
| ky_KG | Kirghiz |
| la_VA | Latin |
| lo_LA | Lao |
| lt_LT | Lithuanian |
| lv_LV | Latvian |
| mk_MK | Macedonian |
| ml_IN | Malayalam |
| mn_MN | Mongolian |
| mr_IN | Marathi |
| ms_MY | Malay |
| my_MM | Burmese |
| ne_NP | Nepali |
| nl_XX | Dutch |
| no_XX | Norwegian |
| or_IN | Oriya |
| pa_IN | Punjabi |
| pl_PL | Polish |
| ps_AF | Pashto |
| pt_XX | Portuguese |
| ro_RO | Romanian |
| ru_RU | Russian |
| sa_IN | Sanskrit |
| si_LK | Sinhala |
| sk_SK | Slovak |
| sl_SI | Slovenian |
| so_SO | Somali |
| sq_AL | Albanian |
| sr_RS | Serbian |
| sv_SE | Swedish |
| sw_KE | Swahili |
| ta_IN | Tamil |
| te_IN | Telugu |
| th_TH | Thai |
| tl_XX | Tagalog |
| tr_TR | Turkish |
| uk_UA | Ukrainian |
| ur_PK | Urdu |
| uz_UZ | Uzbek |
| vi_VN | Vietnamese |
| zh_CN | Chinese (simplified) |
| zh_TW | Chinese (traditional) |
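
The same directory should also be retrievable from the loaded model's configuration; the `languages` config field below is an assumption based on the X-MOD configuration class, so verify it against your installed `transformers` version:

```python
from transformers import XMODModel

model = XMODModel.from_pretrained("jvamvas/xmod-base")

# Assumed config field: the list of language codes that have a pre-trained adapter.
print(model.config.languages)
```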