	---
	language:
	- multilingual
	- af
	- am
	- ar
	- az
	- be
	- bg
	- bn
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- fi
	- fr
	- ga
	- gl
	- gu
	- ha
	- he
	- hi
	- hr
	- hu
	- hy
	- id
	- is
	- it
	- ja
	- ka
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lo
	- lt
	- lv
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- no
	- or
	- pa
	- pl
	- ps
	- pt
	- ro
	- ru
	- sa
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- sv
	- sw
	- ta
	- te
	- th
	- tl
	- tr
	- uk
	- ur
	- uz
	- vi
	- zh
	license: mit
	---

	# xmod-base

	X-MOD is a multilingual masked language model trained on filtered CommonCrawl data containing 81 languages. It was introduced in the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) (Pfeiffer et al., NAACL 2022) and first released in [this repository](https://github.com/facebookresearch/fairseq/tree/main/examples/xmod).

	X-MOD differs from previous multilingual models such as [XLM-R](https://huggingface.co/xlm-roberta-base) in that it is pre-trained with language-specific modular components (_language adapters_) in each transformer layer. During fine-tuning, these language adapters are kept frozen.

	# Usage

	## Tokenizer
	This model reuses the tokenizer of [XLM-R](https://huggingface.co/xlm-roberta-base).

	## Input Language
	Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated:

	```python
	from transformers import XmodModel

	model = XmodModel.from_pretrained("facebook/xmod-base")
	model.set_default_language("en_XX")
	```

	The full list of language adapters in this model, with their language codes, is given in the Languages section at the bottom of this model card.
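
The adapter codes that a checkpoint accepts are stored in its configuration. The following is a minimal sketch, assuming a tiny, randomly initialised model built from a hypothetical `XmodConfig` so that it runs without downloading weights; the sizes and the two-language list are illustrative assumptions, not the settings of `facebook/xmod-base`. It shows that `set_default_language` rejects a code for which no adapter exists.

```python
from transformers import XmodConfig, XmodModel

# Hypothetical toy configuration: sizes and languages are illustrative only,
# not the values used by facebook/xmod-base.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=["en_XX", "de_DE"],
)
model = XmodModel(config)

# Adapter codes available in this (toy) model.
available = list(model.config.languages)

# Selecting an existing adapter updates the default language in the config.
model.set_default_language("de_DE")
default_after_set = model.config.default_language

# A code without a matching adapter is rejected with a ValueError.
try:
    model.set_default_language("fr_XX")
    rejected = False
except ValueError:
    rejected = True
```

With the real checkpoint, the same `model.config.languages` attribute should list the codes from the Languages table.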

	## Fine-tuning
	In the experiments in the original paper, the embedding layer and the language adapters are frozen during fine-tuning. A method for doing this is provided in the code:

	```python
	model.freeze_embeddings_and_language_adapters()
	# Fine-tune the model ...
	```
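
To check what the freezing step leaves trainable, the parameters can be split by their `requires_grad` flag. This is a sketch under stated assumptions: it uses a tiny, randomly initialised `XmodForSequenceClassification` built from a hypothetical configuration (toy sizes, two languages) instead of the released checkpoint, and the learning rate is an arbitrary placeholder.

```python
import torch
from transformers import XmodConfig, XmodForSequenceClassification

# Hypothetical toy configuration, chosen only so the example runs quickly.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=["en_XX", "de_DE"],
    num_labels=2,
)
model = XmodForSequenceClassification(config)
model.freeze_embeddings_and_language_adapters()

# Split parameter names by whether they will receive gradient updates.
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(frozen)} frozen tensors, {len(trainable)} trainable tensors")

# Hand only the trainable parameters to the optimizer (lr is a placeholder).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

The frozen names should be limited to the embedding layer and the adapter modules, while the shared transformer weights and the classification head remain trainable.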

	## Cross-lingual Transfer
	After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:
	```python
	model.set_default_language("de_DE")
	# Evaluate the model on German examples ...
	```
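
The effect of switching adapters can be observed directly on the hidden states. The sketch below assumes a tiny, randomly initialised model with a hypothetical two-language configuration and arbitrary token ids, so the numbers carry no linguistic meaning. It also uses the `lang_ids` forward argument of the Transformers implementation, which, as far as I understand it, selects an adapter per example by its index in `config.languages`.

```python
import torch
from transformers import XmodConfig, XmodModel

torch.manual_seed(0)

# Hypothetical toy configuration, not the settings of facebook/xmod-base.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=["en_XX", "de_DE"],
)
model = XmodModel(config).eval()  # eval() disables dropout

input_ids = torch.tensor([[0, 5, 6, 2]])  # arbitrary toy token ids

with torch.no_grad():
    model.set_default_language("en_XX")
    out_en = model(input_ids).last_hidden_state

    model.set_default_language("de_DE")
    out_de = model(input_ids).last_hidden_state

    # Per-example selection: index 1 is "de_DE" in this toy config.
    out_de_ids = model(input_ids, lang_ids=torch.tensor([1])).last_hidden_state

# Different adapters give different representations for the same input.
print(torch.allclose(out_en, out_de))
# Selecting the adapter by index matches selecting it by default language.
print(torch.allclose(out_de, out_de_ids))
```

Passing `lang_ids` may be convenient when a batch mixes examples from several target languages, since the default language applies to the whole batch.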

	# Bias, Risks, and Limitations

	Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because X-MOD has a similar architecture and has been trained on similar training data.


	# Citation

	BibTeX:

	```bibtex
	@inproceedings{pfeiffer-etal-2022-lifting,
	title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
	author = "Pfeiffer, Jonas and
	Goyal, Naman and
	Lin, Xi and
	Li, Xian and
	Cross, James and
	Riedel, Sebastian and
	Artetxe, Mikel",
	booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
	month = jul,
	year = "2022",
	address = "Seattle, United States",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2022.naacl-main.255",
	doi = "10.18653/v1/2022.naacl-main.255",
	pages = "3479--3495"
	}
	```

	# Languages

	This model contains the following language adapters:

	| lang_id (Adapter index) | Language code | Language |
	|-------------------------|---------------|-----------------------|
	| 0 | en_XX | English |
	| 1 | id_ID | Indonesian |
	| 2 | vi_VN | Vietnamese |
	| 3 | ru_RU | Russian |
	| 4 | fa_IR | Persian |
	| 5 | sv_SE | Swedish |
	| 6 | ja_XX | Japanese |
	| 7 | fr_XX | French |
	| 8 | de_DE | German |
	| 9 | ro_RO | Romanian |
	| 10 | ko_KR | Korean |
	| 11 | hu_HU | Hungarian |
	| 12 | es_XX | Spanish |
	| 13 | fi_FI | Finnish |
	| 14 | uk_UA | Ukrainian |
	| 15 | da_DK | Danish |
	| 16 | pt_XX | Portuguese |
	| 17 | no_XX | Norwegian |
	| 18 | th_TH | Thai |
	| 19 | pl_PL | Polish |
	| 20 | bg_BG | Bulgarian |
	| 21 | nl_XX | Dutch |
	| 22 | zh_CN | Chinese (simplified) |
	| 23 | he_IL | Hebrew |
	| 24 | el_GR | Greek |
	| 25 | it_IT | Italian |
	| 26 | sk_SK | Slovak |
	| 27 | hr_HR | Croatian |
	| 28 | tr_TR | Turkish |
	| 29 | ar_AR | Arabic |
	| 30 | cs_CZ | Czech |
	| 31 | lt_LT | Lithuanian |
	| 32 | hi_IN | Hindi |
	| 33 | zh_TW | Chinese (traditional) |
	| 34 | ca_ES | Catalan |
	| 35 | ms_MY | Malay |
	| 36 | sl_SI | Slovenian |
	| 37 | lv_LV | Latvian |
	| 38 | ta_IN | Tamil |
	| 39 | bn_IN | Bengali |
	| 40 | et_EE | Estonian |
	| 41 | az_AZ | Azerbaijani |
	| 42 | sq_AL | Albanian |
	| 43 | sr_RS | Serbian |
	| 44 | kk_KZ | Kazakh |
	| 45 | ka_GE | Georgian |
	| 46 | tl_XX | Tagalog |
	| 47 | ur_PK | Urdu |
	| 48 | is_IS | Icelandic |
	| 49 | hy_AM | Armenian |
	| 50 | ml_IN | Malayalam |
	| 51 | mk_MK | Macedonian |
	| 52 | be_BY | Belarusian |
	| 53 | la_VA | Latin |
	| 54 | te_IN | Telugu |
	| 55 | eu_ES | Basque |
	| 56 | gl_ES | Galician |
	| 57 | mn_MN | Mongolian |
	| 58 | kn_IN | Kannada |
	| 59 | ne_NP | Nepali |
	| 60 | sw_KE | Swahili |
	| 61 | si_LK | Sinhala |
	| 62 | mr_IN | Marathi |
	| 63 | af_ZA | Afrikaans |
	| 64 | gu_IN | Gujarati |
	| 65 | cy_GB | Welsh |
	| 66 | eo_EO | Esperanto |
	| 67 | km_KH | Central Khmer |
	| 68 | ky_KG | Kirghiz |
	| 69 | uz_UZ | Uzbek |
	| 70 | ps_AF | Pashto |
	| 71 | pa_IN | Punjabi |
	| 72 | ga_IE | Irish |
	| 73 | ha_NG | Hausa |
	| 74 | am_ET | Amharic |
	| 75 | lo_LA | Lao |
	| 76 | ku_TR | Kurdish |
	| 77 | so_SO | Somali |
	| 78 | my_MM | Burmese |
	| 79 | or_IN | Oriya |
	| 80 | sa_IN | Sanskrit |
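
The adapter codes in the table do not follow a single region convention (`en_XX`, `de_DE`, `sw_KE`), and Chinese has two adapters, so guessing a code from a two-letter language tag can fail. The helper below is a hypothetical convenience, not part of the library: it looks up adapter codes by language prefix in a small excerpt copied from the table. With a loaded model, the same lookup could presumably run over `model.config.languages` instead.

```python
# Excerpt of the table above (adapter code -> lang_id); not the full list.
ADAPTERS = {
    "en_XX": 0,
    "de_DE": 8,
    "zh_CN": 22,
    "zh_TW": 33,
    "sw_KE": 60,
}


def adapters_for(iso_code, adapters=ADAPTERS):
    """Return the adapter codes whose language prefix equals iso_code."""
    return [code for code in adapters if code.split("_")[0] == iso_code]


print(adapters_for("de"))  # one match
print(adapters_for("zh"))  # two matches: the caller has to pick a script
print(adapters_for("xx"))  # no adapter for this prefix
```

Returning a list instead of a single code keeps the ambiguous case (simplified versus traditional Chinese) visible to the caller.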