---
license: apache-2.0
datasets:
- akoksal/muri-it
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cos
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fas
- fin
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- haw
- hbs
- heb
- hin
- hun
- hye
- ibo
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lat
- lav
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nya
- pan
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
base_model:
- google/mt5-xxl
pipeline_tag: text2text-generation
---
# MURI-101: Multilingual Instruction-Following Model for 101 languages (mT5-XXL)
MURI-101 is a multilingual instruction-following model, fine-tuned on a subset of the [**MURI-IT**](https://huggingface.co/datasets/akoksal/muri-it) dataset. It supports **101 languages** and outperforms most multilingual models in both **Natural Language Understanding (NLU)** and **Natural Language Generation (NLG)** tasks, especially in low-resource settings.

The model was trained on data built via multilingual reverse instructions, so outputs remain culturally and linguistically appropriate for the target language and translation artifacts are reduced.

[Paper](https://arxiv.org/abs/2409.12958)
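The dataset card for [MURI-IT](https://huggingface.co/datasets/akoksal/muri-it) documents the full collection; as a quick, hedged way to inspect the data the model was fine-tuned on (the split and column names below are assumptions, so check the dataset card), something like the following should work:

```python
# Hedged sketch for browsing the MURI-IT data; the split name and the exact
# column names are assumptions, not taken from this model card.
from datasets import load_dataset

muri_it = load_dataset("akoksal/muri-it", split="train")
print(muri_it)      # dataset size and column names
print(muri_it[0])   # one instruction/output example
```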
### Model Architecture
- **Base Model**: mT5-XXL
- **Training Data**: Subset of MURI-IT
- **Training Setup**: Trained with [t5x](https://github.com/google-research/t5x) on a TPU v4-32. Batch size: 64, data packing enabled, learning rate: 3e-4 with no scheduler, 5 epochs. (A hedged Hugging Face approximation of these hyperparameters is sketched below.)
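The released checkpoint was produced with t5x, but as a rough point of reference, the hyperparameters above expressed as Hugging Face `Seq2SeqTrainingArguments` might look like the sketch below. Everything not listed above (output path, dtype, how the global batch is split) is an assumption, and t5x-style example packing is not reproduced here.

```python
# Hedged approximation only: the released model was trained with t5x, not this code.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="muri-101-finetune",   # hypothetical output path
    per_device_train_batch_size=8,    # together with accumulation, targets the
    gradient_accumulation_steps=8,    # reported global batch size of 64
    learning_rate=3e-4,               # as listed above
    lr_scheduler_type="constant",     # "without a scheduler"
    num_train_epochs=5,
    bf16=True,                        # assumption; precision is not stated in the card
)
```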
## Results
We compare **MURI-101** against state-of-the-art models for multilingual instruction following. MURI-101 outperforms most multilingual models, except for Aya, across both NLU and NLG datasets.

| Language | Okapi | mT0 | mT0x | Aya-101 | MURI-101 |
|----------|-------|-----|------|---------|----------|
| arb | 27.7 | 31.5 | 31.6 | 38.2 | 36.5 |
| ben | 26.8 | 31.6 | 30.2 | 35.8 | 33.0 |
| cat | 30.5 | 32.8 | 32.6 | 39.6 | 38.8 |
| dan | 31.8 | 33.0 | 32.0 | 39.7 | 38.4 |
| deu | 31.7 | 32.7 | 32.5 | 39.7 | 38.9 |
| ... | ... | ... | ... | ... | ... |
| vie | 27.5 | 30.9 | 31.1 | 34.8 | 36.8 |
| zho | 28.2 | 32.5 | 31.6 | 38.3 | 36.9 |
| Avg. | 28.8 | 31.5 | 30.8 | 37.3 | 36.0 |
Additionally, our model complements Aya effectively, especially in low-resource settings.

| Language | mT5 | Aya_1 | Aya_1 + MURI_1 |
|----------|-----|-------|----------------|
| aze | 20.4 | 37.0 | 39.5 |
| bel | 22.4 | 32.1 | 33.7 |
| bul | 20.7 | 34.4 | 38.1 |
| cym | 18.4 | 33.0 | 35.5 |
| gla | 19.3 | 28.7 | 35.2 |
| kaz | 19.8 | 44.7 | 46.7 |
| khm | 16.5 | 30.0 | 31.3 |
| lao | 21.3 | 32.7 | 33.0 |
| slk | 19.2 | 38.1 | 39.1 |
| slv | 18.9 | 40.3 | 39.6 |
| Avg. | 19.7 | 35.1 | **37.2** |
## Use
To load and run the model, use either of the following:
### AutoModelForSeq2SeqLM
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer, and move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
muri = AutoModelForSeq2SeqLM.from_pretrained("akoksal/muri-101").to(device)
tokenizer = AutoTokenizer.from_pretrained("akoksal/muri-101")

instruction = "Verilen cümlenin pozitif mi negatif mi olduğunu tahmin edin: Hayatta kesinlikle izlenmemesi gereken filmler kategorisindeki listemin en başına bu filmi koyarım."
# English translation of the Turkish instruction: Guess whether the given sentence is positive or negative: I would put this movie at the very top of the list of movies that absolutely should not be watched in life.
inputs = tokenizer(instruction, return_tensors="pt").to(device)
outputs = muri.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# > negatif
# (negative)
```
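Continuing from the snippet above, a small hedged sketch of batching instructions in several languages into a single `generate` call; the prompts are illustrative examples, not taken from the paper or this card:

```python
# Illustrative prompts (assumptions), reusing `tokenizer`, `muri`, and `device` from above
instructions = [
    "Translate to English: Das Wetter ist heute sehr schön.",  # German
    "Resume en una frase: Los arqueólogos encontraron una botella con un mensaje de hace 200 años en el norte de Francia.",  # Spanish
]
batch = tokenizer(instructions, return_tensors="pt", padding=True).to(device)
outputs = muri.generate(**batch, max_new_tokens=40)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```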
### Pipeline
```python
from transformers import pipeline

muri = pipeline("text2text-generation", model="akoksal/muri-101")
muri(
    """این مقاله را خلاصه کنید
...تیم دانش‌آموزی کاوش باستانی یک بطری حاوی پیغام ۲۰۰ ساله در شمال فرانسه پیدا کردند""",
    max_new_tokens=150,
    do_sample=True,
    temperature=0.9,
    top_p=0.8,
)
# Input (translated from Persian): Summarize this article
# A student team of archaeologists found a bottle containing a 200-year-old message in northern France ... [300 words]
# > در طول سالیان متمادی باستان شناسان فرانسوی تلاش زیادی برای پیدا کردن آثار و اشیای باستانی انجام داده اند اما این بار پیدا شدن بطری حاوی پیغامی به بیش از دو قرن پیش از آن تاریخ نشان می دهد.
# > (Translation) Over the years, French archaeologists have made great efforts to find ancient works and objects, but this time the discovery of a bottle containing a message shows a date of more than two centuries ago.
```
Thanks to [Google's TRC program](https://sites.research.google/trc/about/) for supporting the training of this model.
Check out [the paper](https://arxiv.org/abs/2409.12958) for more detailed information on the experiments and results.
## Citation
```bibtex
@misc{koksal2024muri,
title={MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions},
author={Abdullatif Köksal and Marion Thaler and Ayyoob Imani and Ahmet Üstün and Anna Korhonen and Hinrich Schütze},
year={2024},
eprint={2409.12958},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.12958},
}
```