---
license: apache-2.0
datasets:
- akoksal/muri-it
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cos
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fas
- fin
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- haw
- hbs
- heb
- hin
- hun
- hye
- ibo
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lat
- lav
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nya
- pan
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
base_model:
- google/mt5-xxl
pipeline_tag: text2text-generation
---
# MURI-101: Multilingual Instruction-Following Model for 101 languages (mT5-XXL)
MURI-101 is a multilingual instruction-following model, fine-tuned on a subset of the [**MURI-IT**](https://huggingface.co/datasets/akoksal/muri-it) dataset. It supports **101 languages** and outperforms most multilingual models in both **Natural Language Understanding (NLU)** and **Natural Language Generation (NLG)** tasks, especially in low-resource settings.

The model was trained on data built via multilingual reverse instructions, so outputs remain culturally and linguistically appropriate for the target language and translation artifacts are reduced.

[Paper](https://arxiv.org/abs/2409.12958)
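The dataset card for [MURI-IT](https://huggingface.co/datasets/akoksal/muri-it) documents the full collection; as a quick, hedged way to inspect the data the model was fine-tuned on (the split and column names below are assumptions, so check the dataset card), something like the following should work:

```python
# Hedged sketch for browsing the MURI-IT data; the split name and the exact
# column names are assumptions, not taken from this model card.
from datasets import load_dataset

muri_it = load_dataset("akoksal/muri-it", split="train")
print(muri_it)      # dataset size and column names
print(muri_it[0])   # one instruction/output example
```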
### Model Architecture
- **Base Model**: mT5-XXL
- **Training Data**: Subset of MURI-IT
- **Training Setup**: Trained with [t5x](https://github.com/google-research/t5x) on a TPU v4-32. Batch size: 64, data packing enabled, learning rate: 3e-4 with no scheduler, 5 epochs. (A hedged Hugging Face approximation of these hyperparameters is sketched below.)
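The released checkpoint was produced with t5x, but as a rough point of reference, the hyperparameters above expressed as Hugging Face `Seq2SeqTrainingArguments` might look like the sketch below. Everything not listed above (output path, dtype, how the global batch is split) is an assumption, and t5x-style example packing is not reproduced here.

```python
# Hedged approximation only: the released model was trained with t5x, not this code.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="muri-101-finetune",   # hypothetical output path
    per_device_train_batch_size=8,    # together with accumulation, targets the
    gradient_accumulation_steps=8,    # reported global batch size of 64
    learning_rate=3e-4,               # as listed above
    lr_scheduler_type="constant",     # "without a scheduler"
    num_train_epochs=5,
    bf16=True,                        # assumption; precision is not stated in the card
)
```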
## Results
We compare **MURI-101** against state-of-the-art models for multilingual instruction following. MURI-101 outperforms most multilingual models, except for Aya, across both NLU and NLG datasets.

| Language | Okapi | mT0 | mT0x | Aya-101 | MURI-101 |
|----------|-------|-----|------|---------|----------|
| arb | 27.7 | 31.5 | 31.6 | 38.2 | 36.5 |
| ben | 26.8 | 31.6 | 30.2 | 35.8 | 33.0 |
| cat | 30.5 | 32.8 | 32.6 | 39.6 | 38.8 |
| dan | 31.8 | 33.0 | 32.0 | 39.7 | 38.4 |
| deu | 31.7 | 32.7 | 32.5 | 39.7 | 38.9 |
| ... | ... | ... | ... | ... | ... |
| vie | 27.5 | 30.9 | 31.1 | 34.8 | 36.8 |
| zho | 28.2 | 32.5 | 31.6 | 38.3 | 36.9 |
| Avg. | 28.8 | 31.5 | 30.8 | 37.3 | 36.0 |
Additionally, our model complements Aya effectively, especially in low-resource settings.

| Language | mT5 | Aya_1 | Aya_1 + MURI_1 |
|----------|-----|-------|----------------|
| aze | 20.4 | 37.0 | 39.5 |
| bel | 22.4 | 32.1 | 33.7 |
| bul | 20.7 | 34.4 | 38.1 |
| cym | 18.4 | 33.0 | 35.5 |
| gla | 19.3 | 28.7 | 35.2 |
| kaz | 19.8 | 44.7 | 46.7 |
| khm | 16.5 | 30.0 | 31.3 |
| lao | 21.3 | 32.7 | 33.0 |
| slk | 19.2 | 38.1 | 39.1 |
| slv | 18.9 | 40.3 | 39.6 |
| Avg. | 19.7 | 35.1 | **37.2** |
## Use
To load and run the model, use either of the following:
### AutoModelForSeq2SeqLM
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer, and move the model to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
muri = AutoModelForSeq2SeqLM.from_pretrained("akoksal/muri-101").to(device)
tokenizer = AutoTokenizer.from_pretrained("akoksal/muri-101")

instruction = "Verilen cümlenin pozitif mi negatif mi olduğunu tahmin edin: Hayatta kesinlikle izlenmemesi gereken filmler kategorisindeki listemin en başına bu filmi koyarım."
# English translation of the Turkish instruction: Guess whether the given sentence is positive or negative: I would put this movie at the very top of the list of movies that absolutely should not be watched in life.
inputs = tokenizer(instruction, return_tensors="pt").to(device)
outputs = muri.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# > negatif
# (negative)
```
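Continuing from the snippet above, a small hedged sketch of batching instructions in several languages into a single `generate` call; the prompts are illustrative examples, not taken from the paper or this card:

```python
# Illustrative prompts (assumptions), reusing `tokenizer`, `muri`, and `device` from above
instructions = [
    "Translate to English: Das Wetter ist heute sehr schön.",  # German
    "Resume en una frase: Los arqueólogos encontraron una botella con un mensaje de hace 200 años en el norte de Francia.",  # Spanish
]
batch = tokenizer(instructions, return_tensors="pt", padding=True).to(device)
outputs = muri.generate(**batch, max_new_tokens=40)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```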
### Pipeline
```python
from transformers import pipeline

muri = pipeline("text2text-generation", model="akoksal/muri-101")
muri(
    """این مقاله را خلاصه کنید
...تیم دانش‌آموزی کاوش باستانی یک بطری حاوی پیغام ۲۰۰ ساله در شمال فرانسه پیدا کردند""",
    max_new_tokens=150,
    do_sample=True,
    temperature=0.9,
    top_p=0.8,
)
# Input (translated from Persian): Summarize this article
# A student team of archaeologists found a bottle containing a 200-year-old message in northern France ... [300 words]
# > در طول سالیان متمادی باستان شناسان فرانسوی تلاش زیادی برای پیدا کردن آثار و اشیای باستانی انجام داده اند اما این بار پیدا شدن بطری حاوی پیغامی به بیش از دو قرن پیش از آن تاریخ نشان می دهد.
# > (Translation) Over the years, French archaeologists have made great efforts to find ancient works and objects, but this time the discovery of a bottle containing a message shows a date of more than two centuries ago.
```
Thanks to [Google's TRC program](https://sites.research.google/trc/about/) for supporting the training of this model.
Check out [the paper](https://arxiv.org/abs/2409.12958) for more detailed information on the experiments and results.
## Citation
```bibtex
@misc{koksal2024muri,
title={MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions},
author={Abdullatif Köksal and Marion Thaler and Ayyoob Imani and Ahmet Üstün and Anna Korhonen and Hinrich Schütze},
year={2024},
eprint={2409.12958},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.12958},
}
```