---
license: apache-2.0
datasets:
- akoksal/muri-it
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cos
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fas
- fin
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- haw
- hbs
- heb
- hin
- hun
- hye
- ibo
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lat
- lav
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nya
- pan
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
base_model:
- google/mt5-xxl
pipeline_tag: text2text-generation
---
# MURI-101: Multilingual Instruction-Following Model for 101 languages (mT5-XXL)
MURI-101 is a multilingual instruction-following model, fine-tuned using a subset of the [**MURI-IT**](https://huggingface.co/datasets/akoksal/muri-it) dataset. It supports **101 languages** and outperforms most multilingual models in both **Natural Language Understanding (NLU)** and **Natural Language Generation (NLG)** tasks, especially in low-resource settings.
The model was trained on data generated with multilingual reverse instructions, which keeps outputs culturally and linguistically appropriate for the target language and reduces translation artifacts.
[Paper](https://arxiv.org/abs/2409.12958)
### Model Architecture
- **Base Model**: mT5-XXL
- **Training Data**: Subset of MURI-IT
- **Training Setup**: Trained with [t5x](https://github.com/google-research/t5x) on 32 TPU v4-32. Batch size: 64, data packing enabled, learning rate: 3e-4 without a scheduler, 5 epochs.
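
For reference, below is a minimal sketch of an analogous fine-tuning run with the Hugging Face `Seq2SeqTrainer`. The released model was trained with t5x on TPUs, so this is only an approximation of the recipe: the MURI-IT column names (`input`, `output`), the config/split selection, and the sequence lengths are assumptions, and t5x-style data packing is not replicated.

```python
# Approximate re-creation of the training setup with the Hugging Face Trainer.
# Assumptions (not from the model card): MURI-IT exposes "input"/"output" text columns
# and loads without an extra config name; adjust to the actual dataset schema.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xxl")
dataset = load_dataset("akoksal/muri-it", split="train")

def preprocess(example):
    # Tokenize instruction and target text; lengths here are illustrative.
    model_inputs = tokenizer(example["input"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=example["output"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="muri-101-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,   # effective batch size 64, matching the card
    learning_rate=3e-4,               # constant LR, no scheduler, as in the card
    lr_scheduler_type="constant",
    num_train_epochs=5,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```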
## Results
We compare **MURI-101** against state-of-the-art models for multilingual instruction following. MURI-101 outperforms most multilingual models, except for Aya, across both NLU and NLG datasets.
| Language | Okapi | mT0 | mT0x | Aya-101 | MURI-101 |
|----------|-------|-----|------|---------|----------|
| arb | 27.7 | 31.5 | 31.6 | 38.2 | 36.5 |
| ben | 26.8 | 31.6 | 30.2 | 35.8 | 33.0 |
| cat | 30.5 | 32.8 | 32.6 | 39.6 | 38.8 |
| dan | 31.8 | 33.0 | 32.0 | 39.7 | 38.4 |
| deu | 31.7 | 32.7 | 32.5 | 39.7 | 38.9 |
...
| vie | 27.5 | 30.9 | 31.1 | 34.8 | 36.8 |
| zho | 28.2 | 32.5 | 31.6 | 38.3 | 36.9 |
| Avg. | 28.8 | 31.5 | 30.8 | 37.3 | 36.0 |
Additionally, our model complements Aya effectively, especially in low-resource settings.
| Language | mT5 | Aya_1 | Aya_1 + MURI_1 |
|-------------------|------|-------|----------------|
| aze | 20.4 | 37.0 | 39.5 |
| bel | 22.4 | 32.1 | 33.7 |
| bul | 20.7 | 34.4 | 38.1 |
| cym | 18.4 | 33.0 | 35.5 |
| gla | 19.3 | 28.7 | 35.2 |
| kaz | 19.8 | 44.7 | 46.7 |
| khm | 16.5 | 30.0 | 31.3 |
| lao | 21.3 | 32.7 | 33.0 |
| slk | 19.2 | 38.1 | 39.1 |
| slv | 18.9 | 40.3 | 39.6 |
| Avg. | 19.7 | 35.1 | **37.2** |
## Use
To load and run the model, use either of the following approaches:
### AutoModelForSeq2SeqLM
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Move the model to GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
muri = AutoModelForSeq2SeqLM.from_pretrained("akoksal/muri-101").to(device)
tokenizer = AutoTokenizer.from_pretrained("akoksal/muri-101")

instruction = "Verilen cümlenin pozitif mi negatif mi olduğunu tahmin edin: Hayatta kesinlikle izlenmemesi gereken filmler kategorisindeki listemin en başına bu filmi koyarım."
# Turkish to English translation: Guess whether the given sentence is positive or negative: I would put this movie at the very top of the list of movies that absolutely should not be watched in life.

inputs = tokenizer(instruction, return_tensors="pt").to(device)
outputs = muri.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# > negatif
# (negative)
```
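The same objects can be reused for batched, multi-prompt inference. A small sketch reusing `muri`, `tokenizer`, and `device` from the snippet above; the prompts are illustrative and not from the model card:

```python
# Batched generation with padding; prompts may be in any of the supported languages.
prompts = [
    "Translate to French: The weather is nice today.",
    "¿Cuál es la capital de Japón?",  # Spanish: What is the capital of Japan?
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
outputs = muri.generate(**batch, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```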
### Pipeline
```python
from transformers import pipeline
muri = pipeline("text2text-generation", model="akoksal/muri-101")
muri(
    """این مقاله را خلاصه کنید
...تیم دانشآموزی کاوش باستانی یک بطری حاوی پیغام ۲۰۰ ساله در شمال فرانسه پیدا کردند""",
    max_new_tokens=150,
    do_sample=True,
    temperature=0.9,
    top_p=0.8,
)
# Input (translated): "Summarize this article:
# A student team of archaeologists found a bottle containing a 200-year-old message in northern France ..." [~300 words]
# > در طول سالیان متمادی باستان شناسان فرانسوی تلاش زیادی برای پیدا کردن آثار و اشیای باستانی انجام داده اند اما این بار پیدا شدن بطری حاوی پیغامی به بیش از دو قرن پیش از آن تاریخ نشان می دهد.
# > (Output, translated: Over the years, French archaeologists have made great efforts to find ancient works and objects, but this time the discovery of a bottle containing a message points to a date more than two centuries earlier.)
```
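mT5-XXL has roughly 13B parameters, so full-precision inference needs substantial memory. A minimal sketch of loading the pipeline in bfloat16 on available GPUs, assuming a recent `transformers` version with `accelerate` installed:

```python
import torch
from transformers import pipeline

# bfloat16 roughly halves memory versus float32; device_map="auto" lets
# accelerate place the weights on the available GPU(s).
muri = pipeline(
    "text2text-generation",
    model="akoksal/muri-101",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```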
Thanks to [Google's TRC program](https://sites.research.google/trc/about/) for supporting the training of this model.
Check out [the paper](https://arxiv.org/abs/2409.12958) for more detailed information on the experiments and results.
## Citation
```
@misc{koksal2024muri,
title={MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions},
author={Abdullatif Köksal and Marion Thaler and Ayyoob Imani and Ahmet Üstün and Anna Korhonen and Hinrich Schütze},
year={2024},
eprint={2409.12958},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.12958},
}
``` |