---
license: apache-2.0
datasets:
- akoksal/muri-it
language:
- afr
- amh
- ara
- aze
- bel
- ben
- bul
- cat
- ceb
- ces
- cos
- cym
- dan
- deu
- ell
- eng
- epo
- est
- eus
- fas
- fin
- fra
- fry
- gla
- gle
- glg
- guj
- hat
- hau
- haw
- hbs
- heb
- hin
- hun
- hye
- ibo
- isl
- ita
- jav
- jpn
- kan
- kat
- kaz
- khm
- kir
- kor
- kur
- lao
- lat
- lav
- lit
- ltz
- mal
- mar
- mkd
- mlg
- mlt
- mon
- mri
- msa
- mya
- nep
- nld
- nor
- nya
- pan
- pol
- por
- pus
- ron
- rus
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- sqi
- sun
- swa
- swe
- tam
- tel
- tgk
- tha
- tur
- ukr
- urd
- uzb
- vie
- xho
- yid
- yor
- zho
- zul
base_model:
- google/mt5-xxl
pipeline_tag: text2text-generation
---
# MURI-101: Multilingual Instruction-Following Model for 101 Languages (mT5-XXL)

MURI-101 is a multilingual instruction-following model fine-tuned on a subset of the [**MURI-IT**](https://huggingface.co/datasets/akoksal/muri-it) dataset. It supports **101 languages** and outperforms most multilingual models on both **natural language understanding (NLU)** and **natural language generation (NLG)** tasks, especially in low-resource settings.

The model was trained on data built with multilingual reverse instructions, which keeps outputs culturally and linguistically appropriate for the target language and reduces translation artifacts.

[Paper](https://arxiv.org/abs/2409.12958)

### Model Architecture
- **Base Model**: mT5-XXL
- **Training Data**: Subset of MURI-IT
- **Training Setup**: Trained with [t5x](https://github.com/google-research/t5x) on a TPU v4-32. Batch size: 64, data packing enabled, learning rate: 3e-4 with no scheduler, 5 epochs.

## Results

We compare **MURI-101** against state-of-the-art models for multilingual instruction following. MURI-101 outperforms most multilingual models, except for Aya, across both NLU and NLG datasets.

| Language | Okapi | mT0  | mT0x | Aya-101 | MURI-101 |
|----------|-------|------|------|---------|----------|
| arb      | 27.7  | 31.5 | 31.6 | 38.2    | 36.5     |
| ben      | 26.8  | 31.6 | 30.2 | 35.8    | 33.0     |
| cat      | 30.5  | 32.8 | 32.6 | 39.6    | 38.8     |
| dan      | 31.8  | 33.0 | 32.0 | 39.7    | 38.4     |
| deu      | 31.7  | 32.7 | 32.5 | 39.7    | 38.9     |
| ...      | ...   | ...  | ...  | ...     | ...      |
| vie      | 27.5  | 30.9 | 31.1 | 34.8    | 36.8     |
| zho      | 28.2  | 32.5 | 31.6 | 38.3    | 36.9     |
| Avg.     | 28.8  | 31.5 | 30.8 | 37.3    | 36.0     |

Additionally, our model complements Aya effectively, especially in low-resource settings.

| Language | mT5  | Aya_1 | Aya_1 + MURI_1 |
|----------|------|-------|----------------|
| aze      | 20.4 | 37.0  | 39.5           |
| bel      | 22.4 | 32.1  | 33.7           |
| bul      | 20.7 | 34.4  | 38.1           |
| cym      | 18.4 | 33.0  | 35.5           |
| gla      | 19.3 | 28.7  | 35.2           |
| kaz      | 19.8 | 44.7  | 46.7           |
| khm      | 16.5 | 30.0  | 31.3           |
| lao      | 21.3 | 32.7  | 33.0           |
| slk      | 19.2 | 38.1  | 39.1           |
| slv      | 18.9 | 40.3  | 39.6           |
| Avg.     | 19.7 | 35.1  | **37.2**       |

## Use

You can load and use the model as follows:

### AutoModelForSeq2SeqLM
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

muri = AutoModelForSeq2SeqLM.from_pretrained("akoksal/muri-101")
tokenizer = AutoTokenizer.from_pretrained("akoksal/muri-101")

instruction = "Verilen cümlenin pozitif mi negatif mi olduğunu tahmin edin: Hayatta kesinlikle izlenmemesi gereken filmler kategorisindeki listemin en başına bu filmi koyarım."
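# Note: `device` is used below but was not defined in the original snippet.
# The lines that follow are an assumed setup (not part of the original card)
# to make the example runnable end to end; adjust them to your hardware.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
muri = muri.to(device)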
# English translation of the Turkish instruction: Guess whether the given sentence is positive or negative: I would put this movie at the very top of my list of movies that should absolutely never be watched.
inputs = tokenizer(instruction, return_tensors="pt").to(device)
outputs = muri.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# > negatif
# (negative)
```

### Pipeline
```python
from transformers import pipeline

muri = pipeline("text2text-generation", model="akoksal/muri-101")

muri("""این مقاله را خلاصه کنید
...تیم دانش‌آموزی کاوش باستانی یک بطری حاوی پیغام ۲۰۰ ساله در شمال فرانسه پیدا کردند""",
     max_new_tokens=150,
     do_sample=True,
     temperature=0.9,
     top_p=0.8)
# English translation of the Persian input:
# Summarize this article
# A student team of archaeologists found a bottle containing a 200-year-old message in northern France ... [300 words]

# > در طول سالیان متمادی باستان شناسان فرانسوی تلاش زیادی برای پیدا کردن آثار و اشیای باستانی انجام داده اند اما این بار پیدا شدن بطری حاوی پیغامی به بیش از دو قرن پیش از آن تاریخ نشان می دهد.
# > Over the years, French archaeologists have made great efforts to find ancient works and objects, but this time the find is a bottle containing a message dating back more than two centuries.
```

Thanks to [Google's TRC program](https://sites.research.google/trc/about/) for supporting the training of this model.

Check out [the paper](https://arxiv.org/abs/2409.12958) for more detailed information on the experiments and results.

## Citation
```
@misc{koksal2024muri,
  title={MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions},
  author={Abdullatif Köksal and Marion Thaler and Ayyoob Imani and Ahmet Üstün and Anna Korhonen and Hinrich Schütze},
  year={2024},
  eprint={2409.12958},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.12958},
}
```