 ---
 license: apache-2.0
+datasets:
+  - CohereForAI/xP3x
+  - CohereForAI/aya_dataset
+  - CohereForAI/aya_collection
+  - DataProvenanceInitiative/Commercially-Verified-Licenses
+  - CohereForAI/aya_evaluation_suite
+language:
+  - afr
+  - amh
+  - ara
+  - aze
+  - bel
+  - ben
+  - bul
+  - cat
+  - ceb
+  - ces
+  - cym
+  - dan
+  - deu
+  - ell
+  - eng
+  - epo
+  - est
+  - eus
+  - fin
+  - fil
+  - fra
+  - fry
+  - gla
+  - gle
+  - glg
+  - guj
+  - hat
+  - hau
+  - heb
+  - hin
+  - hun
+  - hye
+  - ibo
+  - ind
+  - isl
+  - ita
+  - jav
+  - jpn
+  - kan
+  - kat
+  - kaz
+  - khm
+  - kir
+  - kor
+  - kur
+  - lao
+  - lav
+  - lat
+  - lit
+  - ltz
+  - mal
+  - mar
+  - mkd
+  - mlg
+  - mlt
+  - mon
+  - mri
+  - msa
+  - mya
+  - nep
+  - nld
+  - nor
+  - nso
+  - nya
+  - ory
+  - pan
+  - pes
+  - pol
+  - por
+  - pus
+  - ron
+  - rus
+  - sin
+  - slk
+  - slv
+  - smo
+  - sna
+  - snd
+  - som
+  - sot
+  - spa
+  - sqi
+  - srp
+  - sun
+  - swa
+  - swe
+  - tam
+  - tel
+  - tgk
+  - tha
+  - tur
+  - twi
+  - ukr
+  - urd
+  - uzb
+  - vie
+  - xho
+  - yid
+  - yor
+  - zho
+  - zul
+metrics:
+  - accuracy
+  - bleu
 ---
+<img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
+# Model Card for Aya Model
+## Model Summary
+> The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
+> Aya outperforms [mT0](https://huggingface.co/bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
+> The Aya model is trained using [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
+> We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
+> multilingual world.
+- **Developed by:** Cohere For AI
+- **Model type:** a Transformer-style autoregressive massively multilingual language model.
+- **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com)
+- **Point of Contact**: [Ahmet Ustun](mailto:[email protected])
+- **Languages**: Refer to the list of languages in the `language` section of this model card.
+- **License**: Apache-2.0
+- **Model**: [Aya](https://huggingface.co/CohereForAI/aya)
+- **Model Size**: 13 billion parameters
+- **Datasets**: [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
+## Use
+```python
+# pip install -q transformers
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+checkpoint = "CohereForAI/aya_model"
+
+# Load the tokenizer and the encoder-decoder (seq2seq) model weights.
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
+
+# Tokenize the instruction, generate a response, and decode it back to text.
+inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt")
+outputs = aya_model.generate(inputs)
+print(tokenizer.decode(outputs[0]))
+```
+## Model Details
+### Training
+- Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
+- Finetuning Steps: 25000
+- Hardware: TPUv4-128
+- Software: T5X, Jax
+### Data Sources
+The Aya model is trained on the following datasets:
+- [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x)
+- [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)
+- [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection)
+- [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
+- ShareGPT-Command
+All datasets are filtered down to the 101 languages supported by [mT5](https://huggingface.co/google/mt5-xxl). See the [paper](arxiv.com) for details about filtering and pruning.
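As a rough illustration of what restricting data to a fixed language list can look like, the sketch below filters records by ISO 639-3 code. The record layout (`language` and `text` keys), the helper name `filter_by_language`, and the abbreviated code set are assumptions made for this example only; the actual filtering and pruning procedure is the one described in the paper.

```python
# Abbreviated, illustrative subset of the ISO codes listed in this card's metadata.
SUPPORTED_LANGUAGES = {"afr", "amh", "ara", "eng", "fra", "zho", "zul"}


def filter_by_language(records, supported=SUPPORTED_LANGUAGES):
    """Yield only records whose (hypothetical) `language` field is supported."""
    for record in records:
        if record.get("language") in supported:
            yield record


examples = [
    {"language": "eng", "text": "Hello"},
    {"language": "fra", "text": "Bonjour"},
    {"language": "bre", "text": "Demat"},  # Breton is not in the set, so it is dropped
]

kept = list(filter_by_language(examples))
print([record["language"] for record in kept])
```

Using a generator keeps the filter lazy, so the same helper would also work on a streamed dataset without loading everything into memory.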
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+> We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages, including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and
+> in-distribution performance.
+Below, we provide evaluation results for the Aya model on unseen discriminative tasks and in-distribution generative tasks, compared to mT0, BLOOMZ, Bactrian-X 13B, and mT0x. To ensure a fair comparison with our Aya model in terms of language coverage, we finetune a new variant of mT5, which we call mT0x. It is trained using the original datasets that are part of the xP3 collection but extended to 101 languages (xP3x).
+For Multilingual MMLU and for simulated and human win rates, please refer to the [paper](arxiv.com).
+### Discriminative Tasks
+| Model             | Base Model | IFT Mixture | XCOPA (Acc %) | XNLI (Acc %) | XSC (Acc %) | XWG (Acc %) | **<u>Avg</u>** |
+| :---------------- | :--------- | :---------: | :-----------: | :----------: | :---------: | :---------: | :------------: |
+| **46 Languages**  |            |             |               |              |             |             |                |
+| mT0               | mT5 13B    |     xP3     |     75.6      |     55.3     |    87.2     |    73.6     |      72.9      |
+| BLOOMZ            | BLOOM 176B |     xP3     |     64.3      |     52.0     |    82.6     |    63.3     |      65.5      |
+| **52 Languages**  |            |             |               |              |             |             |                |
+| Bactrian-X 13B    | Llama 13B  | Bactrian-X  |     52.4      |     34.5     |    51.8     |    50.5     |      47.3      |
+| **101 Languages** |            |             |               |              |             |             |                |
+| mT0x              | mT5 13B    |    xP3x     |     71.7      |     45.9     |    85.1     |    60.6     |      65.8      |
+| Aya model         | mT5 13B    | All Mixture |     76.7      |     58.3     |    90.0     |    70.7     |      73.9      |
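The Avg column is not defined explicitly here. The reported values are consistent with an unweighted mean of the four task accuracies, and the sketch below checks that reading for three of the rows. Treating Avg as a plain mean is an assumption, and `macro_average` is a name introduced only for this example.

```python
# Per-task accuracies (%) copied from the table above: XCOPA, XNLI, XSC, XWG.
scores = {
    "mT0": [75.6, 55.3, 87.2, 73.6],
    "mT0x": [71.7, 45.9, 85.1, 60.6],
    "Aya model": [76.7, 58.3, 90.0, 70.7],
}


def macro_average(values):
    """Unweighted mean of the task scores, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


averages = {name: macro_average(values) for name, values in scores.items()}
print(averages)
```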
+### Generative Tasks
+| Model             | Base Model | IFT Mixture | FLORES-200 X→En (spBLEU) | FLORES-200 En→X (spBLEU) | XLSum (RougeLsum) | Tydi-QA (F1) |
+| :---------------- | :--------: | :---------- | :----------------------: | :----------------------: | :---------------: | :----------: |
+| **101 Languages** |            |             |                          |                          |                   |              |
+| mT0x              |  mT5 13B   | xP3x        |           20.2           |           14.5           |       21.4        |     76.1     |
+| Aya model         |  mT5 13B   | All Mixture |           29.1           |           19.0           |       22.0        |     77.8     |
+Note: We cannot compare against mT0 and BLOOMZ on the above generative tasks, as the validation splits are part of their training data.
+## Bias, Risks, and Limitations
+As with any base language model or fine-tuned model without safety filtering, it is relatively easy for a user to prompt this model to generate harmful or otherwise sensitive content.
+The Aya model, as released, does not include any safety filtering.
+We hope that the release of the Aya model will make community-based red-teaming efforts possible by exposing an open-source, massively multilingual model for community research.
+For a detailed overview of our efforts at safety mitigation and at benchmarking toxicity and bias across multiple languages, we refer readers to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
+## Citation
+**BibTeX:**
+```
+@article{,
+  title={},
+  author={},
+  journal={Preprint},
+  year={2024}
+}
+```
+**APA:**
+## Languages Covered
+Below is the list of languages used in finetuning the Aya model. We group languages into higher-, mid-, and lower-resourced categories based on the language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, refer to our [paper](arxiv.com).
+| ISO Code | Language Name   |    Script    |     Family      |    Subgrouping    | Resourcedness |
+| :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
+| afr      | Afrikaans       |    Latin     |  Indo-European  |     Germanic      |      Mid      |
+| amh      | Amharic         |    Ge'ez     |  Afro-Asiatic   |      Semitic      |      Low      |
+| ara      | Arabic          |    Arabic    |  Afro-Asiatic   |      Semitic      |     High      |
+| aze      | Azerbaijani     | Arabic/Latin |     Turkic      |   Common Turkic   |      Low      |
+| bel      | Belarusian      |   Cyrillic   |  Indo-European  |   Balto-Slavic    |      Mid      |
+| ben      | Bengali         |   Bengali    |  Indo-European  |    Indo-Aryan     |      Mid      |
+| bul      | Bulgarian       |   Cyrillic   |  Indo-European  |   Balto-Slavic    |      Mid      |
+| cat      | Catalan         |    Latin     |  Indo-European  |      Italic       |     High      |
+| ceb      | Cebuano         |    Latin     |  Austronesian   | Malayo-Polynesian |      Mid      |
+| ces      | Czech           |    Latin     |  Indo-European  |   Balto-Slavic    |     High      |
+| cym      | Welsh           |    Latin     |  Indo-European  |      Celtic       |      Low      |
+| dan      | Danish          |    Latin     |  Indo-European  |     Germanic      |      Mid      |
+| deu      | German          |    Latin     |  Indo-European  |     Germanic      |     High      |
+| ell      | Greek           |    Greek     |  Indo-European  |  Graeco-Phrygian  |      Mid      |
+| eng      | English         |    Latin     |  Indo-European  |     Germanic      |     High      |
+| epo      | Esperanto       |    Latin     |   Constructed   |    Esperantic     |      Low      |
+| est      | Estonian        |    Latin     |     Uralic      |      Finnic       |      Mid      |
+| eus      | Basque          |    Latin     |     Basque      |         -         |     High      |
+| fin      | Finnish         |    Latin     |     Uralic      |      Finnic       |     High      |
+| fil      | Tagalog         |    Latin     |  Austronesian   | Malayo-Polynesian |      Mid      |
+| fra      | French          |    Latin     |  Indo-European  |      Italic       |     High      |
+| fry      | Western Frisian |    Latin     |  Indo-European  |     Germanic      |      Low      |
+| gla      | Scottish Gaelic |    Latin     |  Indo-European  |      Celtic       |      Low      |
+| gle      | Irish           |    Latin     |  Indo-European  |      Celtic       |      Low      |
+| glg      | Galician        |    Latin     |  Indo-European  |      Italic       |      Mid      |
+| guj      | Gujarati        |   Gujarati   |  Indo-European  |    Indo-Aryan     |      Low      |
+| hat      | Haitian Creole  |    Latin     |  Indo-European  |      Italic       |      Low      |
+| hau      | Hausa           |    Latin     |  Afro-Asiatic   |      Chadic       |      Low      |
+| heb      | Hebrew          |    Hebrew    |  Afro-Asiatic   |      Semitic      |      Mid      |
+| hin      | Hindi           |  Devanagari  |  Indo-European  |    Indo-Aryan     |     High      |
+| hun      | Hungarian       |    Latin     |     Uralic      |         -         |     High      |
+| hye      | Armenian        |   Armenian   |  Indo-European  |      Armenic      |      Low      |
+| ibo      | Igbo            |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| ind      | Indonesian      |    Latin     |  Austronesian   | Malayo-Polynesian |      Mid      |
+| isl      | Icelandic       |    Latin     |  Indo-European  |     Germanic      |      Low      |
+| ita      | Italian         |    Latin     |  Indo-European  |      Italic       |     High      |
+| jav      | Javanese        |    Latin     |  Austronesian   | Malayo-Polynesian |      Low      |
+| jpn      | Japanese        |   Japanese   |     Japonic     |     Japanesic     |     High      |
+| kan      | Kannada         |   Kannada    |    Dravidian    |  South Dravidian  |      Low      |
+| kat      | Georgian        |   Georgian   |   Kartvelian    |   Georgian-Zan    |      Mid      |
+| kaz      | Kazakh          |   Cyrillic   |     Turkic      |   Common Turkic   |      Mid      |
+| khm      | Khmer           |    Khmer     |  Austroasiatic  |      Khmeric      |      Low      |
+| kir      | Kyrgyz          |   Cyrillic   |     Turkic      |   Common Turkic   |      Low      |
+| kor      | Korean          |    Hangul    |    Koreanic     |      Korean       |     High      |
+| kur      | Kurdish         |    Latin     |  Indo-European  |      Iranian      |      Low      |
+| lao      | Lao             |     Lao      |    Tai-Kadai    |      Kam-Tai      |      Low      |
+| lav      | Latvian         |    Latin     |  Indo-European  |   Balto-Slavic    |      Mid      |
+| lat      | Latin           |    Latin     |  Indo-European  |      Italic       |      Mid      |
+| lit      | Lithuanian      |    Latin     |  Indo-European  |   Balto-Slavic    |      Mid      |
+| ltz      | Luxembourgish   |    Latin     |  Indo-European  |     Germanic      |      Low      |
+| mal      | Malayalam       |  Malayalam   |    Dravidian    |  South Dravidian  |      Low      |
+| mar      | Marathi         |  Devanagari  |  Indo-European  |    Indo-Aryan     |      Low      |
+| mkd      | Macedonian      |   Cyrillic   |  Indo-European  |   Balto-Slavic    |      Low      |
+| mlg      | Malagasy        |    Latin     |  Austronesian   | Malayo-Polynesian |      Low      |
+| mlt      | Maltese         |    Latin     |  Afro-Asiatic   |      Semitic      |      Low      |
+| mon      | Mongolian       |   Cyrillic   | Mongolic-Khitan |     Mongolic      |      Low      |
+| mri      | Maori           |    Latin     |  Austronesian   | Malayo-Polynesian |      Low      |
+| msa      | Malay           |    Latin     |  Austronesian   | Malayo-Polynesian |      Mid      |
+| mya      | Burmese         |   Myanmar    |  Sino-Tibetan   |   Burmo-Qiangic   |      Low      |
+| nep      | Nepali          |  Devanagari  |  Indo-European  |    Indo-Aryan     |      Low      |
+| nld      | Dutch           |    Latin     |  Indo-European  |     Germanic      |     High      |
+| nor      | Norwegian       |    Latin     |  Indo-European  |     Germanic      |      Low      |
+| nso      | Northern Sotho  |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| nya      | Chichewa        |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| ory      | Oriya           |    Oriya     |  Indo-European  |    Indo-Aryan     |      Low      |
+| pan      | Punjabi         |   Gurmukhi   |  Indo-European  |    Indo-Aryan     |      Low      |
+| pes      | Persian         |    Arabic    |  Indo-European  |      Iranian      |     High      |
+| pol      | Polish          |    Latin     |  Indo-European  |   Balto-Slavic    |     High      |
+| por      | Portuguese      |    Latin     |  Indo-European  |      Italic       |     High      |
+| pus      | Pashto          |    Arabic    |  Indo-European  |      Iranian      |      Low      |
+| ron      | Romanian        |    Latin     |  Indo-European  |      Italic       |      Mid      |
+| rus      | Russian         |   Cyrillic   |  Indo-European  |   Balto-Slavic    |     High      |
+| sin      | Sinhala         |   Sinhala    |  Indo-European  |    Indo-Aryan     |      Low      |
+| slk      | Slovak          |    Latin     |  Indo-European  |   Balto-Slavic    |      Mid      |
+| slv      | Slovenian       |    Latin     |  Indo-European  |   Balto-Slavic    |      Mid      |
+| smo      | Samoan          |    Latin     |  Austronesian   | Malayo-Polynesian |      Low      |
+| sna      | Shona           |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| snd      | Sindhi          |    Arabic    |  Indo-European  |    Indo-Aryan     |      Low      |
+| som      | Somali          |    Latin     |  Afro-Asiatic   |     Cushitic      |      Low      |
+| sot      | Southern Sotho  |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| spa      | Spanish         |    Latin     |  Indo-European  |      Italic       |     High      |
+| sqi      | Albanian        |    Latin     |  Indo-European  |     Albanian      |      Low      |
+| srp      | Serbian         |   Cyrillic   |  Indo-European  |   Balto-Slavic    |     High      |
+| sun      | Sundanese       |    Latin     |  Austronesian   | Malayo-Polynesian |      Low      |
+| swa      | Swahili         |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| swe      | Swedish         |    Latin     |  Indo-European  |     Germanic      |     High      |
+| tam      | Tamil           |    Tamil     |    Dravidian    |  South Dravidian  |      Mid      |
+| tel      | Telugu          |    Telugu    |    Dravidian    |  South Dravidian  |      Low      |
+| tgk      | Tajik           |   Cyrillic   |  Indo-European  |      Iranian      |      Low      |
+| tha      | Thai            |     Thai     |    Tai-Kadai    |      Kam-Tai      |      Mid      |
+| tur      | Turkish         |    Latin     |     Turkic      |   Common Turkic   |     High      |
+| twi      | Twi             |    Latin     | Atlantic-Congo  |    Niger-Congo    |      Low      |
+| ukr      | Ukrainian       |   Cyrillic   |  Indo-European  |   Balto-Slavic    |      Mid      |
+| urd      | Urdu            |    Arabic    |  Indo-European  |    Indo-Aryan     |      Mid      |
+| uzb      | Uzbek           |    Latin     |     Turkic      |   Common Turkic   |      Mid      |
+| vie      | Vietnamese      |    Latin     |  Austroasiatic  |      Vietic       |     High      |
+| xho      | Xhosa           |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| yid      | Yiddish         |    Hebrew    |  Indo-European  |     Germanic      |      Low      |
+| yor      | Yoruba          |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
+| zho      | Chinese         |     Han      |  Sino-Tibetan   |      Sinitic      |     High      |
+| zul      | Zulu            |    Latin     | Atlantic-Congo  |    Benue-Congo    |      Low      |
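For readers who want to work with this table programmatically, one option is to parse the Markdown rows directly. The sketch below does this for three rows copied from the table and counts languages per resourcedness level. The helper name `parse_row` and the field names in `FIELDS` are assumptions introduced for this example.

```python
from collections import Counter

# Three rows copied verbatim from the table above, for illustration only.
ROWS = """\
| afr      | Afrikaans       |    Latin     |  Indo-European  |     Germanic      |      Mid      |
| amh      | Amharic         |    Ge'ez     |  Afro-Asiatic   |      Semitic      |      Low      |
| ara      | Arabic          |    Arabic    |  Afro-Asiatic   |      Semitic      |     High      |
"""

# Hypothetical field names, one per table column.
FIELDS = ["iso_code", "name", "script", "family", "subgrouping", "resourcedness"]


def parse_row(line):
    """Split one Markdown table row into a dict keyed by column name."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return dict(zip(FIELDS, cells))


languages = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
by_resourcedness = Counter(lang["resourcedness"] for lang in languages)
print(by_resourcedness)
```

Applied to the full table, the same loop gives the number of languages in each resourcedness group.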
+## Model Card Contact
+For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.