Files changed (6)
  1. .gitattributes +1 -1
  2. README.md +0 -320
  3. aya-fig1.png +0 -3
  4. config.json +3 -33
  5. generation_config.json +3 -7
  6. tokenizer_config.json +3 -38
.gitattributes CHANGED
@@ -33,4 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- tokenizer.json filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.json filter=lfs diff=lfs merge=lfs -text
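Reviewer note: the pattern change from `tokenizer.json` to `*.json` explains why the three JSON configs further down turn into Git LFS pointer files. Below is a minimal sketch of that effect. It assumes `fnmatch` on basenames is a close enough stand-in for gitattributes pattern matching, and it uses only the four rules visible in this hunk (the rest of `.gitattributes` is not shown here). The names `lfs_tracked`, `OLD`, `NEW`, and `CHANGED` are illustrative only.

```python
from fnmatch import fnmatch

# Text files touched by this change, taken from the file summary above.
CHANGED = ["README.md", "config.json", "generation_config.json", "tokenizer_config.json"]

# Only the rules visible in this hunk; the full .gitattributes has more.
OLD = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]
NEW = ["*.zip", "*.zst", "*tfevents*", "*.json"]


def lfs_tracked(name, patterns):
    """Return True if the basename matches any pattern (fnmatch approximation)."""
    return any(fnmatch(name, pattern) for pattern in patterns)


# Files that were plain text before the change and LFS-tracked after it.
newly_tracked = [f for f in CHANGED if lfs_tracked(f, NEW) and not lfs_tracked(f, OLD)]
print(newly_tracked)
# ['config.json', 'generation_config.json', 'tokenizer_config.json']
```

`README.md` is unaffected by the new rule, which matches the diff: it stays a regular text file while the JSON files are replaced by pointers.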
README.md CHANGED
@@ -1,323 +1,3 @@
1
  ---
2
  license: apache-2.0
3
- datasets:
4
- - CohereForAI/xP3x
5
- - CohereForAI/aya_dataset
6
- - CohereForAI/aya_collection
7
- - DataProvenanceInitiative/Commercially-Verified-Licenses
8
- - CohereForAI/aya_evaluation_suite
9
- language:
10
- - afr
11
- - amh
12
- - ara
13
- - aze
14
- - bel
15
- - ben
16
- - bul
17
- - cat
18
- - ceb
19
- - ces
20
- - cym
21
- - dan
22
- - deu
23
- - ell
24
- - eng
25
- - epo
26
- - est
27
- - eus
28
- - fin
29
- - fil
30
- - fra
31
- - fry
32
- - gla
33
- - gle
34
- - glg
35
- - guj
36
- - hat
37
- - hau
38
- - heb
39
- - hin
40
- - hun
41
- - hye
42
- - ibo
43
- - ind
44
- - isl
45
- - ita
46
- - jav
47
- - jpn
48
- - kan
49
- - kat
50
- - kaz
51
- - khm
52
- - kir
53
- - kor
54
- - kur
55
- - lao
56
- - lav
57
- - lat
58
- - lit
59
- - ltz
60
- - mal
61
- - mar
62
- - mkd
63
- - mlg
64
- - mlt
65
- - mon
66
- - mri
67
- - msa
68
- - mya
69
- - nep
70
- - nld
71
- - nor
72
- - nso
73
- - nya
74
- - ory
75
- - pan
76
- - pes
77
- - pol
78
- - por
79
- - pus
80
- - ron
81
- - rus
82
- - sin
83
- - slk
84
- - slv
85
- - smo
86
- - sna
87
- - snd
88
- - som
89
- - sot
90
- - spa
91
- - sqi
92
- - srp
93
- - sun
94
- - swa
95
- - swe
96
- - tam
97
- - tel
98
- - tgk
99
- - tha
100
- - tur
101
- - twi
102
- - ukr
103
- - urd
104
- - uzb
105
- - vie
106
- - xho
107
- - yid
108
- - yor
109
- - zho
110
- - zul
111
- metrics:
112
- - accuracy
113
- - bleu
114
  ---
115
-
116
- <img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
117
-
118
- # Model Card for Aya 101
119
-
120
- ## Model Summary
121
-
122
- > The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
123
- > Aya outperforms [mT0](https://huggingface.co/bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
124
- > The Aya model is trained using [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
125
- > We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
126
- > multilingual world.
127
-
128
- - **Developed by:** [Cohere For AI](https://cohere.for.ai)
129
- - **Model type:** a Transformer style autoregressive massively multilingual language model.
130
- - **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](https://arxiv.org/abs/2402.07827)
131
- - **Point of Contact**: Cohere For AI: [cohere.for.ai](https://cohere.for.ai)
132
- - **Languages**: Refer to the list of languages in the `language` section of this model card.
133
- - **License**: Apache-2.0
134
- - **Model**: [Aya-101](https://huggingface.co/CohereForAI/aya-101)
135
- - **Model Size**: 13 billion parameters
136
- - **Datasets**: [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
137
-
138
- ## Use
139
-
140
- ```python
141
- # pip install -q transformers
142
- from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
143
-
144
- checkpoint = "CohereForAI/aya-101"
145
-
146
- tokenizer = AutoTokenizer.from_pretrained(checkpoint)
147
- aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
148
-
149
- # Turkish to English translation
150
- tur_inputs = tokenizer.encode("Translate to English: Aya cok dilli bir dil modelidir.", return_tensors="pt")
151
- tur_outputs = aya_model.generate(tur_inputs, max_new_tokens=128)
152
- print(tokenizer.decode(tur_outputs[0]))
153
- # Aya is a multi-lingual language model
154
-
155
- # Q: Why are there so many languages in India?
156
- hin_inputs = tokenizer.encode("भारत में इतनी सारी भाषाएँ क्यों हैं?", return_tensors="pt")
157
- hin_outputs = aya_model.generate(hin_inputs, max_new_tokens=128)
158
- print(tokenizer.decode(hin_outputs[0]))
159
- # Expected output: भारत में कई भाषाएँ हैं और विभिन्न भाषाओं के बोली जाने वाले लोग हैं। यह विभिन्नता भाषाई विविधता और सांस्कृतिक विविधता का परिणाम है। Translates to "India has many languages and people speaking different languages. This diversity is the result of linguistic diversity and cultural diversity."
160
-
161
- ```
162
-
163
- ## Model Details
164
-
165
- ### Finetuning
166
-
167
- - Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
168
- - Number of Samples seen during Finetuning: 25M
169
- - Batch size: 256
170
- - Hardware: TPUv4-128
171
- - Software: T5X, Jax
172
-
173
- ### Data Sources
174
-
175
- The Aya model is trained on the following datasets:
176
-
177
- - [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x)
178
- - [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)
179
- - [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection)
180
- - [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
181
- - ShareGPT-Command
182
-
183
- All datasets are subset to the 101 languages supported by [mT5](https://huggingface.co/google/mt5-xxl). See the [paper](https://arxiv.org/abs/2402.07827) for details about filtering and pruning.
184
-
185
- ## Evaluation
186
-
187
- We refer to Section 5 from our paper for multilingual eval across 99 languages – including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance.
188
-
189
- ## Bias, Risks, and Limitations
190
-
191
-
192
- For a detailed overview of our effort at safety mitigation and benchmarking toxicity and bias across multiple languages, we refer to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](https://arxiv.org/abs/2402.07827).
193
-
194
- We hope that the release of the Aya model will make community-based redteaming efforts possible, by exposing an open-source massively-multilingual model for community research.
195
-
196
- ## Citation
197
-
198
- **BibTeX:**
199
-
200
- ```
201
- @article{üstün2024aya,
202
- title={Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model},
203
- author={Ahmet Üstün and Viraat Aryabumi and Zheng-Xin Yong and Wei-Yin Ko and Daniel D'souza and Gbemileke Onilude and Neel Bhandari and Shivalika Singh and Hui-Lee Ooi and Amr Kayid and Freddie Vargus and Phil Blunsom and Shayne Longpre and Niklas Muennighoff and Marzieh Fadaee and Julia Kreutzer and Sara Hooker},
204
- journal={arXiv preprint arXiv:2402.07827},
205
- year={2024}
206
- }
207
- ```
208
-
209
- ## Languages Covered
210
-
211
- <details>
212
- <summary>Click to see Languages Covered</summary>
213
-
214
- Below is the list of languages used in finetuning the Aya Model. We group languages into higher-, mid-, and lower-resourcedness based on a language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, we refer to our [paper](https://arxiv.org/abs/2402.07827).
215
-
216
- | ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
217
- | :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
218
- | afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
219
- | amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
220
- | ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
221
- | aze | Azerbaijani | Arabic/Latin | Turkic | Common Turkic | Low |
222
- | bel | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Mid |
223
- | ben | Bengali | Bengali | Indo-European | Indo-Aryan | Mid |
224
- | bul | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | Mid |
225
- | cat | Catalan | Latin | Indo-European | Italic | High |
226
- | ceb | Cebuano | Latin | Austronesian | Malayo-Polynesian | Mid |
227
- | ces | Czech | Latin | Indo-European | Balto-Slavic | High |
228
- | cym | Welsh | Latin | Indo-European | Celtic | Low |
229
- | dan | Danish | Latin | Indo-European | Germanic | Mid |
230
- | deu | German | Latin | Indo-European | Germanic | High |
231
- | ell | Greek | Greek | Indo-European | Graeco-Phrygian | Mid |
232
- | eng | English | Latin | Indo-European | Germanic | High |
233
- | epo | Esperanto | Latin | Constructed | Esperantic | Low |
234
- | est | Estonian | Latin | Uralic | Finnic | Mid |
235
- | eus | Basque | Latin | Basque | - | High |
236
- | fin | Finnish | Latin | Uralic | Finnic | High |
237
- | fil | Tagalog | Latin | Austronesian | Malayo-Polynesian | Mid |
238
- | fra | French | Latin | Indo-European | Italic | High |
239
- | fry | Western Frisian | Latin | Indo-European | Germanic | Low |
240
- | gla | Scottish Gaelic | Latin | Indo-European | Celtic | Low |
241
- | gle | Irish | Latin | Indo-European | Celtic | Low |
242
- | glg | Galician | Latin | Indo-European | Italic | Mid |
243
- | guj | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low |
244
- | hat | Haitian Creole | Latin | Indo-European | Italic | Low |
245
- | hau | Hausa | Latin | Afro-Asiatic | Chadic | Low |
246
- | heb | Hebrew | Hebrew | Afro-Asiatic | Semitic | Mid |
247
- | hin | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
248
- | hun | Hungarian | Latin | Uralic | - | High |
249
- | hye | Armenian | Armenian | Indo-European | Armenic | Low |
250
- | ibo | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low |
251
- | ind | Indonesian | Latin | Austronesian | Malayo-Polynesian | Mid |
252
- | isl | Icelandic | Latin | Indo-European | Germanic | Low |
253
- | ita | Italian | Latin | Indo-European | Italic | High |
254
- | jav | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
255
- | jpn | Japanese | Japanese | Japonic | Japanesic | High |
256
- | kan | Kannada | Kannada | Dravidian | South Dravidian | Low |
257
- | kat | Georgian | Georgian | Kartvelian | Georgian-Zan | Mid |
258
- | kaz | Kazakh | Cyrillic | Turkic | Common Turkic | Mid |
259
- | khm | Khmer | Khmer | Austroasiatic | Khmeric | Low |
260
- | kir | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low |
261
- | kor | Korean | Hangul | Koreanic | Korean | High |
262
- | kur | Kurdish | Latin | Indo-European | Iranian | Low |
263
- | lao | Lao | Lao | Tai-Kadai | Kam-Tai | Low |
264
- | lav | Latvian | Latin | Indo-European | Balto-Slavic | Mid |
265
- | lat | Latin | Latin | Indo-European | Italic | Mid |
266
- | lit | Lithuanian | Latin | Indo-European | Balto-Slavic | Mid |
267
- | ltz | Luxembourgish | Latin | Indo-European | Germanic | Low |
268
- | mal | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
269
- | mar | Marathi | Devanagari | Indo-European | Indo-Aryan | Low |
270
- | mkd | Macedonian | Cyrillic | Indo-European | Balto-Slavic | Low |
271
- | mlg | Malagasy | Latin | Austronesian | Malayo-Polynesian | Low |
272
- | mlt | Maltese | Latin | Afro-Asiatic | Semitic | Low |
273
- | mon | Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
274
- | mri | Maori | Latin | Austronesian | Malayo-Polynesian | Low |
275
- | msa | Malay | Latin | Austronesian | Malayo-Polynesian | Mid |
276
- | mya | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low |
277
- | nep | Nepali | Devanagari | Indo-European | Indo-Aryan | Low |
278
- | nld | Dutch | Latin | Indo-European | Germanic | High |
279
- | nor | Norwegian | Latin | Indo-European | Germanic | Low |
280
- | nso | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
281
- | nya | Chichewa | Latin | Atlantic-Congo | Benue-Congo | Low |
282
- | ory | Oriya | Oriya | Indo-European | Indo-Aryan | Low |
283
- | pan | Punjabi | Gurmukhi | Indo-European | Indo-Aryan | Low |
284
- | pes | Persian | Arabic | Indo-European | Iranian | High |
285
- | pol | Polish | Latin | Indo-European | Balto-Slavic | High |
286
- | por | Portuguese | Latin | Indo-European | Italic | High |
287
- | pus | Pashto | Arabic | Indo-European | Iranian | Low |
288
- | ron | Romanian | Latin | Indo-European | Italic | Mid |
289
- | rus | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
290
- | sin | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
291
- | slk | Slovak | Latin | Indo-European | Balto-Slavic | Mid |
292
- | slv | Slovenian | Latin | Indo-European | Balto-Slavic | Mid |
293
- | smo | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
294
- | sna | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
295
- | snd | Sindhi | Arabic | Indo-European | Indo-Aryan | Low |
296
- | som | Somali | Latin | Afro-Asiatic | Cushitic | Low |
297
- | sot | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
298
- | spa | Spanish | Latin | Indo-European | Italic | High |
299
- | sqi | Albanian | Latin | Indo-European | Albanian | Low |
300
- | srp | Serbian | Cyrillic | Indo-European | Balto-Slavic | High |
301
- | sun | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
302
- | swa | Swahili | Latin | Atlantic-Congo | Benue-Congo | Low |
303
- | swe | Swedish | Latin | Indo-European | Germanic | High |
304
- | tam | Tamil | Tamil | Dravidian | South Dravidian | Mid |
305
- | tel | Telugu | Telugu | Dravidian | South Dravidian | Low |
306
- | tgk | Tajik | Cyrillic | Indo-European | Iranian | Low |
307
- | tha | Thai | Thai | Tai-Kadai | Kam-Tai | Mid |
308
- | tur | Turkish | Latin | Turkic | Common Turkic | High |
309
- | twi | Twi | Latin | Atlantic-Congo | Niger-Congo | Low |
310
- | ukr | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | Mid |
311
- | urd | Urdu | Arabic | Indo-European | Indo-Aryan | Mid |
312
- | uzb | Uzbek | Latin | Turkic | Common Turkic | Mid |
313
- | vie | Vietnamese | Latin | Austroasiatic | Vietic | High |
314
- | xho | Xhosa | Latin | Atlantic-Congo | Benue-Congo | Low |
315
- | yid | Yiddish | Hebrew | Indo-European | Germanic | Low |
316
- | yor | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low |
317
- | zho | Chinese | Han | Sino-Tibetan | Sinitic | High |
318
- | zul | Zulu | Latin | Atlantic-Congo | Benue-Congo | Low |
319
- </details>
320
-
321
- ## Model Card Contact
322
-
323
- For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.
 
1
  ---
2
  license: apache-2.0
3
  ---
aya-fig1.png DELETED

Git LFS Details

  • SHA256: 52b18ad264847efa8f8f3947a5d845eb4856cf194af707a961a2e40318ac8108
  • Pointer size: 132 Bytes
  • Size of remote file: 1.23 MB
config.json CHANGED
@@ -1,33 +1,3 @@
1
- {
2
- "_name_or_path": "/home/patrick/t5/mt5-xxl",
3
- "architectures": [
4
- "T5ForConditionalGeneration"
5
- ],
6
- "classifier_dropout": 0.0,
7
- "d_ff": 10240,
8
- "d_kv": 64,
9
- "d_model": 4096,
10
- "decoder_start_token_id": 0,
11
- "dense_act_fn": "gelu_new",
12
- "dropout_rate": 0.1,
13
- "eos_token_id": 1,
14
- "feed_forward_proj": "gated-gelu",
15
- "initializer_factor": 1.0,
16
- "is_encoder_decoder": true,
17
- "is_gated_act": true,
18
- "layer_norm_epsilon": 1e-06,
19
- "model_type": "t5",
20
- "num_decoder_layers": 24,
21
- "num_heads": 64,
22
- "num_layers": 24,
23
- "output_past": true,
24
- "pad_token_id": 0,
25
- "relative_attention_max_distance": 128,
26
- "relative_attention_num_buckets": 32,
27
- "tie_word_embeddings": false,
28
- "tokenizer_class": "T5Tokenizer",
29
- "torch_dtype": "float32",
30
- "transformers_version": "4.37.2",
31
- "use_cache": true,
32
- "vocab_size": 250112
33
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d2a04833c0037ef377dc99390ca5187a546fcd745145009130cfc5b1b13127a8
3
+ size 836
generation_config.json CHANGED
@@ -1,7 +1,3 @@
1
- {
2
- "_from_model_config": true,
3
- "decoder_start_token_id": 0,
4
- "eos_token_id": 1,
5
- "pad_token_id": 0,
6
- "transformers_version": "4.37.2"
7
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e255d602baf228364ebfd4e787b9fa8a0f41d0ade452f76bdfd5cd57f7f9b8e4
3
+ size 142
tokenizer_config.json CHANGED
@@ -1,38 +1,3 @@
1
- {
2
- "added_tokens_decoder": {
3
- "0": {
4
- "content": "<pad>",
5
- "lstrip": false,
6
- "normalized": false,
7
- "rstrip": false,
8
- "single_word": false,
9
- "special": true
10
- },
11
- "1": {
12
- "content": "</s>",
13
- "lstrip": false,
14
- "normalized": false,
15
- "rstrip": false,
16
- "single_word": false,
17
- "special": true
18
- },
19
- "2": {
20
- "content": "<unk>",
21
- "lstrip": false,
22
- "normalized": false,
23
- "rstrip": false,
24
- "single_word": false,
25
- "special": true
26
- }
27
- },
28
- "additional_special_tokens": [],
29
- "clean_up_tokenization_spaces": true,
30
- "eos_token": "</s>",
31
- "extra_ids": 0,
32
- "legacy": true,
33
- "model_max_length": 1000000000000000019884624838656,
34
- "pad_token": "<pad>",
35
- "sp_model_kwargs": {},
36
- "tokenizer_class": "T5Tokenizer",
37
- "unk_token": "<unk>"
38
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9dd067894a66be11f993247863053a1550058d8b07ae7159049efbfef8195ce3
3
+ size 833
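
Reviewer note: after this change, `config.json`, `generation_config.json`, and `tokenizer_config.json` each contain a three-line Git LFS pointer instead of JSON, so anything that reads the raw file without resolving LFS will see the pointer text. The sketch below shows how such a pointer can be read, using the `tokenizer_config.json` pointer from this diff as input. The helper name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` layout seen in the three pointers above.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size in bytes of the real file in LFS storage
    }


# Pointer content copied from the tokenizer_config.json hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9dd067894a66be11f993247863053a1550058d8b07ae7159049efbfef8195ce3
size 833
"""

info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])
# sha256 833
```

The `size` field describes the file held in LFS storage, not the pointer itself, so the JSON content is still retrievable once the LFS object is fetched.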