juletxara committed on
Commit
5446f63
1 Parent(s): f2ffd7b

add tokenizer and update readme

Files changed (6)
  1. README.md +475 -0
  2. latxa.jpeg +0 -0
  3. special_tokens_map.json +23 -0
  4. tokenizer.json +0 -0
  5. tokenizer.model +3 -0
  6. tokenizer_config.json +35 -0
README.md CHANGED
@@ -1,3 +1,478 @@
1
  ---
2
  license: llama2
3
+ datasets:
4
+ - HiTZ/euscrawl
5
+ language:
6
+ - eu
7
+ - en
8
+ metrics:
9
+ - accuracy
10
+ - f1
11
+ - perplexity
12
+ pipeline_tag: text-generation
13
  ---
14
+
15
+ # **Model Card for Latxa 13b**
16
+
17
+ ![Latxa](latxa.jpeg)
18
+
19
+ Latxa is a collection of foundation models specifically tuned for Basque. Based on Meta’s LLaMA 2 model family, these models were further trained on EusCrawl, a highly curated Basque corpus ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)). Ranging from 7 billion to 70 billion parameters, these models are currently the biggest and best-performing LLMs built for Basque. This is the 13B repository; links to the other models can be found in the [Latxa Collection](https://huggingface.co/collections/HiTZ/latxa-65a697e6838b3acc53677304).
20
+
21
+
22
+ # **Model Details**
23
+
24
+
25
+ ## **Model Description**
26
+
27
+ Latxa is a family of Large Language Models (LLMs) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs exhibit incredible performance for high-resource languages such as English, but for Basque and other low-resource languages their performance is close to that of a random guesser. These limitations widen the gap between high- and low-resource languages when it comes to digital development. We present Latxa to overcome these limitations and promote the development of LLM-based technology and research for the Basque language. Latxa models follow the same architecture as their original counterparts and were further trained on EusCrawl v1 ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)), a high-quality Basque corpus.
28
+
29
+ The models are released in three sizes: 7B, 13B and 70B.
30
+
31
+
32
+
33
+ * **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
34
+ * **Model type:** Language model
35
+ * **Language(s) (NLP):** en, eu
36
+ * **License:** llama2
37
+ * **Parent Model:** meta-llama/Llama-2-13b
38
+ * **Contact:** [email protected]
39
+
40
+
41
+ ## **Getting started**
42
+
43
+ Use the code below to get started with the model.
44
+
45
+ ```python
46
+
47
+ from transformers import pipeline
48
+
49
+ pipe = pipeline("text-generation", model="HiTZ/latxa-13b-v1")
50
+
51
+ text = "Euskara adimen artifizialera iritsi da!"
52
+
53
+ pipe(text, max_new_tokens=50, num_beams=5)
54
+
55
+ # >> [
56
+ #   {
57
+ #     'generated_text': 'Euskara adimen artifizialera iritsi da!\nEuskararen eta adimen artifizialaren arteko harremana aspaldikoa da,'
58
+ #                       ' baina azken urteotan aurrerapauso handiak eman dira arlo horretan'
59
+ #   }
60
+ # ]
61
+
62
+ ```
63
+
64
+
65
+ # **Uses**
66
+
67
+ Latxa models are intended to be used with Basque data; performance is not guaranteed for any other language. Like the original models, Latxa inherits the [LLaMA-2 License](https://ai.meta.com/llama/license/), which allows for commercial and research use.
68
+
69
+
70
+ ## **Direct Use**
71
+
72
+ Latxa family models are pre-trained LLMs without any task-specific or instruction fine-tuning. That is, the model can either be prompted to perform a specific task or further fine-tuned for specific use cases.
73
+
74
+
75
+ ## **Out-of-Scope Use**
76
+
77
+ The model was not fine-tuned to follow instructions or to work as a chat assistant, so this kind of usage is neither tested nor recommended.
78
+
79
+
80
+ # **Bias, Risks, and Limitations**
81
+
82
+ In an effort to reduce potentially disturbing or harmful content, Latxa has been trained on carefully selected and processed data, which comes mainly from local media, national/regional newspapers, encyclopedias and blogs (see the Training Data section below). Still, the model is based on LLaMA models and can potentially carry the same biases, risks and limitations.
83
+
84
+ Please see LLaMA’s _Ethical Considerations and Limitations_ for further information.
85
+
86
+
87
+ # **Training Details**
88
+
89
+
90
+ ## **Training Data**
91
+
92
+ The models were trained on EusCrawl v1, a high-quality corpus for Basque comprising 1.72M documents, 288M words, totalling 2.1GiB of uncompressed text. EusCrawl was built using ad-hoc scrapers to extract text from 33 Basque websites with high-quality content, resulting in cleaner text compared to general-purpose approaches.
93
+
94
+ See more details in the [EusCrawl](https://huggingface.co/datasets/HiTZ/euscrawl) dataset card.
95
+
96
+ Additionally, 100K documents of English data randomly selected from the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset were included to avoid catastrophic forgetting.
97
+
98
+
99
+ ## **Training Procedure**
100
+
101
+ The models were trained using the GPT-NeoX library on the CINECA HPC cluster. All models were trained with an effective batch size of approximately 2M tokens for 1000 to 2000 steps.
102
+
103
+
104
+ <table>
105
+ <tr>
106
+ <td>Model
107
+ </td>
108
+ <td>Steps
109
+ </td>
110
+ <td>Sequence length
111
+ </td>
112
+ <td>Effective Batch size
113
+ </td>
114
+ <td>Total tokens
115
+ </td>
116
+ <td>GPU hours
117
+ </td>
118
+ </tr>
119
+ <tr>
120
+ <td>Latxa 7B
121
+ </td>
122
+ <td><p style="text-align: right">
123
+ 2000</p>
124
+
125
+ </td>
126
+ <td><p style="text-align: right">
127
+ 4096</p>
128
+
129
+ </td>
130
+ <td><p style="text-align: right">
131
+ 2M tokens/step</p>
132
+
133
+ </td>
134
+ <td><p style="text-align: right">
135
+ 4B</p>
136
+
137
+ </td>
138
+ <td><p style="text-align: right">
139
+ 359.2h</p>
140
+
141
+ </td>
142
+ </tr>
143
+ <tr>
144
+ <td>Latxa 13B
145
+ </td>
146
+ <td><p style="text-align: right">
147
+ 1000</p>
148
+
149
+ </td>
150
+ <td><p style="text-align: right">
151
+ 4096</p>
152
+
153
+ </td>
154
+ <td><p style="text-align: right">
155
+ 2M tokens/step</p>
156
+
157
+ </td>
158
+ <td><p style="text-align: right">
159
+ 2B</p>
160
+
161
+ </td>
162
+ <td><p style="text-align: right">
163
+ 468.8h</p>
164
+
165
+ </td>
166
+ </tr>
167
+ <tr>
168
+ <td>Latxa 70B
169
+ </td>
170
+ <td><p style="text-align: right">
171
+ 1680</p>
172
+
173
+ </td>
174
+ <td><p style="text-align: right">
175
+ 4096</p>
176
+
177
+ </td>
178
+ <td><p style="text-align: right">
179
+ 2M tokens/step</p>
180
+
181
+ </td>
182
+ <td><p style="text-align: right">
183
+ 3.4B</p>
184
+
185
+ </td>
186
+ <td><p style="text-align: right">
187
+ *6475.52h</p>
188
+
189
+ </td>
190
+ </tr>
191
+ </table>
192
+
193
+
194
+ * indicates the time for the entire training process (2000 steps); however, the weights from step 1680 are the ones released, as it is the best checkpoint according to validation loss.
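The "Total tokens" column follows from steps × effective batch size. The sketch below reproduces that arithmetic; it assumes "2M tokens/step" means 2,000,000 tokens (a decimal reading that the card does not spell out), and the function name is illustrative only.

```python
# Steps per model, copied from the training table above.
STEPS = {"Latxa 7B": 2000, "Latxa 13B": 1000, "Latxa 70B": 1680}

# Assumption: "2M tokens/step" is read as 2,000,000 tokens per optimizer step.
TOKENS_PER_STEP = 2_000_000


def total_tokens_billions(steps: int, tokens_per_step: int = TOKENS_PER_STEP) -> float:
    """Return the number of training tokens seen, in billions."""
    return steps * tokens_per_step / 1e9


if __name__ == "__main__":
    for name, steps in STEPS.items():
        print(f"{name}: {total_tokens_billions(steps):.2f}B tokens")
```

Under this assumption the 70B checkpoint at step 1680 has seen about 3.36B tokens, which matches the table's 3.4B after rounding.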
195
+
196
+
197
+ # **Evaluation**
198
+
199
+ We evaluated the models in zero-shot and few-shot settings on generative, multiple-choice and classification tasks. We used the Basque partitions of each dataset.
200
+
201
+
202
+ ## **Testing Data, Factors & Metrics**
203
+
204
+
205
+ ### **Testing Data**
206
+
207
+
208
+
209
+ * **Belebele** ([Bandarkar et al.](https://arxiv.org/abs/2308.16884)): Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. We evaluated the model in a 5-shot fashion.
210
+ * Data card: [https://huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele)
211
+ * **X-StoryCloze** ([Lin et al.](https://arxiv.org/abs/2112.10668)): X-StoryCloze consists of professional translations of the English StoryCloze dataset into 10 non-English languages. StoryCloze is a commonsense reasoning dataset which consists of choosing the correct ending to a four-sentence story. We evaluated the model in a 0-shot fashion.
212
+ * Data card: [https://huggingface.co/datasets/juletxara/xstory_cloze](https://huggingface.co/datasets/juletxara/xstory_cloze)
213
+ * **BasqueGLUE** ([Urbizu et al.](https://aclanthology.org/2022.lrec-1.172.pdf)): BasqueGLUE is an NLU benchmark for Basque. We evaluated the model in a 5-shot fashion on the tasks listed below.
214
+ * Data card: [https://huggingface.co/datasets/orai-nlp/basqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE)
215
+ * Tasks:
216
+ * **BEC2016eu**: Sentiment analysis on tweets about the 2016 Basque elections campaign.
217
+ * **VaxxStance**: Stance detection on tweets around the anti-vaccine movement.
218
+ * **BHTCv2**: Topic classification of news extracts with 12 categories.
219
+ * **EpecKorrefBin**: Coreference detection task similar to WSC.
220
+ * **QNLIeu**: Q&A NLI built from the Basque Wikipedia.
221
+ * **WiCeu**: Basque Word-in-Context task.
222
+
223
+
224
+ ### **Metrics**
225
+
226
+
227
+
228
+ * **Accuracy**: Belebele, X-StoryCloze, EpecKorrefBin, QNLIeu, and WiCeu
229
+ * **Micro F1**: BEC2016eu and BHTCv2
230
+ * **Macro F1**: VaxxStance (favor & against)
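To make the difference between the two F1 variants concrete, here is a minimal pure-Python sketch. It is not the evaluation harness code: the function names and the toy labels are hypothetical, and restricting the macro average to the favor and against classes is one reading of "(favor & against)" above.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def per_class_counts(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> Tuple[Counter, Counter, Counter]:
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    return tp, fp, fn


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Pool the counts over all classes, then compute a single F1."""
    tp, fp, fn = per_class_counts(gold, pred)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable],
             labels: Optional[Sequence[Hashable]] = None) -> float:
    """Compute F1 per class, then average over `labels` with equal weight."""
    tp, fp, fn = per_class_counts(gold, pred)
    if labels is None:
        labels = list(set(gold) | set(pred))
    return sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)


if __name__ == "__main__":
    # Toy stance labels, for illustration only.
    gold = ["favor", "against", "none", "favor", "against", "none"]
    pred = ["favor", "none", "none", "against", "against", "none"]
    print("micro F1:", micro_f1(gold, pred))
    print("macro F1 (favor & against):", macro_f1(gold, pred, labels=["favor", "against"]))
```

For single-label classification, micro F1 pooled over all classes equals plain accuracy, whereas macro F1 weights every class equally regardless of how often it occurs.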
231
+
232
+
233
+ ## **Results**
234
+
235
+ The model was evaluated using the LM Evaluation Harness library from EleutherAI. To reproduce our results, please refer to our [fork](https://github.com/naiarapm/lm-evaluation-harness/tree/basqueglue), which includes the implementations for the datasets mentioned above.
236
+
237
+
238
+ <table>
239
+ <tr>
240
+ <td><strong>Model</strong>
241
+ </td>
242
+ <td><strong>Belebele</strong>
243
+ </td>
244
+ <td><strong>X-StoryCloze</strong>
245
+ </td>
246
+ <td><strong>BEC</strong>
247
+ </td>
248
+ <td><strong>Vaxx</strong>
249
+ </td>
250
+ <td><strong>BHTC</strong>
251
+ </td>
252
+ <td><strong>coref</strong>
253
+ </td>
254
+ <td><strong>QNLI</strong>
255
+ </td>
256
+ <td><strong>WiC</strong>
257
+ </td>
258
+ <td><strong>Average</strong>
259
+ </td>
260
+ </tr>
261
+ <tr>
262
+ <td>Random
263
+ </td>
264
+ <td>25.00
265
+ </td>
266
+ <td>50.00
267
+ </td>
268
+ <td>33.33
269
+ </td>
270
+ <td>33.33
271
+ </td>
272
+ <td>8.33
273
+ </td>
274
+ <td>50.00
275
+ </td>
276
+ <td>50.00
277
+ </td>
278
+ <td>50.00
279
+ </td>
280
+ <td>37.50
281
+ </td>
282
+ </tr>
283
+ <tr>
284
+ <td>LLaMA 2 7B
285
+ </td>
286
+ <td>26.22
287
+ </td>
288
+ <td>50.43
289
+ </td>
290
+ <td>41.63
291
+ </td>
292
+ <td>18.60
293
+ </td>
294
+ <td>20.06
295
+ </td>
296
+ <td>50.94
297
+ </td>
298
+ <td>48.32
299
+ </td>
300
+ <td>49.64
301
+ </td>
302
+ <td>38.23
303
+ </td>
304
+ </tr>
305
+ <tr>
306
+ <td>LLaMA 2 13B
307
+ </td>
308
+ <td>32.00
309
+ </td>
310
+ <td>50.63
311
+ </td>
312
+ <td>41.09
313
+ </td>
314
+ <td>18.25
315
+ </td>
316
+ <td>27.35
317
+ </td>
318
+ <td>49.23
319
+ </td>
320
+ <td>48.74
321
+ </td>
322
+ <td>49.21
323
+ </td>
324
+ <td>39.56
325
+ </td>
326
+ </tr>
327
+ <tr>
328
+ <td>LLaMA 2 70B
329
+ </td>
330
+ <td>33.56
331
+ </td>
332
+ <td>51.62
333
+ </td>
334
+ <td>47.47
335
+ </td>
336
+ <td>21.01
337
+ </td>
338
+ <td>31.01
339
+ </td>
340
+ <td>52.98
341
+ </td>
342
+ <td>51.26
343
+ </td>
344
+ <td>51.57
345
+ </td>
346
+ <td>42.56
347
+ </td>
348
+ </tr>
349
+ <tr>
350
+ <td>BLOOM 7B
351
+ </td>
352
+ <td>27.00
353
+ </td>
354
+ <td>57.18
355
+ </td>
356
+ <td>37.94
357
+ </td>
358
+ <td>20.72
359
+ </td>
360
+ <td>39.10
361
+ </td>
362
+ <td>48.21
363
+ </td>
364
+ <td>47.48
365
+ </td>
366
+ <td>47.57
367
+ </td>
368
+ <td>40.65
369
+ </td>
370
+ </tr>
371
+ <tr>
372
+ <td>XGLM 7B
373
+ </td>
374
+ <td>23.88
375
+ </td>
376
+ <td>57.71
377
+ </td>
378
+ <td>39.94
379
+ </td>
380
+ <td>21.58
381
+ </td>
382
+ <td>36.73
383
+ </td>
384
+ <td>50.94
385
+ </td>
386
+ <td>50.42
387
+ </td>
388
+ <td>49.21
389
+ </td>
390
+ <td>41.30
391
+ </td>
392
+ </tr>
393
+ <tr>
394
+ <td><strong>Latxa 7B</strong>
395
+ </td>
396
+ <td>35.67
397
+ </td>
398
+ <td>63.13
399
+ </td>
400
+ <td>55.61
401
+ </td>
402
+ <td>45.93
403
+ </td>
404
+ <td>44.44
405
+ </td>
406
+ <td>50.43
407
+ </td>
408
+ <td>55.04
409
+ </td>
410
+ <td>50.14
411
+ </td>
412
+ <td>50.05
413
+ </td>
414
+ </tr>
415
+ <tr>
416
+ <td><strong>Latxa 13B</strong>
417
+ </td>
418
+ <td>53.56
419
+ </td>
420
+ <td>65.85
421
+ </td>
422
+ <td>53.23
423
+ </td>
424
+ <td>48.66
425
+ </td>
426
+ <td><strong>53.61</strong>
427
+ </td>
428
+ <td>62.52
429
+ </td>
430
+ <td>57.14
431
+ </td>
432
+ <td>54.21
433
+ </td>
434
+ <td>56.10
435
+ </td>
436
+ </tr>
437
+ <tr>
438
+ <td><strong>Latxa 70B</strong>
439
+ </td>
440
+ <td><strong>71.78</strong>
441
+ </td>
442
+ <td><strong>67.57</strong>
443
+ </td>
444
+ <td><strong>63.52</strong>
445
+ </td>
446
+ <td><strong>48.95</strong>
447
+ </td>
448
+ <td>49.51
449
+ </td>
450
+ <td><strong>79.90</strong>
451
+ </td>
452
+ <td><strong>58.82</strong>
453
+ </td>
454
+ <td><strong>55.50</strong>
455
+ </td>
456
+ <td><strong>61.94</strong>
457
+ </td>
458
+ </tr>
459
+ </table>
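The "Average" column appears to be the unweighted mean of the eight task scores in each row. The sketch below checks that reading against a few rows copied from the table; the helper name is hypothetical.

```python
# Task scores per row, in table order:
# Belebele, X-StoryCloze, BEC, Vaxx, BHTC, coref, QNLI, WiC.
ROWS = {
    "Random": [25.00, 50.00, 33.33, 33.33, 8.33, 50.00, 50.00, 50.00],
    "Latxa 7B": [35.67, 63.13, 55.61, 45.93, 44.44, 50.43, 55.04, 50.14],
    "Latxa 13B": [53.56, 65.85, 53.23, 48.66, 53.61, 62.52, 57.14, 54.21],
    "Latxa 70B": [71.78, 67.57, 63.52, 48.95, 49.51, 79.90, 58.82, 55.50],
}

# Values of the "Average" column as reported in the table.
REPORTED_AVERAGE = {"Random": 37.50, "Latxa 7B": 50.05, "Latxa 13B": 56.10, "Latxa 70B": 61.94}


def unweighted_mean(scores: list) -> float:
    """Average the task scores, giving every task the same weight."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, scores in ROWS.items():
        print(f"{name}: computed {unweighted_mean(scores):.2f}, reported {REPORTED_AVERAGE[name]:.2f}")
```

Because every task counts equally, a large gain on a single task (for example coref for Latxa 70B) moves the average as much as the same gain on any other task.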
460
+
461
+
462
+
463
+ # **Environmental Impact**
464
+
465
+ Carbon emissions are estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
466
+
467
+
468
+
469
+ * **Hardware Type:** HPC Cluster, 4x A100 64GB nodes
470
+ * **Hours used:** 359.2h + 468.8h + 6475.52h = 7303.52h
471
+ * **Compute cluster:** CINECA HPC
472
+ * **Compute Region:** Italy
473
+ * **Carbon Emitted:** 673.75kg CO<sub>2</sub> eq
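The hours total above can be re-derived from the per-model figures, and dividing the reported emissions by it gives an implied emission rate per hour. This is only a ratio of the numbers in the card: the power draw and carbon intensity actually entered into the calculator are not stated here, and the function names are hypothetical.

```python
# Hours per model, copied from the training table and the list above.
HOURS = {"Latxa 7B": 359.2, "Latxa 13B": 468.8, "Latxa 70B": 6475.52}

# Reported total emissions in kg CO2 eq.
REPORTED_KG_CO2EQ = 673.75


def total_hours() -> float:
    """Sum the hours across the three training runs."""
    return sum(HOURS.values())


def implied_kg_per_hour() -> float:
    """Reported emissions divided by total hours (a derived ratio, not a calculator input)."""
    return REPORTED_KG_CO2EQ / total_hours()


if __name__ == "__main__":
    print(f"total hours: {total_hours():.2f}")
    print(f"implied rate: {implied_kg_per_hour():.4f} kg CO2 eq per hour")
```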
474
+
475
+
476
+ # **Acknowledgements**
477
+
478
+ This work has been partially supported by the Basque Government (IKER-GAITU project). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2023E01-013.
latxa.jpeg ADDED
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "unk_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
3
+ size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "bos_token": {
5
+ "__type": "AddedToken",
6
+ "content": "<s>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "clean_up_tokenization_spaces": false,
13
+ "eos_token": {
14
+ "__type": "AddedToken",
15
+ "content": "</s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "legacy": false,
22
+ "model_max_length": 1000000000000000019884624838656,
23
+ "pad_token": null,
24
+ "padding_side": "right",
25
+ "sp_model_kwargs": {},
26
+ "tokenizer_class": "LlamaTokenizer",
27
+ "unk_token": {
28
+ "__type": "AddedToken",
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false
34
+ }
35
+ }
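Two fields in this config are easy to misread: `model_max_length` is not a real limit but the value of `int(1e30)`, which Transformers uses as a "no limit configured" placeholder, and `pad_token` is `null`, so batched inputs need a padding token to be chosen by the caller. The sketch below parses an abridged copy of the config with the standard library only; the `summarize` helper is hypothetical and not part of any library.

```python
import json

# Abridged copy of tokenizer_config.json (the token definitions are omitted).
CONFIG_JSON = """
{
  "add_bos_token": true,
  "add_eos_token": false,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "padding_side": "right",
  "tokenizer_class": "LlamaTokenizer"
}
"""

# int(1e30) is the placeholder written when no maximum length is configured.
NO_LIMIT_SENTINEL = int(1e30)


def summarize(config: dict) -> dict:
    """Report the settings that matter most when batching or truncating inputs."""
    return {
        "has_length_limit": config["model_max_length"] != NO_LIMIT_SENTINEL,
        "has_pad_token": config.get("pad_token") is not None,
        "prepends_bos": bool(config.get("add_bos_token")),
        "appends_eos": bool(config.get("add_eos_token")),
    }


if __name__ == "__main__":
    print(summarize(json.loads(CONFIG_JSON)))
```

Since no length limit is configured, truncation has to be requested explicitly with a maximum length; 4096, the sequence length used in training, is a natural candidate.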