---
license: mit
language:
- en
- eu
metrics:
- BLEU
- TER
tags:
- text2text-generation
- open-nmt
- pytorch
---

# Itzune v1.9 EN -> EU machine translation argos model

This model was trained using the [argostrain](https://github.com/argosopentech/argos-train) training scripts with 11,542,706 English-Basque parallel sentence pairs extracted from datasets obtained directly from the [Opus project](https://opus.nlpl.eu/).
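Assuming the model is distributed as a standard Argos Translate `.argosmodel` package, it can be loaded through the usual Argos Translate Python API. The sketch below is illustrative only; the local package filename is hypothetical, and the model must be downloaded first:

```python
import argostranslate.package
import argostranslate.translate

# Install the packaged model from a local file
# (hypothetical path to this model's .argosmodel package)
argostranslate.package.install_from_path("translate-en_eu-1_9.argosmodel")

# Translate English -> Basque
print(argostranslate.translate.translate("Hello, how are you?", "en", "eu"))
```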

## Model description

- **Developed by:** argostranslate
- **Model type:** translation
- **Model version:** v1.9
- **Source Language:** English
- **Target Language:** Basque
- **License:** MIT
|
29 |
+
## Training Data
|
30 |
+
|
31 |
+
The English-Basque parallel sentences were collected from the following datasets:
|
32 |
+
|
33 |
+
| Dataset | Sentences before cleaning |
|
34 |
+
|----------------------|--------------------------:|
|
35 |
+
| CCMatrix v1 | 7,788,871 |
|
36 |
+
| OpenSubtitles v2018 | 805,780 |
|
37 |
+
| XLEnt v1.2 | 800,631 |
|
38 |
+
| GNOME v1 | 652,298 |
|
39 |
+
| HPLT v1.1 | 610,694 |
|
40 |
+
| EhuHac v1 | 585,210 |
|
41 |
+
| WikiMatrix v1 | 119,480 |
|
42 |
+
| KDE4 v2 | 100,160 |
|
43 |
+
| wikimedia v20230407 | 60,990 |
|
44 |
+
| bible-uedin v1 | 15,893 |
|
45 |
+
| Tatoeba v2023-04-12 | 2,070 |
|
46 |
+
| Wiktionary | 629 |
|
47 |
+
| **Total** | **11,542,706** |
|
48 |
+
|
49 |
+
### Evaluation results
|
50 |
+
Below are the evaluation results on the machine translation from English to Basque compared to [Google Translate](https://translate.google.com/), [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [mt-hitz-en-eu](https://huggingface.co/HiTZ/mt-hitz-en-eu):
|
51 |
+
|
52 |
+
#### BLEU scores
|
53 |
+
|
54 |
+
| Test set |Google Translate | NLLB 3.3 | mt-hitz-en-eu | itzune 1.9 |
|
55 |
+
|----------------------|-----------------|----------|---------------|------------|
|
56 |
+
| Flores 200 devtest | **20.5** | 13.3 | 19.2 | 17.0 |
|
57 |
+
| TaCON | **12.1** | 9.4 | 8.8 | - |
|
58 |
+
| NTREX | **15.7** | 8.0 | 14.5 | - |
|
59 |
+
| Average | **16.1** | 10.2 | 14.2 | - |
|
60 |
+
|
61 |
+
#### TER scores
|
62 |
+
|
63 |
+
| Test set |Google Translate | NLLB 3.3 | mt-hitz-en-eu | itzune 1.9 |
|
64 |
+
|----------------------|-----------------|----------|---------------|------------|
|
65 |
+
| Flores 200 devtest |**59.5** | 70.4 | 65.0 | 70.1 |
|
66 |
+
| TaCON |**69.5** | 75.3 | 76.8 | - |
|
67 |
+
| NTREX |**65.8** | 81.6 | 66.7 | - |
|
68 |
+
| Average |**64.9** | 75.8 | 68.2 | - |
|
69 |
+
|
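For readers unfamiliar with the two metrics: BLEU rewards n-gram overlap with a reference translation, while TER counts the word-level edits needed to turn the hypothesis into the reference. The sketch below is a minimal illustration, assuming whitespace tokenization, a single reference, no smoothing, and no TER shift operations; the scores above come from proper evaluation tools, not this code:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / sum(cand_counts.values())))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)

def ter(candidate, reference):
    """Word-level edit distance / reference length (shifts omitted)."""
    cand, ref = candidate.split(), reference.split()
    d = list(range(len(ref) + 1))  # rolling Levenshtein row
    for i in range(1, len(cand) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(ref) + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                             # deletion
                d[j - 1] + 1,                         # insertion
                prev + (cand[i - 1] != ref[j - 1]),   # substitution / match
            )
    return 100 * d[len(ref)] / len(ref)
```

A perfect match scores BLEU 100 and TER 0; each substituted word raises TER by 100 divided by the reference length.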