Update README.md
README.md
CHANGED
@@ -1,85 +1,58 @@
|
|
1 |
---
|
2 |
license: mit
|
|
|
3 |
---
|
|
|
4 |
# Swahili-English Translation Model
|
5 |
|
6 |
## Model Details
|
7 |
|
8 |
- **Pre-trained Model**: Rogendo/sw-en
|
9 |
-
-
|
10 |
|
11 |
-
- Transformer architecture used
|
12 |
-
- Trained on a corpus of 210,000 sentence pairs
|
13 |
-
- Pre-trained Helsinki-NLP/opus-mt-en-swc
|
14 |
-
- 2 models to enforce bidirectional translation
|
15 |
### Model Description
|
16 |
|
17 |
-
|
18 |
-
|
19 |
-
|
20 |
|
21 |
- **Developed by:** Peter Rogendo, Frederick Kioko
|
22 |
-
- **Model
|
23 |
-
- **
|
24 |
- **License:** Distributed under the MIT License
|
25 |
-
- **Finetuned from model [Helsinki-NLP/opus-mt-en-swc]:** [This pre-trained model was re-trained on Swahili-English sentence pairs collected across Kenya. Swahili is the national language and is among the three most widely spoken languages in Africa. A total of 210,000 sentence pairs were used to train this model.]
|
26 |
|
27 |
-
|
28 |
- **Package**: WikiMatrix.en-sw in Moses format
|
29 |
-
- **Website**: [WikiMatrix](http://opus.nlpl.eu/WikiMatrix-v1.php)
|
30 |
-
- **Release**: v1
|
31 |
-
- **Release Date**: Wed Nov 4 15:07:29 EET 2020
|
32 |
- **License**: [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)
|
33 |
-
- **Citation**: Holger Schwenk
|
34 |
|
35 |
-
- **
|
36 |
- **Package**: ParaCrawl.en-sw in Moses format
|
37 |
-
- **Website**: [ParaCrawl](http://opus.nlpl.eu/ParaCrawl-v9.php)
|
38 |
-
- **Release**: v9
|
39 |
-
- **Release Date**: Fri Mar 25 12:20:25 EET 2022
|
40 |
- **License**: [CC0](http://paracrawl.eu/download.html)
|
41 |
-
- **Acknowledgement**: Please acknowledge the ParaCrawl project at [ParaCrawl](http://paracrawl.eu)
|
42 |
|
43 |
-
- **
|
44 |
- **Package**: tico-19.en-sw in Moses format
|
45 |
-
- **Website**: [TICO-19](http://opus.nlpl.eu/tico-19-v2020-10-28.php)
|
46 |
-
- **Release**: v2020-10-28
|
47 |
-
- **Release Date**: Wed Oct 28 08:44:31 EET 2020
|
48 |
- **License**: [CC0](https://tico-19.github.io/LICENSE.md)
|
49 |
-
- **Citation**: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS.
|
50 |
-
|
51 |
-
## Model Description
|
52 |
-
|
53 |
-
- **Developed By**: Bildad Otieno
|
54 |
-
- **Model Type**: Transformer
|
55 |
-
- **Language(s)**: Swahili and English
|
56 |
-
- **License**: Distributed under the MIT License
|
57 |
-
- **Training Data**: The model was fine-tuned using a collection of datasets from OPUS, including WikiMatrix, ParaCrawl, and TICO-19. The datasets provide a diverse range of translation examples between Swahili and English.
|
58 |
-
|
59 |
-
# Use a pipeline as a high-level helper
|
60 |
-
|
61 |
-
from transformers import pipeline
|
62 |
-
|
63 |
-
# Initialize the translation pipeline
|
64 |
-
translator = pipeline("translation", model="Bildad/Swahili-English_Translation")
|
65 |
-
|
66 |
-
# Translate text
|
67 |
-
translation = translator("Habari yako?")[0]
|
68 |
-
translated_text = translation["translation_text"]
|
69 |
-
|
70 |
-
print(translated_text)
|
71 |
|
72 |
-
|
73 |
|
74 |
-
|
75 |
-
|
76 |
-
tokenizer = AutoTokenizer.from_pretrained("Bildad/Swahili-English_Translation")
|
77 |
-
model = AutoModelForSeq2SeqLM.from_pretrained("Bildad/Swahili-English_Translation")
|
78 |
|
79 |
-
|
|
|
80 |
|
81 |
-
|
|
|
82 |
|
83 |
-
|
84 |
|
85 |
|
|
1 |
---
|
2 |
license: mit
|
3 |
+
library_name: transformers
|
4 |
---
|
5 |
+
|
6 |
# Swahili-English Translation Model
|
7 |
|
8 |
## Model Details
|
9 |
|
10 |
- **Pre-trained Model**: Rogendo/sw-en
|
11 |
+
- **Architecture**: Transformer
|
12 |
+
- **Training Data**: Trained on 210,000 Swahili-English corpus pairs
|
13 |
+
- **Base Model**: Helsinki-NLP/opus-mt-en-swc
|
14 |
+
- **Training Method**: Fine-tuned with an emphasis on bidirectional translation between Swahili and English.
|
15 |
|
16 |
### Model Description
|
17 |
|
18 |
+
This Swahili-English translation model was developed to handle translations between Swahili, one of Africa's most spoken languages, and English. It was trained on a diverse dataset sourced from OPUS, leveraging the Transformer architecture for effective translation.
|
19 |
|
20 |
- **Developed by:** Peter Rogendo, Frederick Kioko
|
21 |
+
- **Model Type:** Transformer
|
22 |
+
- **Languages:** Swahili, English
|
23 |
- **License:** Distributed under the MIT License
|
|
|
24 |
|
25 |
+
### Training Data
|
26 |
+
|
27 |
+
The model was fine-tuned on the following datasets:
|
28 |
+
|
29 |
+
- **WikiMatrix:**
|
30 |
- **Package**: WikiMatrix.en-sw in Moses format
|
31 |
- **License**: [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)
|
32 |
+
- **Citation**: Holger Schwenk et al., WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 2019.
|
33 |
|
34 |
+
- **ParaCrawl:**
|
35 |
- **Package**: ParaCrawl.en-sw in Moses format
|
36 |
- **License**: [CC0](http://paracrawl.eu/download.html)
|
37 |
+
- **Acknowledgement**: Please acknowledge the ParaCrawl project at [ParaCrawl](http://paracrawl.eu).
|
38 |
|
39 |
+
- **TICO-19:**
|
40 |
- **Package**: tico-19.en-sw in Moses format
|
41 |
- **License**: [CC0](https://tico-19.github.io/LICENSE.md)
|
42 |
+
- **Citation**: J. Tiedemann, 2012, Parallel Data, Tools, and Interfaces in OPUS.
|
43 |
|
44 |
+
## Usage
|
45 |
|
46 |
+
### Using a Pipeline as a High-Level Helper
|
47 |
|
48 |
+
```python
|
49 |
+
from transformers import pipeline
|
50 |
|
51 |
+
# Initialize the translation pipeline
|
52 |
+
translator = pipeline("translation", model="Bildad/Swahili-English_Translation")
|
53 |
|
54 |
+
# Translate text
|
55 |
+
translation = translator("Habari yako?")[0]
|
56 |
+
translated_text = translation["translation_text"]
|
57 |
|
58 |
+
print(translated_text)
|
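The pipeline snippet above translates a single sentence. As a minimal sketch, and not part of the README change itself, the same pipeline object could also be applied to a list of sentences in small batches. The helper name `translate_all` and the `batch_size` default are illustrative assumptions; the only library behaviour relied on is that a translation pipeline called with a list of strings returns one dict per input, each with a `translation_text` key.

```python
from typing import Callable, Dict, List


def translate_all(
    translator: Callable[[List[str]], List[Dict[str, str]]],
    sentences: List[str],
    batch_size: int = 8,
) -> List[str]:
    """Translate sentences in fixed-size batches (hypothetical helper).

    `translator` is any callable that maps a list of strings to a list of
    dicts with a "translation_text" key, such as a transformers
    translation pipeline.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    results: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # One output dict is expected per input sentence, in input order.
        for item in translator(batch):
            results.append(item["translation_text"])
    return results


# Usage with the real model (downloads the weights on first run):
#   from transformers import pipeline
#   translator = pipeline("translation", model="Bildad/Swahili-English_Translation")
#   print(translate_all(translator, ["Habari yako?", "Asante sana."]))
```

Because the helper accepts any callable, it can be exercised with a stand-in function before the model is downloaded. Batching in the caller keeps memory use predictable for long inputs.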