---
library_name: transformers
license: cc-by-nc-4.0
language:
- myv
- ru
- ar
- en
- et
- fr
- de
- kk
- ch
- zh
- mn
- es
- tr
- uk
- uz
base_model:
- facebook/nllb-200-distilled-600M
datasets:
- slone/myv_ru_2022
- slone/e-mordovia-articles-2023
pipeline_tag: translation
---

# Model Card for NLLB-with-myv-v2024 (a translation model for Erzya)

This is a version of the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) machine translation model
with one added language: Erzya (the new language code is `myv_Cyrl`).
It can probably translate between all 202 NLLB languages, but it was fine-tuned with a focus on Erzya, Russian and, to a lesser extent,
on Arabic, English, Estonian, Finnish, French, German, Kazakh, Mandarin, Mongolian, Spanish, Turkish, Ukrainian, and Uzbek.


## Model Details

### Model Description


- **Developed by:** Isai Gordeev, Sergey Kuldin, and David Dale
- **Model type:** Encoder-decoder transformer
- **Language(s) (NLP):** Erzya, Russian, and all the 202 NLLB languages.
- **License:** CC-BY-NC-4.0
- **Finetuned from model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)

### Model Sources


- **Repository:** will be published later
- **Paper:** will be published later
- **Demo:** https://lango.to/ (it is powered by a similar model)

## Uses

### Direct Use
Translation between Erzya, Russian, and potentially other languages. The model appears to be state-of-the-art for translation into Erzya.

### Out-of-Scope Use
Translation between other NLLB languages that do not include Erzya as the source or target.

## Bias, Risks, and Limitations
The model does not produce the most fluent translations into Russian and other high-resource languages.

Its translations into Erzya seem to be better than those of any other available system, but they may still be inaccurate or ungrammatical,
so they should always be reviewed manually before any high-responsibility use.

### Recommendations
Please contact the authors for any substantial recommendation.

## How to Get Started with the Model

See the NLLB generation code: https://huggingface.co/docs/transformers/v4.44.2/en/model_doc/nllb#generating-with-nllb.
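As a sketch, translation with the standard `transformers` NLLB generation API looks roughly like this. The repo id below is a placeholder (substitute this model's actual Hub id or a local path), and note that `myv_Cyrl` exists only in this model's extended tokenizer, not in the base NLLB checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder: substitute this model's actual Hub repo id or a local path.
MODEL_ID = "path/to/this-model"

def translate(text, src_lang="rus_Cyrl", tgt_lang="myv_Cyrl", max_length=128):
    """Translate `text` between NLLB language codes (`myv_Cyrl` is the new Erzya code)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # force the decoder to start with the target-language token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

`convert_tokens_to_ids` is used here rather than the tokenizer's old `lang_code_to_id` attribute, which recent `transformers` versions have removed.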

## Training Details

### Training Data

- https://huggingface.co/datasets/slone/myv_ru_2022
- https://huggingface.co/datasets/slone/e-mordovia-articles-2023

### Training Procedure


#### Preprocessing

The preprocessing code is adapted from the Stopes repo of the NLLB team:
https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

It performs punctuation normalization, nonprintable character removal, and Unicode normalization.
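As an illustration only (the actual logic lives in the Stopes script linked above and is more thorough), these three steps can be sketched as:

```python
import re
import unicodedata

# Rough sketch of the three preprocessing steps: punctuation normalization,
# nonprintable-character removal, and Unicode normalization.
PUNCT_MAP = {
    "\u00ab": '"', "\u00bb": '"',   # guillemets
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # dashes
    "\u2026": "...",                # ellipsis
}

def clean_line(text: str) -> str:
    text = unicodedata.normalize("NFC", text)      # Unicode normalization
    for src, tgt in PUNCT_MAP.items():             # punctuation normalization
        text = text.replace(src, tgt)
    # drop nonprintable characters (Unicode category C*), keeping whitespace
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace
```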

#### Training Hyperparameters

The tokenizer of the model was updated with 6209 new Erzya tokens. Each new token's embedding was initialized as the average of the embeddings of the old tokens into which it decomposes.

- training regime: `fp32`
- batch_size: 6
- grad_acc_steps: 4
- max_length: 128
- optimizer: Adafactor
- lr: 1e-4
- clip_threshold=1.0
- weight_decay: 1e-3
- warmup_steps: 3_000 (with a linear warmup from 0)
- training_steps: 220_000
- weight_loss_coef: 100 (a coefficient for the additional penalty, MSE between the embeddings of old tokens and their values for NLLB-200)
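A toy sketch of the two embedding-related tricks described above: averaging old-token embeddings to initialize new tokens, and an MSE penalty that keeps old-token embeddings close to their original NLLB-200 values. The shapes and decompositions are made up for illustration; this is not the authors' training code.

```python
import torch

vocab_old, vocab_new, dim = 10, 3, 8
emb_old = torch.randn(vocab_old, dim)       # pretrained (NLLB-200) embeddings

# hypothetical decomposition: new token i -> ids of the old tokens it combines
decomposition = [[1, 2], [3], [0, 4, 5]]
emb_init_new = torch.stack([emb_old[ids].mean(0) for ids in decomposition])
emb = torch.cat([emb_old, emb_init_new])    # extended embedding matrix

emb_ref = emb_old.clone()                   # frozen copy of the original values
weight_loss_coef = 100.0

def anchor_penalty(emb_current):
    # extra loss term: MSE between current old-token embeddings
    # and their original NLLB-200 values
    return weight_loss_coef * torch.nn.functional.mse_loss(
        emb_current[:vocab_old], emb_ref
    )
```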


## Technical Specifications

### Model Architecture and Objective

A standard encoder-decoder translation model with cross-entropy loss. 

### Compute Infrastructure

Google Colab with a T4 GPU.

```
pip install --upgrade sentencepiece transformers==4.40 datasets sacremoses editdistance sacrebleu razdel ctranslate2
```

## Model Card Contact

@cointegrated