|
--- |
|
library_name: transformers |
|
license: cc-by-nc-4.0 |
|
language: |
|
- myv |
|
- ru |
|
- ar |
|
- en |
|
- et |
|
- fr |
|
- de |
|
- kk |
|
- fi
|
- zh |
|
- mn |
|
- es |
|
- tr |
|
- uk |
|
- uz |
|
base_model: |
|
- facebook/nllb-200-distilled-600M |
|
datasets: |
|
- slone/myv_ru_2022 |
|
- slone/e-mordovia-articles-2023 |
|
pipeline_tag: translation |
|
--- |
|
|
|
# Model Card for NLLB-with-myv-v2024 (a translation model for Erzya) |
|
|
|
This is a version of the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) machine translation model with one added language: Erzya (the new language code is `myv_Cyrl`). It can probably translate from all 202 NLLB languages, but it was fine-tuned with a focus on Erzya and Russian and, to a lesser extent, on Arabic, English, Estonian, Finnish, French, German, Kazakh, Mandarin, Mongolian, Spanish, Turkish, Ukrainian, and Uzbek.
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Isai Gordeev, Sergey Kuldin, and David Dale
|
- **Model type:** Encoder-decoder transformer |
|
- **Language(s) (NLP):** Erzya, Russian, and all 202 NLLB languages.
|
- **License:** CC-BY-NC-4.0 |
|
- **Finetuned from model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) |
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** will be published later |
|
- **Paper:** will be published later |
|
- **Demo:** https://lango.to/ (powered by a similar model)
|
|
|
## Uses |
|
|
|
### Direct Use |
|
Translation between Erzya, Russian, and potentially other languages. As of its release, the model appears to be the state of the art for translation into Erzya.
|
|
|
### Out-of-Scope Use |
|
Translation between other NLLB languages, in pairs that include Erzya as neither the source nor the target.
|
|
|
## Bias, Risks, and Limitations |
|
The model does not produce the most fluent translations into Russian and other high-resource languages.
|
|
|
Its translations into Erzya seem to be better than those of any other available system, but they may still be inaccurate or ungrammatical, so they should always be reviewed manually before any high-responsibility use.
|
|
|
### Recommendations |
|
Please contact the authors for recommendations on any substantial use of the model.
|
|
|
## How to Get Started with the Model |
|
|
|
See the NLLB generation code: https://huggingface.co/docs/transformers/v4.44.2/en/model_doc/nllb#generating-with-nllb. |
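Below is a minimal usage sketch in the spirit of those docs. The checkpoint id is an assumption (it should be this repository's id); `rus_Cyrl` and `myv_Cyrl` are NLLB-style language codes.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumption: this repository's checkpoint id; adjust if it differs.
MODEL_ID = "slone/nllb-with-myv-v2024"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

def translate(text, src_lang="rus_Cyrl", tgt_lang="myv_Cyrl", max_length=128):
    """Translate one sentence; language codes follow the NLLB convention."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the target-language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("Здравствуйте!"))  # should print an Erzya translation
```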
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- https://huggingface.co/datasets/slone/myv_ru_2022 |
|
- https://huggingface.co/datasets/slone/e-mordovia-articles-2023 |
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing
|
|
|
The preprocessing code is adapted from the Stopes repo of the NLLB team: |
|
https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214 |
|
|
|
It performs punctuation normalization, non-printable character removal, and Unicode normalization.
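The sketch below is an illustrative reimplementation of these three steps, not the verbatim Stopes code; the NFKC normalization form is an assumption, and `sacremoses` is installed by the pip command in the Compute Infrastructure section.

```python
import re
import sys
import unicodedata

from sacremoses import MosesPunctNormalizer

mpn = MosesPunctNormalizer(lang="en")
# Pre-compile the substitution patterns once instead of on every call.
mpn.substitutions = [(re.compile(r), sub) for r, sub in mpn.substitutions]

# Map every non-printable character (Unicode category C*) to a space.
NONPRINT_MAP = {
    ord(c): " "
    for c in map(chr, range(sys.maxunicode + 1))
    if unicodedata.category(c).startswith("C")
}

def preprocess(text: str) -> str:
    text = mpn.normalize(text)           # punctuation normalization
    text = text.translate(NONPRINT_MAP)  # non-printable character removal
    return unicodedata.normalize("NFKC", text)  # Unicode normalization (assumed NFKC)
```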
|
|
|
#### Training Hyperparameters |
|
|
|
The model's tokenizer was extended with 6209 new Erzya tokens. The embedding of each new token was initialized as the average of the embeddings of the old tokens from which it is composed.
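A simplified sketch of this initialization is given below; it assumes the extended tokenizer has already been trained and that its new tokens sit after the original vocabulary.

```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer_old = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Assumption: an extended tokenizer whose new Erzya tokens are appended
# after the original vocabulary (the path is a placeholder).
tokenizer_new = NllbTokenizer.from_pretrained("path/to/extended_tokenizer")

old_size = len(tokenizer_old)
model.resize_token_embeddings(len(tokenizer_new))
embeddings = model.model.shared.weight.data  # shared input/output embeddings

for new_id in range(old_size, len(tokenizer_new)):
    token = tokenizer_new.convert_ids_to_tokens(new_id)
    # Decompose the new token into old subword pieces ("▁" marks a word start).
    old_ids = tokenizer_old(token.replace("▁", " "), add_special_tokens=False).input_ids
    if old_ids:
        embeddings[new_id] = embeddings[old_ids].mean(dim=0)
```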
|
|
|
- training regime: `fp32` |
|
- batch_size: 6 |
|
- grad_acc_steps: 4 |
|
- max_length: 128 |
|
- optimizer: Adafactor |
|
- lr: 1e-4 |
|
- clip_threshold: 1.0
|
- weight_decay: 1e-3 |
|
- warmup_steps: 3_000 (with a linear warmup from 0) |
|
- training_steps: 220_000 |
|
- weight_loss_coef: 100 (a coefficient for an additional penalty: the MSE between the embeddings of the old tokens and their original NLLB-200 values; see the sketch after this list)
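A hedged sketch of how these settings and the embedding penalty could fit together in a single training step (continuing from the initialization sketch above; gradient accumulation and the linear warmup schedule are omitted for brevity):

```python
import torch
from transformers import Adafactor

# Frozen copy of the original NLLB-200 embeddings, used as an anchor;
# `model` and `old_size` come from the initialization sketch above.
ref_embeddings = model.model.shared.weight.data[:old_size].clone()

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
    scale_parameter=False,
    relative_step=False,
)

def training_step(batch):
    # Standard cross-entropy translation loss.
    loss = model(**batch).loss
    # Additional penalty: keep the old tokens' embeddings close to their
    # original NLLB-200 values (weight_loss_coef = 100).
    penalty = torch.nn.functional.mse_loss(
        model.model.shared.weight[:old_size], ref_embeddings
    )
    (loss + 100 * penalty).backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```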
|
|
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
A standard encoder-decoder translation model with cross-entropy loss. |
|
|
|
### Compute Infrastructure |
|
|
|
Google Colab with a T4 GPU. The environment was set up with:
|
|
|
``` |
|
pip install --upgrade sentencepiece transformers==4.40 datasets sacremoses editdistance sacrebleu razdel ctranslate2 |
|
``` |
|
|
|
## Model Card Contact |
|
|
|
@cointegrated |