
opus-mt-tc-bible-big-fiu-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from Finno-Ugrian languages (fiu) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models are originally trained with Marian NMT, an efficient NMT framework written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<<, where id is a valid target language ID, e.g. >>deu<< for German.
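For this model the valid target tokens should be >>deu<<, >>eng<<, >>fra<<, >>por<< and >>spa<<. As a minimal sketch (assuming the transformers and sentencepiece packages are installed), the tokens accepted by the tokenizer can also be listed programmatically:

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-tc-bible-big-fiu-deu_eng_fra_por_spa"
)

# MarianTokenizer keeps the >>id<< language tokens in its vocabulary
# and exposes them via this property.
print(tokenizer.supported_language_codes)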

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a target-language token
# (>>deu<<, >>spa<<, ...) that selects the output language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-fiu-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch and generate translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipeline API, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-fiu-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
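The translation pipeline returns one dictionary per input sentence, with the output string stored under the translation_text key.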

Training

As noted under Model Details, training data is taken from OPUS and the model was originally trained with Marian NMT following the procedures of OPUS-MT-train.

Evaluation

langpair  testset                   chr-F    BLEU  #sent  #words
est-deu   tatoeba-test-v2021-08-07  0.69451  53.9    244    1611
est-eng   tatoeba-test-v2021-08-07  0.72437  58.2   1359    8811
fin-deu   tatoeba-test-v2021-08-07  0.66025  47.3   2647   19163
fin-eng   tatoeba-test-v2021-08-07  0.69685  53.7  10690   80552
fin-fra   tatoeba-test-v2021-08-07  0.65900  48.3   1920   12193
fin-por   tatoeba-test-v2021-08-07  0.72250  54.0    477    3021
fin-spa   tatoeba-test-v2021-08-07  0.69600  52.1   2513   16912
hun-deu   tatoeba-test-v2021-08-07  0.62418  41.1  15342  127344
hun-eng   tatoeba-test-v2021-08-07  0.65626  48.7  13037   94699
hun-fra   tatoeba-test-v2021-08-07  0.66840  50.3   2494   16914
hun-por   tatoeba-test-v2021-08-07  0.65281  43.1   2500   16563
hun-spa   tatoeba-test-v2021-08-07  0.67467  48.7   2500   16670
est-deu   flores101-devtest         0.55353  25.7   1012   25094
est-eng   flores101-devtest         0.61930  34.7   1012   24721
est-fra   flores101-devtest         0.58199  31.3   1012   28343
est-por   flores101-devtest         0.54388  26.5   1012   26519
fin-eng   flores101-devtest         0.59914  32.2   1012   24721
fin-por   flores101-devtest         0.55156  27.1   1012   26519
hun-eng   flores101-devtest         0.61198  33.5   1012   24721
hun-fra   flores101-devtest         0.57776  30.8   1012   28343
hun-por   flores101-devtest         0.56263  28.4   1012   26519
hun-spa   flores101-devtest         0.49140  20.7   1012   29199
est-deu   flores200-devtest         0.55825  26.3   1012   25094
est-eng   flores200-devtest         0.62404  35.4   1012   24721
est-fra   flores200-devtest         0.58580  31.7   1012   28343
est-por   flores200-devtest         0.55070  27.3   1012   26519
est-spa   flores200-devtest         0.50188  21.5   1012   29199
fin-deu   flores200-devtest         0.54281  24.0   1012   25094
fin-eng   flores200-devtest         0.60642  33.1   1012   24721
fin-fra   flores200-devtest         0.57540  30.5   1012   28343
fin-por   flores200-devtest         0.55497  27.4   1012   26519
fin-spa   flores200-devtest         0.49847  21.4   1012   29199
hun-deu   flores200-devtest         0.55180  25.1   1012   25094
hun-eng   flores200-devtest         0.61466  34.0   1012   24721
hun-fra   flores200-devtest         0.57670  30.6   1012   28343
hun-por   flores200-devtest         0.56510  28.9   1012   26519
hun-spa   flores200-devtest         0.49681  21.3   1012   29199
hun-deu   newssyscomb2009           0.49819  17.9    502   11271
hun-eng   newssyscomb2009           0.52063  24.4    502   11818
hun-fra   newssyscomb2009           0.51589  22.0    502   12331
hun-spa   newssyscomb2009           0.51508  22.7    502   12503
hun-deu   newstest2008              0.50164  19.0   2051   47447
hun-eng   newstest2008              0.49802  20.4   2051   49380
hun-fra   newstest2008              0.51012  21.6   2051   52685
hun-spa   newstest2008              0.50719  22.3   2051   52586
hun-deu   newstest2009              0.49902  18.6   2525   62816
hun-eng   newstest2009              0.50950  22.3   2525   65399
hun-fra   newstest2009              0.50742  21.6   2525   69263
hun-spa   newstest2009              0.50788  22.2   2525   68111
fin-eng   newstest2015              0.55249  27.0   1370   27270
fin-eng   newstest2016              0.57961  30.7   3000   62945
fin-eng   newstest2017              0.59973  33.2   3002   61846
est-eng   newstest2018              0.59190  31.5   2000   45405
fin-eng   newstest2018              0.52373  24.4   3000   62325
fin-eng   newstest2019              0.57079  30.3   1996   36215
fin-eng   newstestB2017             0.56420  28.9   3002   61846
est-deu   ntrex128                  0.51377  21.4   1997   48761
est-eng   ntrex128                  0.58358  29.9   1997   47673
est-fra   ntrex128                  0.52713  24.9   1997   53481
est-por   ntrex128                  0.50745  22.2   1997   51631
est-spa   ntrex128                  0.54304  27.5   1997   54107
fin-deu   ntrex128                  0.50282  19.8   1997   48761
fin-eng   ntrex128                  0.55545  26.3   1997   47673
fin-fra   ntrex128                  0.50946  22.9   1997   53481
fin-por   ntrex128                  0.50404  21.3   1997   51631
fin-spa   ntrex128                  0.52641  25.5   1997   54107
hun-deu   ntrex128                  0.49322  18.5   1997   48761
hun-eng   ntrex128                  0.52964  23.3   1997   47673
hun-fra   ntrex128                  0.49800  21.8   1997   53481
hun-por   ntrex128                  0.48941  20.5   1997   51631
hun-spa   ntrex128                  0.51123  24.2   1997   54107
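Corpus-level chr-F and BLEU scores of this kind are typically computed with the sacrebleu library; the exact evaluation setup is part of the OPUS-MT-train pipelines. A minimal sketch with made-up example data (hyps and refs are hypothetical placeholders, not the actual test sets above):

from sacrebleu.metrics import BLEU, CHRF

# Hypothetical system outputs and reference translations for one
# langpair/testset; real scores use the full test sets listed above.
hyps = ["The cat is sitting on the mat."]
refs = [["The cat sits on the mat."]]  # one inner list per reference stream

print(BLEU().corpus_score(hyps, refs))  # BLEU column
print(CHRF().corpus_score(hyps, refs))  # chr-F column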

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:53:49 EEST 2024
  • port machine: LM0-400-22516.local