
opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu


Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to Finno-Ugrian languages (fiu).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines follow the procedures of OPUS-MT-train.

Model Description

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<<, where id is a valid target language ID, e.g. >>chm<<.
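
The set of valid target-language tokens can be read off the tokenizer's vocabulary; a minimal sketch, assuming only that target-language tokens follow the >>xxx<< pattern:

from transformers import MarianTokenizer

# Collect the >>xxx<< target-language tokens from the vocabulary.
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu")
lang_tokens = sorted(t for t in tokenizer.get_vocab() if t.startswith(">>") and t.endswith("<<"))
print(lang_tokens)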

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with a target-language token.
src_text = [
    ">>chm<< Replace this with text in an accepted source language.",
    ">>vro<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
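
generate() also accepts the usual decoding options; a hedged sketch reusing the model and tokenizer from above (the parameter values are illustrative, not tuned for this model):

# Beam search with an explicit cap on output length; values are illustrative.
inputs = tokenizer(src_text, return_tensors="pt", padding=True)
translated = model.generate(**inputs, num_beams=4, max_new_tokens=128)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))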

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu")
print(pipe(">>chm<< Replace this with text in an accepted source language."))
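
For larger batches or GPU inference, the model and the tokenized inputs can be moved to the same device. A minimal sketch that falls back to CPU when CUDA is unavailable:

import torch
from transformers import MarianMTModel, MarianTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-fiu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)

src_text = [">>fin<< This is a test sentence."]
batch = tokenizer(src_text, return_tensors="pt", padding=True).to(device)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))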

Training

Evaluation

| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| deu-est | tatoeba-test-v2021-08-07 | 0.76586 | 57.8 | 244 | 1413 |
| deu-fin | tatoeba-test-v2021-08-07 | 0.64286 | 40.7 | 2647 | 15024 |
| deu-hun | tatoeba-test-v2021-08-07 | 0.57007 | 31.2 | 15342 | 105152 |
| eng-est | tatoeba-test-v2021-08-07 | 0.69134 | 50.6 | 1359 | 7992 |
| eng-fin | tatoeba-test-v2021-08-07 | 0.62482 | 37.6 | 10690 | 65122 |
| eng-hun | tatoeba-test-v2021-08-07 | 0.59750 | 35.9 | 13037 | 79562 |
| fra-fin | tatoeba-test-v2021-08-07 | 0.65723 | 45.0 | 1920 | 9730 |
| fra-hun | tatoeba-test-v2021-08-07 | 0.63096 | 40.6 | 2494 | 13753 |
| por-fin | tatoeba-test-v2021-08-07 | 0.76811 | 58.1 | 477 | 2379 |
| por-hun | tatoeba-test-v2021-08-07 | 0.64930 | 42.5 | 2500 | 14063 |
| spa-fin | tatoeba-test-v2021-08-07 | 0.66220 | 43.4 | 2513 | 14131 |
| spa-hun | tatoeba-test-v2021-08-07 | 0.63596 | 42.0 | 2500 | 14599 |
| eng-fin | flores101-devtest | 0.57265 | 21.9 | 1012 | 18781 |
| fra-hun | flores101-devtest | 0.52691 | 21.2 | 1012 | 22183 |
| por-fin | flores101-devtest | 0.53772 | 18.6 | 1012 | 18781 |
| por-hun | flores101-devtest | 0.53275 | 21.8 | 1012 | 22183 |
| spa-est | flores101-devtest | 0.50142 | 15.2 | 1012 | 19788 |
| spa-fin | flores101-devtest | 0.50401 | 13.7 | 1012 | 18781 |
| deu-est | flores200-devtest | 0.55333 | 21.2 | 1012 | 19788 |
| deu-fin | flores200-devtest | 0.54020 | 18.3 | 1012 | 18781 |
| deu-hun | flores200-devtest | 0.53579 | 22.0 | 1012 | 22183 |
| eng-est | flores200-devtest | 0.59496 | 26.1 | 1012 | 19788 |
| eng-fin | flores200-devtest | 0.57811 | 23.1 | 1012 | 18781 |
| eng-hun | flores200-devtest | 0.57670 | 26.7 | 1012 | 22183 |
| fra-est | flores200-devtest | 0.54442 | 21.2 | 1012 | 19788 |
| fra-fin | flores200-devtest | 0.53768 | 18.5 | 1012 | 18781 |
| fra-hun | flores200-devtest | 0.52691 | 21.2 | 1012 | 22183 |
| por-est | flores200-devtest | 0.48227 | 15.6 | 1012 | 19788 |
| por-fin | flores200-devtest | 0.53772 | 18.6 | 1012 | 18781 |
| por-hun | flores200-devtest | 0.53275 | 21.8 | 1012 | 22183 |
| spa-est | flores200-devtest | 0.50142 | 15.2 | 1012 | 19788 |
| spa-fin | flores200-devtest | 0.50401 | 13.7 | 1012 | 18781 |
| spa-hun | flores200-devtest | 0.49444 | 16.4 | 1012 | 22183 |
| deu-hun | newssyscomb2009 | 0.49607 | 18.1 | 502 | 9733 |
| eng-hun | newssyscomb2009 | 0.50580 | 18.3 | 502 | 9733 |
| fra-hun | newssyscomb2009 | 0.49415 | 17.8 | 502 | 9733 |
| spa-hun | newssyscomb2009 | 0.48559 | 16.9 | 502 | 9733 |
| deu-hun | newstest2008 | 0.48855 | 17.2 | 2051 | 41875 |
| eng-hun | newstest2008 | 0.47636 | 15.9 | 2051 | 41875 |
| fra-hun | newstest2008 | 0.48598 | 17.7 | 2051 | 41875 |
| spa-hun | newstest2008 | 0.47888 | 17.1 | 2051 | 41875 |
| deu-hun | newstest2009 | 0.48692 | 18.1 | 2525 | 54965 |
| eng-hun | newstest2009 | 0.49507 | 18.4 | 2525 | 54965 |
| fra-hun | newstest2009 | 0.48961 | 18.6 | 2525 | 54965 |
| spa-hun | newstest2009 | 0.48496 | 18.1 | 2525 | 54965 |
| eng-fin | newstest2015 | 0.56896 | 22.8 | 1370 | 19735 |
| eng-fin | newstest2016 | 0.57934 | 24.3 | 3000 | 47678 |
| eng-fin | newstest2017 | 0.60204 | 26.5 | 3002 | 45269 |
| eng-est | newstest2018 | 0.56276 | 23.8 | 2000 | 36269 |
| eng-fin | newstest2018 | 0.52953 | 17.4 | 3000 | 44836 |
| eng-fin | newstest2019 | 0.55882 | 24.2 | 1997 | 38369 |
| eng-fin | newstestALL2016 | 0.57934 | 24.3 | 3000 | 47678 |
| eng-fin | newstestALL2017 | 0.60204 | 26.5 | 3002 | 45269 |
| eng-fin | newstestB2016 | 0.54388 | 19.9 | 3000 | 45766 |
| eng-fin | newstestB2017 | 0.56369 | 22.6 | 3002 | 45506 |
| deu-est | ntrex128 | 0.51761 | 18.6 | 1997 | 38420 |
| deu-fin | ntrex128 | 0.50759 | 15.5 | 1997 | 35701 |
| deu-hun | ntrex128 | 0.46171 | 15.6 | 1997 | 44462 |
| eng-est | ntrex128 | 0.57099 | 24.4 | 1997 | 38420 |
| eng-fin | ntrex128 | 0.53413 | 18.5 | 1997 | 35701 |
| eng-hun | ntrex128 | 0.47342 | 16.6 | 1997 | 44462 |
| fra-est | ntrex128 | 0.50712 | 17.7 | 1997 | 38420 |
| fra-fin | ntrex128 | 0.49215 | 14.2 | 1997 | 35701 |
| fra-hun | ntrex128 | 0.44873 | 14.9 | 1997 | 44462 |
| por-est | ntrex128 | 0.48098 | 15.1 | 1997 | 38420 |
| por-fin | ntrex128 | 0.50875 | 15.0 | 1997 | 35701 |
| por-hun | ntrex128 | 0.45817 | 15.5 | 1997 | 44462 |
| spa-est | ntrex128 | 0.52158 | 18.5 | 1997 | 38420 |
| spa-fin | ntrex128 | 0.50947 | 15.2 | 1997 | 35701 |
| spa-hun | ntrex128 | 0.46051 | 16.1 | 1997 | 44462 |
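
The official scores come from the OPUS-MT-train evaluation pipeline; metrics of this kind are conventionally computed with sacrebleu. A minimal sketch of recomputing BLEU and chr-F from plain-text hypothesis and reference files (hyp.txt and ref.txt are hypothetical placeholders):

import sacrebleu

# One sentence per line; hyp.txt/ref.txt are placeholder file names.
with open("hyp.txt") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]))  # BLEU
print(sacrebleu.corpus_chrf(hyps, [refs]))  # chr-F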

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 09:01:19 EEST 2024
  • port machine: LM0-400-22516.local