opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa) to Indic languages (inc).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines use the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-05-30
License: Apache-2.0
Language(s):
- Source Language(s): deu eng fra por spa
- Target Language(s): anp asm awa ben bho bpy div dty gbm guj hif hin hne hns kas kok lah mag mai mar nep npi ori pan pli rhg rmy rom san sin skr snd syl urd
- Valid Target Language Labels: >>aee<< >>aeq<< >>anp<< >>anr<< >>asm<< >>awa<< >>bdv<< >>ben<< >>bfb<< >>bfy<< >>bfz<< >>bgc<< >>bgd<< >>bge<< >>bgw<< >>bha<< >>bhb<< >>bhd<< >>bhe<< >>bhi<< >>bho<< >>bht<< >>bhu<< >>bjj<< >>bkk<< >>bmj<< >>bns<< >>bpx<< >>bpy<< >>bra<< >>btv<< >>ccp<< >>cdh<< >>cdi<< >>cdj<< >>cih<< >>clh<< >>ctg<< >>dcc<< >>dhn<< >>dho<< >>div<< >>dmk<< >>dml<< >>doi<< >>dry<< >>dty<< >>dub<< >>duh<< >>dwz<< >>emx<< >>gas<< >>gbk<< >>gbl<< >>gbm<< >>gdx<< >>ggg<< >>ghr<< >>gig<< >>gjk<< >>glh<< >>gra<< >>guj<< >>gwc<< >>gwf<< >>gwt<< >>haj<< >>hca<< >>hif<< >>hif_Latn<< >>hii<< >>hin<< >>hin_Latn<< >>hlb<< >>hne<< >>hns<< >>jdg<< >>jml<< >>jnd<< >>jns<< >>kas<< >>kas_Arab<< >>kas_Deva<< >>kbu<< >>keq<< >>key<< >>kfr<< >>kfs<< >>kft<< >>kfu<< >>kfv<< >>kfx<< >>kfy<< >>khn<< >>khw<< >>kjo<< >>kls<< >>kok<< >>kra<< >>ksy<< >>kvx<< >>kxp<< >>kyw<< >>lah<< >>lbm<< >>lhl<< >>lmn<< >>lss<< >>luv<< >>mag<< >>mai<< >>mar<< >>mby<< >>mjl<< >>mjz<< >>mkb<< >>mke<< >>mki<< >>mvy<< >>mwr<< >>nag<< >>nep<< >>nhh<< >>nli<< >>nlx<< >>noe<< >>noi<< >>npi<< >>odk<< >>omr<< >>ori<< >>ort<< >>pan<< >>pan_Guru<< >>paq<< >>pcl<< >>pgg<< >>phd<< >>phl<< >>pli<< >>plk<< >>plp<< >>pmh<< >>psh<< >>psi<< >>psu<< >>pwr<< >>raj<< >>rei<< >>rhg<< >>rhg_Latn<< >>rjs<< >>rkt<< >>rmi<< >>rmq<< >>rmt<< >>rmy<< >>rom<< >>rtw<< >>san<< >>san_Deva<< >>saz<< >>sbn<< >>sck<< >>scl<< >>sdg<< >>sdr<< >>shd<< >>sin<< >>sjp<< >>skr<< >>smm<< >>smv<< >>snd<< >>snd_Arab<< >>soi<< >>srx<< >>ssi<< >>sts<< >>syl<< >>syl_Sylo<< >>tdb<< >>the<< >>thl<< >>thq<< >>thr<< >>tkb<< >>tkt<< >>tnv<< >>tra<< >>trw<< >>urd<< >>ush<< >>vaa<< >>vah<< >>vas<< >>vav<< >>ved<< >>vgr<< >>wsv<< >>wtm<< >>xka<< >>xxx<<
Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. Each input sentence must start with a language token of the form >>id<<, where id is a valid target language ID, e.g. >>anp<<.
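The token requirement above can be sketched as a small helper that prepends the label and rejects unknown ones. This is a minimal illustration, not part of the OPUS-MT tooling: the names `add_target_token` and `EXAMPLE_LABELS` are hypothetical, and the label set is only a small subset of the valid labels listed above.

```python
import re

# Matches a sentence-initial target token such as ">>hin<<" or ">>kas_Deva<<".
_TOKEN_RE = re.compile(r"^>>[a-z]{3}(?:_[A-Z][a-z]{3})?<<")

# Illustrative subset of the valid target language labels (assumption:
# extend this with the full list from the model card as needed).
EXAMPLE_LABELS = frozenset({"anp", "ben", "hin", "hin_Latn", "kas_Deva", "mar", "urd"})


def add_target_token(text: str, lang: str, valid_labels=EXAMPLE_LABELS) -> str:
    """Return `text` prefixed with the >>lang<< token expected by the model."""
    if lang not in valid_labels:
        raise ValueError(f"unknown target language label: {lang!r}")
    cleaned = text.strip()
    if _TOKEN_RE.match(cleaned):
        # Avoid producing inputs like ">>hin<< >>urd<< ...".
        raise ValueError("text already starts with a target language token")
    return f">>{lang}<< {cleaned}"


if __name__ == "__main__":
    print(add_target_token("How are you?", "hin"))
```

Validating the label before calling the model is a design choice for this sketch; it catches typos in the label early instead of sending a malformed prefix to the model.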

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>anp<< Replace this with text in an accepted source language.",
    ">>urd<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc")
print(pipe(">>anp<< Replace this with text in an accepted source language."))
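For longer lists of sentences, it can help to translate in fixed-size batches rather than in one call. The following is a rough sketch under stated assumptions: `batched` and `translate_all` are hypothetical helper names, the batch size of 8 is arbitrary, and the transformers import is done lazily inside the function so that the batching helper can be used on its own. Every input is assumed to already carry its >>id<< target token.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(
    texts: List[str],
    model_name: str = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc",
    batch_size: int = 8,
) -> List[str]:
    """Translate `texts` batch by batch; each text must start with >>id<<."""
    # Imported here so that `batched` works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    outputs: List[str] = []
    for batch in batched(texts, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**encoded)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Calling `translate_all([">>hin<< Good morning.", ">>urd<< Thank you."])` downloads the model on first use and returns one translation per input, in input order.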

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
eng-ben	tatoeba-test-v2021-08-07	0.48316	18.1	2500	11654
eng-hin	tatoeba-test-v2021-08-07	0.52587	28.1	5000	32904
eng-mar	tatoeba-test-v2021-08-07	0.52516	24.2	10396	61140
eng-urd	tatoeba-test-v2021-08-07	0.46228	18.8	1663	12155
deu-ben	flores101-devtest	0.44269	10.8	1012	21155
deu-hin	flores101-devtest	0.48314	21.9	1012	27743
eng-ben	flores101-devtest	0.51768	17.4	1012	21155
eng-guj	flores101-devtest	0.54325	22.7	1012	23840
eng-hin	flores101-devtest	0.58472	34.1	1012	27743
fra-ben	flores101-devtest	0.44304	11.1	1012	21155
fra-hin	flores101-devtest	0.48245	22.5	1012	27743
deu-ben	flores200-devtest	0.44696	11.3	1012	21155
deu-guj	flores200-devtest	0.40939	12.0	1012	23840
deu-hin	flores200-devtest	0.48864	22.7	1012	27743
deu-hne	flores200-devtest	0.43166	14.2	1012	26582
deu-mag	flores200-devtest	0.43058	14.2	1012	26516
deu-urd	flores200-devtest	0.41167	14.3	1012	28098
eng-ben	flores200-devtest	0.52088	17.7	1012	21155
eng-guj	flores200-devtest	0.54758	23.2	1012	23840
eng-hin	flores200-devtest	0.58825	34.4	1012	27743
eng-hne	flores200-devtest	0.46144	19.1	1012	26582
eng-mag	flores200-devtest	0.50291	21.9	1012	26516
eng-mar	flores200-devtest	0.49344	15.6	1012	21810
eng-pan	flores200-devtest	0.45635	18.4	1012	27451
eng-sin	flores200-devtest	0.45683	11.8	1012	23278
eng-urd	flores200-devtest	0.48224	20.6	1012	28098
fra-ben	flores200-devtest	0.44486	11.1	1012	21155
fra-guj	flores200-devtest	0.41021	12.2	1012	23840
fra-hin	flores200-devtest	0.48632	22.7	1012	27743
fra-hne	flores200-devtest	0.42777	13.8	1012	26582
fra-mag	flores200-devtest	0.42725	14.3	1012	26516
fra-urd	flores200-devtest	0.40901	13.6	1012	28098
por-ben	flores200-devtest	0.43877	10.7	1012	21155
por-hin	flores200-devtest	0.50121	23.9	1012	27743
por-hne	flores200-devtest	0.42270	14.1	1012	26582
por-mag	flores200-devtest	0.42146	13.7	1012	26516
por-san	flores200-devtest	0.09879	0.4	1012	18253
por-urd	flores200-devtest	0.41225	14.5	1012	28098
spa-ben	flores200-devtest	0.42040	8.8	1012	21155
spa-hin	flores200-devtest	0.43977	16.4	1012	27743
eng-hin	newstest2014	0.51541	24.0	2507	60872
eng-guj	newstest2019	0.57815	25.7	998	21924
deu-ben	ntrex128	0.44384	9.9	1997	40095
deu-hin	ntrex128	0.43252	17.0	1997	55219
deu-urd	ntrex128	0.41844	14.8	1997	54259
eng-ben	ntrex128	0.52381	17.3	1997	40095
eng-guj	ntrex128	0.49386	17.2	1997	45335
eng-hin	ntrex128	0.52696	27.4	1997	55219
eng-mar	ntrex128	0.45244	10.8	1997	42375
eng-nep	ntrex128	0.43339	8.8	1997	40570
eng-pan	ntrex128	0.46534	19.5	1997	54355
eng-sin	ntrex128	0.44124	10.5	1997	44429
eng-urd	ntrex128	0.50060	22.4	1997	54259
fra-ben	ntrex128	0.42857	9.4	1997	40095
fra-hin	ntrex128	0.42777	17.4	1997	55219
fra-urd	ntrex128	0.41229	14.3	1997	54259
por-ben	ntrex128	0.44134	10.1	1997	40095
por-hin	ntrex128	0.43461	17.7	1997	55219
por-urd	ntrex128	0.41777	14.5	1997	54259
spa-ben	ntrex128	0.45329	10.6	1997	40095
spa-hin	ntrex128	0.43747	17.9	1997	55219
spa-urd	ntrex128	0.41929	14.6	1997	54259
eng-ben	tico19-test	0.51850	18.6	2100	51695
eng-hin	tico19-test	0.62999	41.9	2100	62680
eng-mar	tico19-test	0.45968	13.0	2100	50872
eng-nep	tico19-test	0.54373	18.7	2100	48363
eng-urd	tico19-test	0.50920	21.7	2100	65312
fra-hin	tico19-test	0.48666	25.6	2100	62680
fra-nep	tico19-test	0.41414	10.0	2100	48363
por-ben	tico19-test	0.45609	12.7	2100	51695
por-hin	tico19-test	0.55530	31.2	2100	62680
por-mar	tico19-test	0.40344	9.7	2100	50872
por-nep	tico19-test	0.47698	12.4	2100	48363
por-urd	tico19-test	0.44747	15.6	2100	65312
spa-ben	tico19-test	0.46418	13.3	2100	51695
spa-hin	tico19-test	0.55526	31.0	2100	62680
spa-mar	tico19-test	0.41189	10.0	2100	50872
spa-nep	tico19-test	0.47414	12.1	2100	48363
spa-urd	tico19-test	0.44788	15.6	2100	65312
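The score table is tab-separated, so it can be loaded with the standard library for quick comparisons. The sketch below is illustrative only: `load_scores` and `best_source_per_target` are hypothetical helper names, and `SAMPLE` holds a handful of flores200-devtest rows copied from the table above rather than the full results.

```python
import csv
import io

# A few rows copied verbatim from the table above (flores200-devtest only).
SAMPLE = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "deu-ben\tflores200-devtest\t0.44696\t11.3\t1012\t21155\n"
    "eng-ben\tflores200-devtest\t0.52088\t17.7\t1012\t21155\n"
    "deu-hin\tflores200-devtest\t0.48864\t22.7\t1012\t27743\n"
    "eng-hin\tflores200-devtest\t0.58825\t34.4\t1012\t27743\n"
    "por-hin\tflores200-devtest\t0.50121\t23.9\t1012\t27743\n"
    "spa-hin\tflores200-devtest\t0.43977\t16.4\t1012\t27743\n"
)


def load_scores(tsv_text: str) -> list:
    """Parse the tab-separated score table into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scores = []
    for row in reader:
        source, target = row["langpair"].split("-")
        scores.append({
            "source": source,
            "target": target,
            "testset": row["testset"],
            "chrf": float(row["chr-F"]),
            "bleu": float(row["BLEU"]),
        })
    return scores


def best_source_per_target(scores: list) -> dict:
    """Map each target language to its highest-BLEU row."""
    best = {}
    for score in scores:
        current = best.get(score["target"])
        if current is None or score["bleu"] > current["bleu"]:
            best[score["target"]] = score
    return best


if __name__ == "__main__":
    for target, row in best_source_per_target(load_scores(SAMPLE)).items():
        print(target, row["source"], row["bleu"], row["chrf"])
```

Comparing rows is only meaningful within a single test set, which is why the sample is restricted to flores200-devtest.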

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite these if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Tue Oct 8 10:09:07 EEST 2024
port machine: LM0-400-22516.local
