---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- NMT
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
inference: false
---

# t2p-nmt-commonvoice

*t2p-nmt-commonvoice* is a text-to-pictograms translation model built by training the [NMT](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md) model from scratch on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is used only for **inference**.

## Training details

The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md).

### Datasets

The [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto) is used, which was created from the CommonVoice v.15.0 corpus.
This dataset was built with the method described in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), presented at LREC-Coling 2024.
The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 527,390 |
| valid | 16,124 |
| test | 16,120 |

### Parameters

These are the arguments used in the training pipeline:

```bash
fairseq-train \
    data-bin/commonvoice.tokenized.fr-frp \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir exp_commonvoice/checkpoints/nmt_fr_frp_commonvoice \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-epoch 40 \
    --keep-best-checkpoints 5 \
    --keep-last-epochs 5
```

### Evaluation

The model was evaluated with sacreBLEU by comparing the reference pictogram translation with the model hypothesis.

```bash
fairseq-generate exp_commonvoice/data-bin/commonvoice.tokenized.fr-frp \
    --path exp_commonvoice/checkpoints/nmt_fr_frp_commonvoice/checkpoint.best_bleu_86.0600.pt \
    --batch-size 128 --beam 5 --remove-bpe > gen_cv.out
```

The output file contains the following information:

```txt
S-2724 la planète terre
T-2724 le planète_terre
H-2724 -0.08702446520328522 le planète_terre
D-2724 -0.08702446520328522 le planète_terre
P-2724 -0.1058 -0.0340 -0.1213
Generate test with beam=5: BLEU4 = 82.60, 92.5/85.5/79.5/74.1 (BP=1.000, ratio=1.027, syslen=138507, reflen=134811)
```

### Results

BLEU scores compared to other translation models:

| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-commonvoice | 86.3 | 86.5 |
| **t2p-nmt-commonvoice** | 86.0 | 82.6 |
| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
| t2p-nllb-200-distilled-600M-commonvoice | **87.4** | **87.6** |

### Environmental Impact

Training was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 2 hours in total.

## Using t2p-nmt-commonvoice model

The scripts to use the *t2p-nmt-commonvoice* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).

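To inspect the `fairseq-generate` output shown in the Evaluation section, the following minimal Python sketch groups the `S-`/`T-`/`H-`/`D-` lines by sentence id. It is an illustration only and is not part of the linked repository: the function name `parse_generate_output` is hypothetical, and the sketch assumes that each line starts with a `<tag>-<id>` field followed by whitespace (tabs or spaces), and that `H-` and `D-` lines carry a score before the text.

```python
def parse_generate_output(lines):
    """Group S/T/H/D lines by sentence id (hypothetical helper).

    S = source, T = reference, H = hypothesis, D = detokenized hypothesis.
    P lines (per-token scores) and the final summary line are skipped.
    """
    records = {}
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if not parts:
            continue
        tag, sep, sent_id = parts[0].partition("-")
        if not sep or tag not in {"S", "T", "H", "D"} or not sent_id.isdigit():
            continue
        rest = parts[1] if len(parts) > 1 else ""
        entry = records.setdefault(int(sent_id), {})
        if tag in {"H", "D"}:
            # Assumed layout: "<score> <text>"; the text may be empty.
            score_and_text = rest.split(None, 1)
            if not score_and_text:
                continue
            entry[tag + "_score"] = float(score_and_text[0])
            entry[tag] = score_and_text[1] if len(score_and_text) > 1 else ""
        else:
            entry[tag] = rest
    return records


# Sample lines copied from the Evaluation section (tab-separated here).
sample = """S-2724\tla planète terre
T-2724\tle planète_terre
H-2724\t-0.08702446520328522\tle planète_terre
D-2724\t-0.08702446520328522\tle planète_terre
P-2724\t-0.1058 -0.0340 -0.1213
Generate test with beam=5: BLEU4 = 82.60
"""
records = parse_generate_output(sample.splitlines())
print(records[2724]["T"], "|", records[2724]["D"])
# prints: le planète_terre | le planète_terre
```

The reference (`T`) and hypothesis (`D`) strings collected this way could then be passed as two lists to sacreBLEU, for example through `sacrebleu.corpus_bleu(hypotheses, [references])`, to recompute a corpus-level score outside Fairseq.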
## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by**
    - GENCI-IDRIS (Grant 2023-AD011013625R1)
    - PROPICTO ANR-20-CE93-0005
- **Authors**
    - Cécile Macaire
    - Chloé Dion
    - Emmanuelle Esperança-Rodier
    - Benjamin Lecouteux
    - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```