---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- nllb
- commonvoice
- orfeo
- tedx
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
inference: false
---

# t2p-nllb-200-distilled-600M-all

*t2p-nllb-200-distilled-600M-all* is a text-to-pictograms translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is intended for **inference** only.

## Training details

### Datasets

The model was fine-tuned on four training datasets:
- [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CommonVoice v.15.0 corpus.
- [Propicto-orfeo dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CEFC-orféo corpus.
- Propicto-tedx dataset, created from the French part of the Multilingual TEDx corpus.
- Propicto-polylexical, a dataset built from scratch with sentences and pictogram translations containing polylexical terms (used only for training, to augment the data).

All the datasets were built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-COLING 2024. Each dataset was split into training, validation, and test sets, except Propicto-polylexical, which has a training set only.


| **Corpus** | **train** | **valid** | **test** |
|:-----------:|:-------:|:-------:|:-------:|
| Propicto-commonvoice | 527,390 | 16,124 | 16,120 |
| Propicto-orfeo | 231,374 | 28,796 | 29,009 |
| Propicto-tedx | 85,106 | 749 | 804 |
| Propicto-polylexical | 1,462 | - | - |
|**TOTAL** | **845,332** | **45,669** | **45,933** |

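
To make the relative weight of each corpus concrete, the training counts from the table above can be turned into percentage shares. This is only a small arithmetic sketch; the variable names are illustrative and not part of the training pipeline.

```python
# Training-set sizes copied from the table above.
train_sizes = {
    "Propicto-commonvoice": 527_390,
    "Propicto-orfeo": 231_374,
    "Propicto-tedx": 85_106,
    "Propicto-polylexical": 1_462,
}

total = sum(train_sizes.values())

# Share of each corpus in the combined training set, as a percentage.
shares = {name: 100 * size / total for name, size in train_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"total: {total:,}")
```

Propicto-commonvoice contributes well over half of the training pairs, while Propicto-polylexical is a small augmentation set.
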
### Parameters

A full list of the parameters is available in the `config.json` file. These are the arguments used in the training pipeline:

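```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_corpus_v2/",
    # Named `eval_strategy` in recent transformers releases.
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",        # save at the same interval as evaluation
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,           # limit the number of checkpoints kept on disk
    num_train_epochs=40,
    predict_with_generate=True,   # call generate() during evaluation
    fp16=True,                    # mixed-precision training
    load_best_model_at_end=True
)
```
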
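
As a rough guide to what these arguments imply, the number of optimizer steps can be estimated from the training-set total reported in the Datasets section. The sketch assumes one device, no gradient accumulation, and that all 40 epochs are actually run. The card does not state the last two points, so the result is best read as an upper bound.

```python
import math

train_examples = 845_332   # total training pairs from the Datasets table
batch_size = 32            # per_device_train_batch_size
epochs = 40                # num_train_epochs

# Assumption: one device and no gradient accumulation, so one batch is one step.
steps_per_epoch = math.ceil(train_examples / batch_size)
max_steps = steps_per_epoch * epochs

print(steps_per_epoch, max_steps)
```

Under the same assumptions, the `"epoch"` strategies trigger one evaluation pass and one checkpoint every `steps_per_epoch` steps.
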
### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py) by comparing the model hypothesis against the reference pictogram translation.

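
To give an intuition for how a hypothesis is compared with a reference, the sketch below computes clipped (modified) n-gram precision, the quantity that BLEU-style metrics build on. It is not sacreBLEU itself: there is no brevity penalty, smoothing, or tokenization step, and the two pictogram token sequences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

# Hypothetical pictogram token sequences, for illustration only.
reference = "je manger un pomme"
hypothesis = "je manger pomme"

for n in (1, 2):
    print(n, clipped_precision(hypothesis, reference, n))
```

Here every hypothesis token appears in the reference, but only one of its two bigrams does, which is why scores of this kind drop when a token is missing or out of order.
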
### Results

| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-nllb-200-distilled-600M-all | 92.4 | - |

### Environmental Impact

Fine-tuning ran on a single NVIDIA V100 GPU with 32 GB of memory and took 8.5 hours in total.

## Using the t2p-nllb-200-distilled-600M-all model with Hugging Face transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

# Use the GPU when available; the model and the inputs must be on the same device.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all")
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all").to(device)

# Tokenize the French sentence and generate the pictogram token sequence.
# Sampling (do_sample=True) means the output can vary between runs.
inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

def process_output_trad(pred):
    # Split the predicted string into individual pictogram tokens.
    return pred.split()

def read_lexicon(lexicon):
    # Load the tab-separated lexicon and add a normalized lemma column:
    # drop the ' #...' suffix and join multi-word lemmas with underscores.
    df = pd.read_csv(lexicon, sep='\t')
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    # Return (pictogram ID, lemma); the ID is 0 when the lemma is not in the lexicon.
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```

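
The lookup above scans the lexicon once per predicted token. When many sentences are mapped, one possible alternative is to build a dictionary once and query it directly. The sketch below uses only the standard library and assumes the same `lemma` and `id_picto` columns as the code above. The helper names `build_lemma_index` and `map_tokens`, the sample rows, and the IDs are all invented for illustration.

```python
import csv
import io

def build_lemma_index(tsv_file):
    """Map each normalized lemma to the first pictogram ID listed for it."""
    index = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Same normalization as read_lexicon: drop the ' #...' suffix,
        # trim whitespace, and join multi-word lemmas with underscores.
        key = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        index.setdefault(key, int(row["id_picto"]))
    return index

def map_tokens(index, pred):
    """Return (id_picto, lemma) pairs, using 0 for lemmas missing from the lexicon."""
    return [(index.get(lemma, 0), lemma) for lemma in pred.split()]

# Hypothetical lexicon rows and IDs, for illustration only.
sample = io.StringIO("lemma\tid_picto\nmanger #verbe\t101\npomme de terre\t102\n")
index = build_lemma_index(sample)
print(map_tokens(index, "manger pomme_de_terre inconnu"))
```

The result has the same `(id, lemma)` shape as `pictogram_ids`, so it can be passed to the HTML step unchanged.
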
## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
def generate_html(ids):
    # Declare the encoding so accented lemmas display correctly in the browser.
    html_content = '<html><head><meta charset="utf-8"></head><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # ignore invalid IDs
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
                <figcaption>{lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content

html = generate_html(pictogram_ids)
# Write as UTF-8 so accented French lemmas are preserved.
with open("pictograms.html", "w", encoding="utf-8") as file:
    file.write(html)
```

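
Two optional refinements to the HTML step are sketched below, using only the standard library: escaping each lemma before it is placed in the markup, and listing the lemmas that have no pictogram (ID 0) so that skipped tokens are visible. The helper names `render_figure` and `unmatched_lemmas` and the example pairs are hypothetical; the image URL pattern is the one used above.

```python
from html import escape

def render_figure(picto_id, lemma):
    """Return one <figure> element with the lemma escaped for HTML."""
    img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
    safe_lemma = escape(lemma, quote=True)
    return (
        '<figure style="display:inline-block; margin:1px;">'
        f'<img src="{img_url}" alt="{safe_lemma}" width="200" height="200" />'
        f'<figcaption>{safe_lemma}</figcaption>'
        '</figure>'
    )

def unmatched_lemmas(ids):
    """List the lemmas that have no pictogram (ID 0) and would be skipped."""
    return [lemma for picto_id, lemma in ids if picto_id == 0]

# Hypothetical (id, lemma) pairs, for illustration only.
example_ids = [(101, "manger"), (0, "inconnu"), (102, "aujourd'hui")]
print(render_figure(*example_ids[2]))
print(unmatched_lemmas(example_ids))
```

Escaping matters mainly for tokens that contain quotes, apostrophes, or ampersands, which could otherwise break the `alt` attribute.
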
## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab