	---
	license: apache-2.0
	language:
	- fr
	library_name: transformers
	tags:
	- nllb
	- commonvoice
	- orfeo
	- tedx
	- pytorch
	- pictograms
	- translation
	metrics:
	- sacrebleu
	inference: false
	---

	# t2p-nllb-200-distilled-600M-all

	t2p-nllb-200-distilled-600M-all is a text-to-pictogram translation model built by fine-tuning [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) on pairs of French transcriptions and pictogram token sequences, where each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/).
	The model is used only for inference.

	## Training details

	### Datasets

	The model was fine-tuned on four training datasets:
	- [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CommonVoice v.15.0 corpus.
	- [Propicto-orfeo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-orféo corpus.
	- Propicto-tedx dataset, which was created from the French part of the Multilingual TEDx corpus.
	- Propicto-polylexical, a dataset built from scratch with sentences and pictogram translations containing polylexical terms (only used for training to augment the data).

	All the datasets were built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/), published at LREC-COLING 2024. Each dataset was split into training, validation, and test sets, except Propicto-polylexical, which is used for training only.

	| Corpus | train | valid | test |
	|:-----------:|:-------:|:-------:|:-------:|
	| Propicto-commonvoice | 527,390 | 16,124 | 16,120 |
	| Propicto-orfeo | 231,374 | 28,796 | 29,009 |
	| Propicto-tedx | 85,106 | 749 | 804 |
	| Propicto-polylexical | 1,462 | - | - |
	| TOTAL | 845,332 | 45,669 | 45,933 |
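
The TOTAL row can be checked directly from the per-corpus counts. The short sketch below copies the figures from the table into a dictionary (the variable and function names are illustrative only) and also prints each corpus's share of the training pairs.

```python
# Split sizes copied from the table above; None marks a split that does not exist.
splits = {
    "Propicto-commonvoice": {"train": 527_390, "valid": 16_124, "test": 16_120},
    "Propicto-orfeo": {"train": 231_374, "valid": 28_796, "test": 29_009},
    "Propicto-tedx": {"train": 85_106, "valid": 749, "test": 804},
    "Propicto-polylexical": {"train": 1_462, "valid": None, "test": None},
}


def split_total(name):
    """Sum one split over all corpora, skipping corpora that lack it."""
    return sum(sizes[name] for sizes in splits.values() if sizes[name] is not None)


totals = {name: split_total(name) for name in ("train", "valid", "test")}
print(totals)

# Share of each corpus in the training data.
for corpus, sizes in splits.items():
    share = 100 * sizes["train"] / totals["train"]
    print(f"{corpus}: {share:.1f}% of the training pairs")
```

Propicto-commonvoice supplies roughly 62% of the training pairs, while Propicto-polylexical accounts for under 0.2%.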

	### Parameters

	A full list of the model parameters is available in the config.json file. The training pipeline used the following arguments:

	```python
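	from transformers import Seq2SeqTrainingArguments

	training_args = Seq2SeqTrainingArguments(
	    output_dir="checkpoints_corpus_v2/",
	    evaluation_strategy="epoch",  # named `eval_strategy` in recent transformers releases
	    save_strategy="epoch",
	    learning_rate=2e-5,
	    per_device_train_batch_size=32,
	    per_device_eval_batch_size=32,
	    weight_decay=0.01,
	    save_total_limit=3,  # keep only the 3 most recent checkpoints
	    num_train_epochs=40,
	    predict_with_generate=True,  # generate sequences during evaluation
	    fp16=True,
	    load_best_model_at_end=True,
	)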
	```

	### Evaluation

	The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.

	### Results

	| Model | validation | test |
	|:-----------:|:-----------------------:|:-----------------------:|
	| t2p-nllb-200-distilled-600M-all | 92.4 | - |

	### Environmental Impact

	Fine-tuning was performed using a single Nvidia V100 GPU with 32 GB of memory, which took 8.5 hours in total.

	## Using the t2p-nllb-200-distilled-600M-all model with Hugging Face Transformers

	```python
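	import torch
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	# Language codes and length limits (defined here but not used in the snippet below).
	source_lang = "fr"
	target_lang = "frp"
	max_input_length = 128
	max_target_length = 128

	# Use the GPU when one is available, otherwise fall back to the CPU.
	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all")
	model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-all").to(device)

	# The model and its inputs must be on the same device.
	inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids.to(device)
	# Sampling is enabled, so the output can differ between runs.
	outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
	pred = tokenizer.decode(outputs[0], skip_special_tokens=True)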
	```

	## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

	```python
	import pandas as pd

	def process_output_trad(pred):
	    """Split the predicted string into individual pictogram tokens."""
	    return pred.split()

	def read_lexicon(lexicon):
	    """Load the tab-separated lexicon and add a normalized lookup key."""
	    df = pd.read_csv(lexicon, sep='\t')
	    # Drop the ' #...' category suffix and join multi-word lemmas with underscores.
	    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
	    return df

	def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
	    """Return (pictogram ID, lemma), with ID 0 when the lemma is not in the lexicon."""
	    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
	    return (id_picto[0], lemma) if id_picto else (0, lemma)

	lexicon = read_lexicon("lexicon.csv")
	sentence_to_map = process_output_trad(pred)
	pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
	```
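
To see the lookup in isolation, here is a minimal, dependency-free sketch of the same idea using only the standard library. It assumes, as the code above does, a tab-separated lexicon with `lemma` and `id_picto` columns in which any category suffix follows ` #`. The sample rows, the IDs, and the helper names `build_lookup` and `map_tokens` are made up for illustration; the IDs are not real ARASAAC identifiers.

```python
import csv
import io

# Hypothetical lexicon excerpt; the real lexicon.csv is not reproduced here.
SAMPLE_LEXICON = (
    "lemma\tid_picto\n"
    "je #pron\t101\n"
    "manger\t102\n"
    "pomme de terre\t103\n"
)


def build_lookup(tsv_text):
    """Map each normalized lemma to the first pictogram ID listed for it."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Same normalization as read_lexicon: drop the category, join words with "_".
        key = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        lookup.setdefault(key, int(row["id_picto"]))
    return lookup


def map_tokens(lookup, prediction):
    """Return (id, token) pairs, using 0 for tokens missing from the lexicon."""
    return [(lookup.get(token, 0), token) for token in prediction.split()]


lookup = build_lookup(SAMPLE_LEXICON)
pairs = map_tokens(lookup, "je manger pomme_de_terre inconnu")
print(pairs)
```

A dictionary gives constant-time lookups, which may help when mapping many predictions, and `setdefault` keeps the first ID found for a lemma, mirroring `id_picto[0]` in the pandas version.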

	## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

	```python
	def generate_html(ids):
	    """Build a minimal HTML page with one captioned pictogram per (id, lemma) pair."""
	    html_content = '<html><head><meta charset="utf-8"></head><body>'
	    for picto_id, lemma in ids:
	        if picto_id != 0:  # ignore invalid IDs
	            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
	            html_content += f'''
	            <figure style="display:inline-block; margin:1px;">
	                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
	                <figcaption>{lemma}</figcaption>
	            </figure>
	            '''
	    html_content += '</body></html>'
	    return html_content

	html = generate_html(pictogram_ids)
	# Write as UTF-8 so that accented French lemmas are saved correctly on every platform.
	with open("pictograms.html", "w", encoding="utf-8") as file:
	    file.write(html)
	```
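
Because `generate_html` skips tokens whose ID is 0 without any warning, it can be useful to report them before rendering. The following is a small sketch that assumes the same `(id, lemma)` pair format as above; the helper name `missing_lemmas` and the sample pairs are hypothetical, and the IDs are placeholders rather than real ARASAAC identifiers.

```python
def missing_lemmas(pairs):
    """Return the lemmas whose pictogram ID is 0, i.e. those not found in the lexicon."""
    return [lemma for picto_id, lemma in pairs if picto_id == 0]


# Hypothetical (id, lemma) pairs standing in for `pictogram_ids`.
sample_pairs = [(101, "je"), (102, "manger"), (0, "inconnu"), (103, "pomme")]

missing = missing_lemmas(sample_pairs)
coverage = 1 - len(missing) / len(sample_pairs)
print(f"{len(missing)} token(s) without a pictogram: {missing}")
print(f"coverage: {coverage:.0%}")
```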

	## Information

	- Language(s): French
	- License: Apache-2.0
	- Developed by: Cécile Macaire
	- Funded by:
	  - GENCI-IDRIS (Grant 2023-AD011013625R1)
	  - PROPICTO ANR-20-CE93-0005
	- Authors:
	  - Cécile Macaire
	  - Chloé Dion
	  - Emmanuelle Esperança-Rodier
	  - Benjamin Lecouteux
	  - Didier Schwab