vpelloin/MEDIA_NLU-flaubert_base_uncased

This is a Natural Language Understanding (NLU) model for the French MEDIA benchmark. It maps each input word to an output concept tag (76 tags available).

This model was trained using flaubert/flaubert_base_uncased as its initial checkpoint. It obtained a 12.40% CER (Concept Error Rate; lower is better) on the MEDIA test set, as reported in our Interspeech 2023 publication, using Kaldi ASR transcriptions.

Available MEDIA NLU models:

vpelloin/MEDIA_NLU-flaubert_base_cased: MEDIA NLU model trained using flaubert/flaubert_base_cased. Obtains 13.20% CER on MEDIA test.
vpelloin/MEDIA_NLU-flaubert_base_uncased: MEDIA NLU model trained using flaubert/flaubert_base_uncased. Obtains 12.40% CER on MEDIA test.
vpelloin/MEDIA_NLU-flaubert_oral_ft: MEDIA NLU model trained using nherve/flaubert-oral-ft. Obtains 11.98% CER on MEDIA test.
vpelloin/MEDIA_NLU-flaubert_oral_mixed: MEDIA NLU model trained using nherve/flaubert-oral-mixed. Obtains 12.47% CER on MEDIA test.
vpelloin/MEDIA_NLU-flaubert_oral_asr: MEDIA NLU model trained using nherve/flaubert-oral-asr. Obtains 12.43% CER on MEDIA test.
vpelloin/MEDIA_NLU-flaubert_oral_asr_nb: MEDIA NLU model trained using nherve/flaubert-oral-asr_nb. Obtains 12.24% CER on MEDIA test.
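
To give a feel for what the CER figures above measure, here is a minimal sketch, assuming CER is computed like a word error rate but over concept sequences (substitutions, deletions, and insertions divided by the number of reference concepts). The function name `concept_error_rate` and the concept labels are hypothetical illustrations, not the official MEDIA scoring tool or tag set.

```python
def concept_error_rate(reference, hypothesis):
    """Edit distance between two concept sequences, divided by the reference length.

    Assumption: a WER-style definition; the official scoring script may differ.
    """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # dist[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # i deletions
    for j in range(cols):
        dist[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    # Guard against an empty reference to avoid dividing by zero
    return dist[-1][-1] / max(len(reference), 1)


# Hypothetical concept labels, for illustration only
reference = ["command", "room-type", "city", "date"]
hypothesis = ["command", "city", "time"]
cer = concept_error_rate(reference, hypothesis)
print(f"CER: {cer:.2%}")
```

In this toy case one concept is deleted and one is substituted, so two errors are counted against four reference concepts.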

Usage with Pipeline

from transformers import pipeline

# Build a token-classification pipeline that loads the model from the Hub
generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_base_uncased",
    task="token-classification"
)

# Example French utterances (lowercase, as expected by the uncased model)
sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Print one (token, concept tag) pair per token of each sentence
for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])

Usage with AutoTokenizer/AutoModel

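from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)

# Load the tokenizer and the fine-tuned token-classification model
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]
# Pad to the longest sentence so the batch forms a single tensor
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
# Logits have shape (batch, sequence length, number of labels)
outputs = model(**inputs).logits
# Keep the highest-scoring label id per token and map it to its tag name
# (padded and special-token positions are included in the printed lists)
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])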
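
Because the batch above is padded, the printed label lists also contain predictions for padding positions. The following is a minimal sketch of one way to drop them, using the attention mask to keep only real tokens. The helper name `labels_without_padding`, the toy tokens, and the toy label map are all hypothetical; they stand in for the model's actual outputs so the example runs without downloading anything.

```python
def labels_without_padding(token_rows, label_id_rows, mask_rows, id2label):
    """Pair each real token with its predicted tag, skipping padded positions."""
    results = []
    for tokens, label_ids, mask in zip(token_rows, label_id_rows, mask_rows):
        results.append([
            (token, id2label[label_id])
            for token, label_id, keep in zip(tokens, label_ids, mask)
            if keep == 1  # attention mask is 1 for real tokens, 0 for padding
        ])
    return results


# Toy stand-ins for model outputs (hypothetical labels and tokens)
id2label = {0: "O", 1: "nombre", 2: "objet"}
token_rows = [["deux", "nuits", "<pad>"], ["une", "chambre", "double"]]
label_id_rows = [[1, 2, 0], [1, 2, 2]]
mask_rows = [[1, 1, 0], [1, 1, 1]]

pairs = labels_without_padding(token_rows, label_id_rows, mask_rows, id2label)
for row in pairs:
    print(row)
```

With the real model, the three inputs could be obtained from `tokenizer.convert_ids_to_tokens(ids)` for each row of `inputs["input_ids"]`, from `outputs.argmax(dim=-1).tolist()`, and from `inputs["attention_mask"].tolist()`. Special tokens added by the tokenizer have a mask value of 1, so they would still appear and may need separate filtering.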

Reference

If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the following paper:

@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine Laurent and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}