---
language: fr
license: mit
tags:
- bert
- language-model
- flaubert
- french
- flaubert-base
- uncased
- asr
- speech
- oral
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
---

# FlauBERT-Oral models: Using ASR-Generated Text for Spoken Language Modeling

The **FlauBERT-Oral** models are French BERT models trained on a very large corpus of automatic speech recognition (ASR) transcripts covering 350,000 hours of diverse French TV shows. They were trained with the [**FlauBERT software**](https://github.com/getalp/Flaubert) using the same parameters as the [flaubert-base-uncased](https://huggingface.co/flaubert/flaubert_base_uncased) model (12 layers, 12 attention heads, 768-dimensional hidden states, 137M parameters, uncased).

## Available FlauBERT-Oral models

- `flaubert-oral-asr`: trained from scratch on ASR data, keeping the BPE tokenizer and vocabulary of flaubert-base-uncased
- `flaubert-oral-asr_nb`: trained from scratch on ASR data, with a BPE tokenizer also trained on the same corpus
- `flaubert-oral-mixed`: trained from scratch on a mixed corpus of ASR and text data, with a BPE tokenizer also trained on the same corpus
- `flaubert-oral-ft`: flaubert-base-uncased fine-tuned for a few epochs on ASR data

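Because two of the variants listed above use their own BPE tokenizer, the tokenizer and the weights should always be loaded from the same repository. The following is a minimal sketch of that pattern, not part of the released code: `hub_id` and `load_variant` are hypothetical helpers, and the `nherve/` namespace is confirmed by this card only for `flaubert-oral-asr` and assumed for the other three variants.

```python
# Variant names are taken from the list above. The "nherve" namespace is
# confirmed only for flaubert-oral-asr; it is assumed for the other variants.
VARIANTS = (
    "flaubert-oral-asr",
    "flaubert-oral-asr_nb",
    "flaubert-oral-mixed",
    "flaubert-oral-ft",
)


def hub_id(variant, namespace="nherve"):
    """Build the Hub repository id for a variant (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown FlauBERT-Oral variant: {variant!r}")
    return f"{namespace}/{variant}"


def load_variant(variant):
    """Load a variant's tokenizer and encoder from the same repository."""
    # Imported lazily so that hub_id() stays usable without transformers installed
    from transformers import FlaubertModel, FlaubertTokenizer

    repo = hub_id(variant)
    # asr_nb and mixed were trained with their own BPE tokenizer, so the
    # tokenizer must come from the same repository as the weights
    tokenizer = FlaubertTokenizer.from_pretrained(repo)
    model = FlaubertModel.from_pretrained(repo)
    return tokenizer, model
```

Calling `load_variant("flaubert-oral-mixed")` would download the weights on first use, so it is best run once and cached.
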
## Usage for sequence classification
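
```python
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

# Load the tokenizer and the pretrained encoder with a classification head
flaubert_tokenizer = FlaubertTokenizer.from_pretrained("nherve/flaubert-oral-asr")
flaubert_classif = FlaubertForSequenceClassification.from_pretrained("nherve/flaubert-oral-asr", num_labels=14)  # set num_labels to your number of classes
# Average the hidden states over all tokens to build the sequence representation
flaubert_classif.sequence_summary.summary_type = 'mean'

# Then, train your model
```
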
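The training step itself is left open above. One possible shape for it is sketched below, assuming a PyTorch optimizer and labels supplied as a tensor of class indices; `train_step` is a hypothetical helper, not the procedure used by the authors.

```python
def train_step(model, tokenizer, optimizer, texts, labels):
    """Run one optimization step on a batch and return the loss as a float.

    `labels` is expected to be a tensor of class indices, one per text.
    """
    # Tokenize the batch; padding aligns sequences of different lengths
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    model.train()
    # Passing labels makes the model compute the classification loss itself
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

With the objects created in the usage snippet, a call could look like `train_step(flaubert_classif, flaubert_tokenizer, torch.optim.AdamW(flaubert_classif.parameters(), lr=2e-5), ["bonsoir à tous"], torch.tensor([3]))`. The learning rate, the example sentence, and the label value are placeholders.
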
## References

If you use FlauBERT-Oral models in a scientific publication, or if you find the resources in this repository useful, please cite the following paper:

```
@InProceedings{herve2022flaubertoral,
  author    = {Herv\'{e}, Nicolas and Pelloin, Valentin and Favre, Benoit and Dary, Franck and Laurent, Antoine and Meignier, Sylvain and Besacier, Laurent},
  title     = {Using ASR-Generated Text for Spoken Language Modeling},
  booktitle = {Proceedings of "Challenges & Perspectives in Creating Large Language Models" ACL 2022 Workshop},
  month     = {May},
  year      = {2022}
}
```