---
license: cc-by-4.0
language:
- fr
pipeline_tag: automatic-speech-recognition
---
# Whisper-Large-V3-Illuin-French

This model is a finetuned variant of OpenAI's [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model.
It has been finetuned on a dataset of more than 18 000 hours of French speech.


This model has been converted to and tested in several other formats, allowing use with the most popular inference frameworks:
 - transformers
 - openai-whisper
 - faster-whisper
 - whisper.cpp

The models can be found in this [collection](https://huggingface.co/collections/illuin/whisper-large-french-illuin-661684f315ea7b8f42ad7fd1).

# Training details

## Dataset Composition

The dataset is a compilation of various popular French ASR (Automatic Speech Recognition) datasets, including:

- CommonVoice 13 French
- LibriSpeech French
- African accented French
- TEDx French
- VoxPopuli French
- Fleurs French

The total dataset comprises a little over 2 500 hours of speech data from these sources.
Additionally, it includes transcribed French speech scraped from the internet.
In total, the dataset exceeds 18 000 hours of speech, making it one of the largest French ASR datasets
assembled to date.

## Dataset Processing

We aggressively filtered and cleaned the raw internet data through extensive heuristic filtering, as well as language-verification and quality-estimation models. The other data sources did not require as much preprocessing, but underwent Large Language Model verification and rephrasing (Mixtral 8x7B) for punctuation and minor corrections.
We further adapted our dataset to real-world conditions by stochastically subjecting audio to various compression codecs and simulating issues such as packet loss, to replicate call-center environments.
This extensive preprocessing pipeline yields the 18 000 hours of high-quality labeled French audio we use to train our SOTA French ASR models.
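
As a minimal sketch of the packet-loss simulation (the 20 ms frame size and 5 % drop probability below are illustrative assumptions, not the values used in training):

```python
import numpy as np

def simulate_packet_loss(waveform: np.ndarray, sample_rate: int,
                         frame_ms: int = 20, drop_prob: float = 0.05) -> np.ndarray:
    """Zero out random fixed-size frames to mimic packets lost on a call.

    frame_ms and drop_prob are illustrative; the training values are unspecified.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    out = waveform.copy()
    for start in range(0, len(out), frame_len):
        if np.random.rand() < drop_prob:
            out[start:start + frame_len] = 0.0  # a dropped packet becomes silence
    return out
```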

## Training

We trained for 2 epochs with an effective batch size of 256, a maximum learning rate of 1e-5, and a linear learning-rate scheduler with 500 warmup steps.
The full dataset being prohibitively large, we used the [MosaicML streaming library](https://docs.mosaicml.com/projects/streaming/en/stable/) to stream dataset samples and allow instant mid-epoch resumption.
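
As a minimal sketch of that setup (the shard location, cache path, and batch size are hypothetical placeholders, not our training configuration):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Stream pre-sharded (MDS-format) samples from remote storage, caching
# shards locally; resumption state is tracked by the library itself.
dataset = StreamingDataset(remote="s3://my-bucket/french-asr-shards",  # hypothetical URI
                           local="/tmp/asr-cache",
                           shuffle=True,
                           batch_size=32)
loader = DataLoader(dataset, batch_size=32, num_workers=8)
```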


# Performance

French ASR lacked a publicly available dataset recorded in real call-center conditions, akin to the Switchboard dataset in English.
To address this gap, we filtered and cleaned the [Accueil_UBS dataset sourced from Ortolang](https://huggingface.co/datasets/BrunoHays/UBS), enabling the evaluation of ASR models under conditions similar to those encountered in call-center environments.

| Model                                            | LibriSpeech | VoxPopuli | Fleurs | Accueil_UBS | Common Voice | TEDx | TEDx long form |
|--------------------------------------------------|-------------|-----------|--------|-------------|--------------|------|----------------|
| google-latest-long                               | 0.15        | 0.14      | 0.12   | 0.31        | 0.08         | 0.20 | NA             |
| azure                                            | 0.27        | 0.14      | 0.08   | 0.30        | 0.08         | 0.23 | NA             |
| [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                                 | 0.05        | 0.10      | 0.05   | 0.30        | 0.13         | 0.20 | 0.11           |
| [whisper-large-v3-french-distil-dec16](https://huggingface.co/bofenghuang/whisper-large-v3-french-distil-dec16) | **0.04**        | **0.08**      | 0.05   | 0.25        | **0.04**         | **0.10** | 0.09           |
| **whisper-large-v3-french-illuin**         | **0.04**        | **0.08**      | **0.04**   | **0.20**        | 0.07         | **0.10** | **0.08**           |
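
A sketch of computing such a score with the `evaluate` library, assuming the figures above are word error rates (WER); the transcripts below are illustrative only:

```python
import evaluate

# WER = (substitutions + deletions + insertions) / words in the reference
wer_metric = evaluate.load("wer")
predictions = ["bonjour tout le monde"]   # model transcripts (illustrative)
references = ["bonjour à tout le monde"]  # ground-truth transcripts (illustrative)
print(wer_metric.compute(predictions=predictions, references=references))
```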

# Inference

We offer the model in various formats to ensure compatibility with the most widely used inference frameworks.
Note that the model has not been finetuned with timestamps, so it cannot accurately predict timestamps on its own.
However, leveraging cross-attention allows more precise timestamps to be obtained at a lower computational cost.
In most frameworks, enabling this feature amounts to passing parameters such as `without_timestamps=True` and `word_timestamps=True`.

While the model can still receive previous text as context during inference, its performance under this condition has not been quantitatively evaluated, and enabling this option has been observed to raise the risk of hallucination with the base OpenAI model. We therefore advise disabling it (e.g. `condition_on_previous_text=False`) to mitigate potential issues.

## Examples

transformers:
```python
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"

# Load the finetuned checkpoint together with its tokenizer and feature extractor
model = WhisperForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=feature_extractor, tokenizer=tokenizer)
transcript = pipe("audio_samples/short_rd.wav", return_timestamps=False)
print(transcript)
```

openai-whisper:
```python
import whisper
# Load the converted openai-whisper checkpoint from the collection linked above;
# the path below is a placeholder for wherever you saved it.
whisper_model = whisper.load_model("path/to/whisper-large-v3-french-illuin.pt")
result = whisper_model.transcribe("long_audio.wav", temperature=0,
                                  condition_on_previous_text=False,  # avoid hallucinations
                                  language="french", without_timestamps=True, word_timestamps=True)
```

faster-whisper:
```python
from faster_whisper import WhisperModel

# fp16 CTranslate2 conversion of the model; use device="cuda" for GPU inference
model = WhisperModel("BrunoHays/whisper-large-v3-french-illuin-ctranslate2-fp16", device="cpu")

segments, info = model.transcribe("long_audio.wav",
                                  without_timestamps=True,
                                  word_timestamps=True,  # cross-attention word timestamps
                                  temperature=0,
                                  condition_on_previous_text=False,
                                  task="transcribe",
                                  language="fr")
```

Whisper.cpp:
```bash
# -mc 0 disables carrying text context between segments
# (the equivalent of condition_on_previous_text=False)
./main -f long_audio.wav -l fr -mc 0 -m ggml-model.bin
```