BrunoHays committed
Commit
371dc7c
1 Parent(s): 7ef734b

Create README.md

Files changed (1): README.md (added, +109 lines)
---
license: cc-by-4.0
language:
- fr
pipeline_tag: automatic-speech-recognition
---
# Whisper-Large-V3-Illuin-French

This model is a finetuned variant of OpenAI's [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model.
It has been finetuned on a dataset of more than 18 000 hours of French speech.

The model has been converted to, and tested in, several other formats so that it can be used with the most popular inference frameworks:
- transformers
- openai-whisper
- faster-whisper
- whisper.cpp

# Training details

## Dataset composition

The dataset is a compilation of various popular French ASR (Automatic Speech Recognition) datasets, including:

- CommonVoice 13 French
- LibriSpeech French
- African accented French
- TEDx French
- VoxPopuli French
- Fleurs French

These sources amount to a little over 2 500 hours of speech data.
Additionally, the dataset includes transcribed French speech scraped from the internet.
In total, it exceeds 18 000 hours of speech data, which makes it one of the largest French ASR datasets assembled to date.

## Dataset processing

The scraped dataset contained many bogus transcriptions. To filter them out, we took inspiration from the [original Whisper paper](https://cdn.openai.com/papers/whisper.pdf) and removed samples matching any of the following conditions:
- samples containing no punctuation (probably automatically generated)
- samples where either the audio or the transcription was not in French
- samples whose WER against an openai whisper-medium transcription was very high

As a result, we removed more than half of the content and obtained a French ASR dataset of roughly 16 000 hours. A sketch of these filters is given below.
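
The exact filtering code is not published; here is a minimal sketch of the three filters, assuming language tags from an upstream language-ID model, a precomputed whisper-medium transcript per sample, and an arbitrary WER threshold:

```python
# Illustrative sketch only: helper inputs and the 0.5 threshold are assumptions.
import jiwer  # pip install jiwer

PUNCTUATION = set(".,;:!?")

def keep_sample(transcript: str, transcript_lang: str, audio_lang: str,
                whisper_medium_transcript: str, max_wer: float = 0.5) -> bool:
    """Return True if the sample passes all three cleaning filters."""
    # 1. Drop transcripts with no punctuation (probably machine-generated).
    if not any(char in PUNCTUATION for char in transcript):
        return False
    # 2. Drop samples where the audio or the transcription is not French.
    if transcript_lang != "fr" or audio_lang != "fr":
        return False
    # 3. Drop samples that disagree too much with whisper-medium's output.
    if jiwer.wer(transcript, whisper_medium_transcript) > max_wer:
        return False
    return True
```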

To compile the "classic datasets", extensive filtering wasn't necessary thanks to their cleaner nature. Our main task was adding punctuation to datasets lacking it, such as VoxPopuli. To achieve this, we used the Mixtral-8x7B model to generate punctuated annotations; a sketch of this step follows. VoxPopuli also presented some encoding and OCR issues, which the LLM fixed in the same pass.
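
The exact prompt is not published; a minimal sketch of such a punctuation-restoration call, assuming the hosted Inference API and the public instruct variant of Mixtral (the prompt wording is illustrative):

```python
# Sketch only: the prompt and client used in the real pipeline are assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def restore_punctuation(transcript: str) -> str:
    # Ask the LLM to re-punctuate (and implicitly fix encoding/OCR issues)
    # without changing the words themselves.
    prompt = (
        "Here is a raw French transcription without punctuation, possibly "
        "containing encoding or OCR errors. Rewrite it with correct "
        "punctuation, without changing the words:\n\n" + transcript
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```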

Regarding numerical representations, we opted not to convert written numbers to digits (e.g., "dix-sept" to "17"). This decision stemmed from observing that many numbers were poorly transcribed: for instance, "17" might be transcribed as "dix sept" without a hyphen, which a converter would then turn into "10 7". Instead, we relied on the prevalence of digit-based numbers in the scraped dataset to encourage the model to prefer that representation.

In the final step, we passed all audio files from the "classic datasets" through an audio degradation pipeline. This pipeline applied various compression codecs and introduced issues like packet loss, simulating conditions frequently encountered in real call-center environments. The goal was to improve the model's ability to understand and process this type of audio, and thereby its performance in real-world scenarios. A sketch of such a pipeline is shown below.
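
The exact pipeline is not published; a minimal numpy sketch of two such degradations, where an 8 kHz round-trip stands in for telephone-codec band limitation and zeroed 20 ms frames simulate packet loss (all rates and probabilities are arbitrary):

```python
# Illustrative only: the real codecs and parameters are not public.
import numpy as np
import scipy.signal

def degrade(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Apply call-center-style degradations to a mono float32 waveform."""
    # Band-limit by round-tripping through 8 kHz, mimicking telephone codecs.
    low = scipy.signal.resample_poly(audio, up=1, down=2)  # 16 kHz -> 8 kHz
    audio = scipy.signal.resample_poly(low, up=2, down=1)  # back to 16 kHz
    # Simulate packet loss: zero out random 20 ms frames.
    frame = int(0.02 * sr)
    n_frames = len(audio) // frame
    lost = np.random.rand(n_frames) < 0.05  # 5% loss rate, arbitrary
    for i in np.flatnonzero(lost):
        audio[i * frame:(i + 1) * frame] = 0.0
    return audio
```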

## Training

We trained for 2 epochs with an effective batch size of 256, a maximum learning rate of 1e-5, and a linear learning-rate schedule with 500 warmup steps.
The full dataset being prohibitively large, we used [mosaicml streaming](https://docs.mosaicml.com/projects/streaming/en/stable/) to stream dataset samples and to allow instant mid-epoch resumption. A sketch of this setup follows.
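
The training script itself is not published; a minimal sketch of the stated settings, assuming the Hugging Face `Seq2SeqTrainingArguments` API and mosaicml's `StreamingDataset` (the paths and the per-device/accumulation split of the batch size are placeholders):

```python
# Sketch only: paths are placeholders; hyperparameters match the ones above.
from streaming import StreamingDataset
from transformers import Seq2SeqTrainingArguments

# Streamed dataset with deterministic, resumable sample order.
train_dataset = StreamingDataset(
    remote="s3://bucket/french-asr-mds",  # placeholder remote path
    local="/tmp/french-asr-cache",
    shuffle=True,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-french",
    num_train_epochs=2,
    per_device_train_batch_size=32,  # 32 x 8 accumulation = 256 effective
    gradient_accumulation_steps=8,   # adjust to the actual number of GPUs
    learning_rate=1e-5,              # maximum learning rate
    lr_scheduler_type="linear",
    warmup_steps=500,
)
```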

# Performance

French ASR lacked a publicly available dataset of real call-center conditions, akin to the Switchboard dataset in English.
To address this gap, we filtered and cleaned the [Accueil_UBS dataset sourced from Ortolang](https://huggingface.co/datasets/BrunoHays/UBS/tree/main). This makes it possible to evaluate ASR models under conditions similar to those encountered in call-center environments.

# Inference

We offer the model in various formats to ensure compatibility with the most widely used inference frameworks.
Note that the model has not been finetuned with timestamps, so it cannot accurately predict timestamps on its own.
However, word-level timestamps derived from the cross-attention weights remain available, and they are both more precise and cheaper to compute.
In most frameworks, enabling them amounts to passing parameters such as `without_timestamps=True` and `word_timestamps=True`.

The model can still be conditioned on previous text during inference, but its performance in this setting has not been quantitatively evaluated, and this option is known to raise the risk of hallucination with the base OpenAI model. We therefore advise disabling it (`condition_on_previous_text=False`).

## Examples

transformers:

```python
TODO
```
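
In the meantime, a minimal sketch using the standard transformers ASR pipeline; the model path below is a placeholder for this repository's id:

```python
# Sketch only: replace the placeholder path with this repository's id.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="path/to/this/repository",  # placeholder
    chunk_length_s=30,  # chunked decoding for long-form audio
)
result = asr(
    "long_audio.wav",
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```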

openai-whisper:
```python
import whisper

# Load the converted openai-whisper checkpoint
whisper_model = whisper.load_model("converted_models/openai/whisper-large-small-yt-os-V2")
result = whisper_model.transcribe(
    "long_audio.wav",
    temperature=0,
    condition_on_previous_text=False,  # lowers the risk of hallucination
    language="french",
    without_timestamps=True,
    word_timestamps=True,  # word-level timestamps via cross-attention
)
```

faster-whisper:
```python
from faster_whisper import WhisperModel

# Load the CTranslate2 conversion of the model
model = WhisperModel(
    "converted_models/ctranslate2/whisper-large-small-yt-os-V2-fp32",
    device="cpu",
    compute_type="float32",
)

segments, info = model.transcribe(
    "long_audio.wav",
    without_timestamps=True,
    word_timestamps=True,  # word-level timestamps via cross-attention
    temperature=0,
    condition_on_previous_text=False,  # lowers the risk of hallucination
    task="transcribe",
    language="fr",
)
```

Whisper.cpp works out of the box:

```bash
# -mc 0 disables conditioning on previous text
./main -f long_audio.wav -l fr -mc 0 -m ../converted_models/cpp/ggml-model.bin
```

# TODO: Insert performance table + links for the converted models