BrunoHays committed
Commit
371dc7c
1 Parent(s): 7ef734b

Create README.md

Files changed (1): README.md (added, +109 lines)
---
license: cc-by-4.0
language:
- fr
pipeline_tag: automatic-speech-recognition
---
# Whisper-Large-V3-Illuin-French

This model is a finetuned variant of OpenAI's [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model.
It has been finetuned on a dataset of more than 18 000 hours of French speech.

The model has been converted to, and tested in, several other formats so that it can be used with the most popular inference frameworks:
- transformers
- openai-whisper
- faster-whisper
- whisper.cpp

# Training details

## Dataset composition

The dataset is a compilation of various popular French ASR (Automatic Speech Recognition) datasets, including:

- CommonVoice 13 French
- LibriSpeech French
- African accented French
- TEDx French
- VoxPopuli French
- Fleurs French

These sources amount to a little over 2 500 hours of speech data.
Additionally, the dataset includes transcribed French speech scraped from the internet.
In total, it exceeds 18 000 hours of speech data, which makes it one of the largest French ASR datasets assembled to date.

## Dataset processing

The scraped dataset contained many bogus transcriptions. To filter them out, we took inspiration from the [original Whisper paper](https://cdn.openai.com/papers/whisper.pdf) and removed samples matching any of the following conditions:
- samples containing no punctuation (probably automatically generated)
- samples where either the audio or the transcription was not in French
- samples whose WER against an openai whisper-medium transcription was very high

As a result, we removed more than half of the content and obtained a French ASR dataset of roughly 16 000 hours. A sketch of these filters is given below.
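
The exact filtering code is not published; here is a minimal sketch of the three filters, assuming language tags from an upstream language-ID model, a precomputed whisper-medium transcript per sample, and an arbitrary WER threshold:

```python
# Illustrative sketch only: helper inputs and the 0.5 threshold are assumptions.
import jiwer  # pip install jiwer

PUNCTUATION = set(".,;:!?")

def keep_sample(transcript: str, transcript_lang: str, audio_lang: str,
                whisper_medium_transcript: str, max_wer: float = 0.5) -> bool:
    """Return True if the sample passes all three cleaning filters."""
    # 1. Drop transcripts with no punctuation (probably machine-generated).
    if not any(char in PUNCTUATION for char in transcript):
        return False
    # 2. Drop samples where the audio or the transcription is not French.
    if transcript_lang != "fr" or audio_lang != "fr":
        return False
    # 3. Drop samples that disagree too much with whisper-medium's output.
    if jiwer.wer(transcript, whisper_medium_transcript) > max_wer:
        return False
    return True
```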

To compile the "classic datasets", extensive filtering wasn't necessary thanks to their cleaner nature. Our main task was adding punctuation to datasets lacking it, such as VoxPopuli. To achieve this, we used the Mixtral-8x7B model to generate punctuated annotations; a sketch of this step follows. VoxPopuli also presented some encoding and OCR issues, which the LLM fixed in the same pass.
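
The exact prompt is not published; a minimal sketch of such a punctuation-restoration call, assuming the hosted Inference API and the public instruct variant of Mixtral (the prompt wording is illustrative):

```python
# Sketch only: the prompt and client used in the real pipeline are assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def restore_punctuation(transcript: str) -> str:
    # Ask the LLM to re-punctuate (and implicitly fix encoding/OCR issues)
    # without changing the words themselves.
    prompt = (
        "Here is a raw French transcription without punctuation, possibly "
        "containing encoding or OCR errors. Rewrite it with correct "
        "punctuation, without changing the words:\n\n" + transcript
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content
```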

Regarding numerical representations, we opted not to convert written numbers to digits (e.g., "dix-sept" to "17"). This decision stemmed from observing that many numbers were poorly transcribed: for instance, "17" might be transcribed as "dix sept" without a hyphen, which a converter would then turn into "10 7". Instead, we relied on the prevalence of digit-based numbers in the scraped dataset to encourage the model to prefer that representation.

In the final step, we passed all audio files from the "classic datasets" through an audio degradation pipeline. This pipeline applied various compression codecs and introduced issues like packet loss, simulating conditions frequently encountered in real call-center environments. The goal was to improve the model's ability to understand and process this type of audio, and thereby its performance in real-world scenarios. A sketch of such a pipeline is shown below.
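
The exact pipeline is not published; a minimal numpy sketch of two such degradations, where an 8 kHz round-trip stands in for telephone-codec band limitation and zeroed 20 ms frames simulate packet loss (all rates and probabilities are arbitrary):

```python
# Illustrative only: the real codecs and parameters are not public.
import numpy as np
import scipy.signal

def degrade(audio: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Apply call-center-style degradations to a mono float32 waveform."""
    # Band-limit by round-tripping through 8 kHz, mimicking telephone codecs.
    low = scipy.signal.resample_poly(audio, up=1, down=2)  # 16 kHz -> 8 kHz
    audio = scipy.signal.resample_poly(low, up=2, down=1)  # back to 16 kHz
    # Simulate packet loss: zero out random 20 ms frames.
    frame = int(0.02 * sr)
    n_frames = len(audio) // frame
    lost = np.random.rand(n_frames) < 0.05  # 5% loss rate, arbitrary
    for i in np.flatnonzero(lost):
        audio[i * frame:(i + 1) * frame] = 0.0
    return audio
```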

## Training

We trained for 2 epochs with an effective batch size of 256, a maximum learning rate of 1e-5, and a linear learning-rate schedule with 500 warmup steps.
The full dataset being prohibitively large, we used [mosaicml streaming](https://docs.mosaicml.com/projects/streaming/en/stable/) to stream dataset samples and to allow instant mid-epoch resumption. A sketch of this setup follows.
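
The training script itself is not published; a minimal sketch of the stated settings, assuming the Hugging Face `Seq2SeqTrainingArguments` API and mosaicml's `StreamingDataset` (the paths and the per-device/accumulation split of the batch size are placeholders):

```python
# Sketch only: paths are placeholders; hyperparameters match the ones above.
from streaming import StreamingDataset
from transformers import Seq2SeqTrainingArguments

# Streamed dataset with deterministic, resumable sample order.
train_dataset = StreamingDataset(
    remote="s3://bucket/french-asr-mds",  # placeholder remote path
    local="/tmp/french-asr-cache",
    shuffle=True,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-french",
    num_train_epochs=2,
    per_device_train_batch_size=32,  # 32 x 8 accumulation = 256 effective
    gradient_accumulation_steps=8,   # adjust to the actual number of GPUs
    learning_rate=1e-5,              # maximum learning rate
    lr_scheduler_type="linear",
    warmup_steps=500,
)
```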

# Performance

French ASR lacked a publicly available dataset of real call-center conditions, akin to the Switchboard dataset in English.
To address this gap, we filtered and cleaned the [Accueil_UBS dataset sourced from Ortolang](https://huggingface.co/datasets/BrunoHays/UBS/tree/main). This makes it possible to evaluate ASR models under conditions similar to those encountered in call-center environments.

# Inference

We offer the model in various formats to ensure compatibility with the most widely used inference frameworks.
Note that the model has not been finetuned with timestamps, so it cannot accurately predict timestamps on its own.
However, word-level timestamps derived from the cross-attention weights remain available, and they are both more precise and cheaper to compute.
In most frameworks, enabling them amounts to passing parameters such as `without_timestamps=True` and `word_timestamps=True`.

The model can still be conditioned on previous text during inference, but its performance in this setting has not been quantitatively evaluated, and this option is known to raise the risk of hallucination with the base OpenAI model. We therefore advise disabling it (`condition_on_previous_text=False`).

## Examples

transformers:

```python
TODO
```
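
In the meantime, a minimal sketch using the standard transformers ASR pipeline; the model path below is a placeholder for this repository's id:

```python
# Sketch only: replace the placeholder path with this repository's id.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="path/to/this/repository",  # placeholder
    chunk_length_s=30,  # chunked decoding for long-form audio
)
result = asr(
    "long_audio.wav",
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```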

openai-whisper:
```python
import whisper

# Load the converted openai-whisper checkpoint
whisper_model = whisper.load_model("converted_models/openai/whisper-large-small-yt-os-V2")
result = whisper_model.transcribe(
    "long_audio.wav",
    temperature=0,
    condition_on_previous_text=False,  # lowers the risk of hallucination
    language="french",
    without_timestamps=True,
    word_timestamps=True,  # word-level timestamps via cross-attention
)
```

faster-whisper:
```python
from faster_whisper import WhisperModel

# Load the CTranslate2 conversion of the model
model = WhisperModel(
    "converted_models/ctranslate2/whisper-large-small-yt-os-V2-fp32",
    device="cpu",
    compute_type="float32",
)

segments, info = model.transcribe(
    "long_audio.wav",
    without_timestamps=True,
    word_timestamps=True,  # word-level timestamps via cross-attention
    temperature=0,
    condition_on_previous_text=False,  # lowers the risk of hallucination
    task="transcribe",
    language="fr",
)
```

Whisper.cpp works out of the box:

```bash
# -mc 0 disables conditioning on previous text
./main -f long_audio.wav -l fr -mc 0 -m ../converted_models/cpp/ggml-model.bin
```

# TODO: Insert performance table + links for the converted models