|
--- |
|
license: cc-by-nc-sa-4.0 |
|
base_model: utter-project/mHuBERT-147 |
|
datasets: |
|
- FBK-MT/Speech-MASSIVE |
|
- FBK-MT/Speech-MASSIVE-test |
|
- mozilla-foundation/common_voice_17_0 |
|
- google/fleurs |
|
language: |
|
- fr |
|
metrics: |
|
- wer |
|
- cer |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
**This is a small CTC-based Automatic Speech Recognition system for French.** |
|
|
|
This model is part of our SLU demo available here: https://huggingface.co/spaces/naver/French-SLU-DEMO-Interspeech2024 |
|
|
|
Please check our blog post, available at: TBD
|
|
|
* Training data: 123 hours (84,707 utterances) |
|
* Normalization: Whisper normalization (see the sketch below)
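
A minimal sketch of the normalization step (assuming the multilingual `BasicTextNormalizer` from the `openai-whisper` package; the exact normalizer configuration is an assumption):

```python
# Hedged sketch: assumes the multilingual BasicTextNormalizer from the
# openai-whisper package (pip install openai-whisper).
from whisper.normalizers import BasicTextNormalizer

normalizer = BasicTextNormalizer()
# Lowercases the text and strips punctuation/symbols
print(normalizer("Bonjour, tout le monde !"))
```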
|
|
|
# Table of Contents: |
|
1. [Performance](https://huggingface.co/naver/mHuBERT-147-ASR-fr#performance) |
|
2. [Training Parameters](https://huggingface.co/naver/mHuBERT-147-ASR-fr#training-parameters) |
|
3. [ASR Model class](https://huggingface.co/naver/mHuBERT-147-ASR-fr#asr-model-class) |
|
4. [Running inference](https://huggingface.co/naver/mHuBERT-147-ASR-fr#running-inference) |
|
|
|
## Performance |
|
|
|
| **Dataset**         | **dev WER** | **dev CER** | **test WER** | **test CER** |
|:-------------------:|:-----------:|:-----------:|:------------:|:------------:|
| **Speech-MASSIVE**  | 9.2         | 2.6         | 9.6          | 2.9          |
| **FLEURS-102**      | 20.0        | 7.0         | 22.0         | 7.7          |
| **Common Voice 17** | 16.0        | 4.9         | 19.0         | 6.5          |
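
The WER and CER above can be computed with standard tooling. Here is a minimal sketch using the Hugging Face `evaluate` library (the exact scoring script and the normalization applied to hypotheses and references are assumptions):

```python
# Hedged sketch: scores WER/CER with the `evaluate` library (requires jiwer);
# the exact scoring pipeline behind the table above may differ.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Toy, already-normalized prediction/reference pair
predictions = ["bonjour tout le monde"]
references = ["bonjour à tout le monde"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```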
|
|
|
## Training Parameters |
|
|
|
This is a [mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147) ASR fine-tuned model. |
|
The training parameters are available in [config.json](https://huggingface.co/naver/mHuBERT-147-ASR-fr/blob/main/config.json). |
|
We highlight the use of 0.3 for `hubert.final_dropout`, which we found to be very helpful for convergence. We also use fp32 training, as we found fp16 training to be unstable.
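
For illustration, here is how these two settings could be passed when fine-tuning with the stock `HubertForCTC` class. This is a minimal sketch with placeholder values (e.g. the vocabulary size), not our exact training recipe; the actual values are in [config.json](https://huggingface.co/naver/mHuBERT-147-ASR-fr/blob/main/config.json).

```python
# Illustrative sketch only: placeholder arguments, not the exact training setup.
from transformers import HubertForCTC, TrainingArguments

model = HubertForCTC.from_pretrained(
    "utter-project/mHuBERT-147",
    final_dropout=0.3,          # higher final dropout helped convergence
    vocab_size=43,              # placeholder: size of the CTC character vocabulary
    ctc_loss_reduction="mean",  # placeholder choice
)

training_args = TrainingArguments(
    output_dir="mhubert-147-asr-fr",
    fp16=False,                 # train in fp32; fp16 was unstable
)
```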
|
|
|
## ASR Model Class |
|
|
|
We use the mHubertForCTC class for our model, which is nearly identical to the existing HubertForCTC class. |
|
The key difference is that we've added a few additional hidden layers at the end of the Transformer stack, just before the lm_head. |
|
The code is available in [CTC_model.py](https://huggingface.co/naver/mHuBERT-147-ASR-fr/blob/main/inference_code/CTC_model.py). |
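
For intuition, here is a conceptual sketch of that idea. The number and size of the extra layers below are placeholders; please use the real class from `CTC_model.py` to load the checkpoint.

```python
# Conceptual sketch only: the actual implementation lives in
# inference_code/CTC_model.py and should be used for loading the model.
import torch.nn as nn
from transformers import HubertForCTC

class SketchMHubertForCTC(HubertForCTC):
    def __init__(self, config):
        super().__init__(config)
        # Hypothetical extra hidden layers placed between the Transformer
        # encoder output and the CTC lm_head (count and sizes are assumptions).
        self.intermediate_ffn = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.GELU(),
            nn.Dropout(config.final_dropout),
        )
        # In the actual class, forward() runs the encoder hidden states through
        # these layers before projecting them with self.lm_head.
```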
|
|
|
## Running Inference |
|
|
|
The [run_inference.py](https://huggingface.co/naver/mHuBERT-147-ASR-fr/blob/main/inference_code/run_inference.py) file illustrates how to load the model for inference (**load_asr_model**) and how to produce a transcription for an audio file (**run_asr_inference**).
|
Please follow the [requirements file](https://huggingface.co/naver/mHuBERT-147-ASR-fr/blob/main/requirements.txt) to avoid incorrect model loading. |
|
|
|
Here is a simple example of the inference loop. Please note that the sampling rate of the input audio must be 16,000 Hz (16 kHz).
|
|
|
```python
from inference_code.run_inference import load_asr_model, run_asr_inference

# Load the fine-tuned model and its processor
model, processor = load_asr_model()

# Transcribe a 16 kHz audio file
prediction = run_asr_inference(model, processor, your_audio_file)
```
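
If your recording is not already sampled at 16 kHz, you can resample it beforehand. Below is a small sketch using `librosa` and `soundfile`; the file names are placeholders, and we assume **run_asr_inference** takes a path to an audio file as in the example above.

```python
# Hypothetical pre-processing step: resample an arbitrary recording to the
# required 16 kHz before transcription. File names are placeholders.
import librosa
import soundfile as sf

audio, _ = librosa.load("recording.mp3", sr=16_000)  # decode and resample to 16 kHz
sf.write("recording_16k.wav", audio, 16_000)         # save as a 16 kHz WAV file

prediction = run_asr_inference(model, processor, "recording_16k.wav")
```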