license: cc-by-nc-sa-4.0
base_model: utter-project/mHuBERT-147
datasets:
- FBK-MT/Speech-MASSIVE
- FBK-MT/Speech-MASSIVE-test
- mozilla-foundation/common_voice_17_0
- google/fleurs
language:
- fr
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
This is a small CTC-based Automatic Speech Recognition system for French.
This model is part of our SLU demo available here: https://huggingface.co/spaces/naver/French-SLU-DEMO-Interspeech2024
Please check our blog post available at: TBD
- Training data: 123 hours (84,707 utterances)
- Normalization: Whisper normalization
Performance
| Dataset | dev WER | dev CER | test WER | test CER |
|---|---|---|---|---|
| speechMASSIVE | 9.2 | 2.6 | 9.6 | 2.9 |
| fleurs102 | 20.0 | 7.0 | 22.0 | 7.7 |
| CommonVoice 17 | 16.0 | 4.9 | 19.0 | 6.5 |
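For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how WER and CER can be computed for a single reference/hypothesis pair. It is not the evaluation script behind the reported numbers: the function names are illustrative, the inputs are assumed to be normalized already, and the scaling to percentages is an assumption about how such scores are usually reported.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for one utterance."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for one utterance."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


print(round(wer("le chat dort", "le chien dort"), 1))  # 33.3 (1 wrong word out of 3)
print(cer("chat", "chot"))                             # 25.0 (1 wrong character out of 4)
```

Benchmark scores are normally aggregated over a whole test set (total errors divided by total reference length) rather than averaged per utterance, so per-utterance values like these are only a starting point.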
Training Parameters
This model is mHuBERT-147 fine-tuned for ASR. The training parameters are available in config.json. We highlight the value of 0.3 for hubert.final_dropout, which we found to help convergence considerably. We also train in fp32, as we found fp16 training to be unstable.
ASR Model Class
We use the mHubertForCTC class for our model, which is nearly identical to the existing HubertForCTC class. The key difference is that we added a few hidden layers at the end of the Transformer stack, just before the lm_head. The code is available in CTC_model.py.
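As background on what happens after the lm_head in a CTC model, the sketch below shows greedy CTC decoding in plain Python: take the best-scoring token for each frame, merge consecutive repeats, then drop the blank token. The toy vocabulary, the blank id of 0, and the function name are assumptions for illustration; the actual model relies on its own processor and tokenizer for this step.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Pick the best token per frame, merge consecutive repeats, drop blanks."""
    best_ids = [max(range(len(scores)), key=scores.__getitem__)
                for scores in frame_scores]
    decoded = []
    previous = None
    for token_id in best_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


def one_hot(index, size):
    """Toy stand-in for a frame of lm_head scores."""
    return [1.0 if k == index else 0.0 for k in range(size)]


# Hypothetical character vocabulary; index 0 plays the role of the CTC blank.
VOCAB = ["<blank>", "a", "l", "o"]

# Frame-level best path: a, l, l, <blank>, l, o, o
frames = [one_hot(i, len(VOCAB)) for i in [1, 2, 2, 0, 2, 3, 3]]
print(ctc_greedy_decode(frames, VOCAB))  # allo
```

The blank between the two runs of "l" is what lets CTC output a genuinely doubled letter: without it, the repeated frames would collapse into a single "l".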
Running Inference
The run_inference.py file illustrates how to load the model for inference (load_asr_model) and how to produce a transcription for a file (run_asr_inference). Please install the dependencies listed in the requirements file to avoid incorrect model loading.
Here is a simple example of the inference loop. Note that the audio must be sampled at 16,000 Hz.
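from inference_code.run_inference import load_asr_model, run_asr_inference

# Load the fine-tuned model and its processor.
model, processor = load_asr_model()
# your_audio_file: the audio file to transcribe, sampled at 16,000 Hz.
prediction = run_asr_inference(model, processor, your_audio_file)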
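Because the model expects 16,000 Hz audio, it can help to check a file before transcribing it. The following sketch uses only Python's standard wave module and therefore covers uncompressed PCM WAV files only; the helper names and the throwaway demo files are illustrative and are not part of the repository's inference code.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, the rate required for inference


def wav_sample_rate(path):
    """Read the sampling rate (Hz) from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def write_silent_wav(path, sample_rate, num_frames=160):
    """Demo helper: write a short mono 16-bit WAV file filled with silence."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)


# Demo on two throwaway files: one at 16 kHz, one at 44.1 kHz.
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "good.wav")
bad_path = os.path.join(demo_dir, "bad.wav")
write_silent_wav(good_path, 16_000)
write_silent_wav(bad_path, 44_100)

good_ok = wav_sample_rate(good_path) == EXPECTED_SAMPLE_RATE
bad_ok = wav_sample_rate(bad_path) == EXPECTED_SAMPLE_RATE
print(good_ok, bad_ok)  # True False
```

A file that fails this check should be resampled to 16,000 Hz with an audio library before it is transcribed.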