naver/mHuBERT-147-ASR-fr


license: cc-by-nc-sa-4.0
base_model: utter-project/mHuBERT-147
datasets:
  - FBK-MT/Speech-MASSIVE
  - FBK-MT/Speech-MASSIVE-test
  - mozilla-foundation/common_voice_17_0
  - google/fleurs
language:
  - fr
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition

This is a small CTC-based Automatic Speech Recognition system for French.

This model is part of our SLU demo available here: https://huggingface.co/spaces/naver/French-SLU-DEMO-Interspeech2024

Please check our blog post available at: TBD

Training data: 123 hours (84,707 utterances)
Normalization: Whisper normalization

Performance

| Dataset | dev WER (%) | dev CER (%) | test WER (%) | test CER (%) |
|---|---|---|---|---|
| speechMASSIVE | 9.2 | 2.6 | 9.6 | 2.9 |
| fleurs102 | 20.0 | 7.0 | 22.0 | 7.7 |
| CommonVoice 17 | 16.0 | 4.9 | 19.0 | 6.5 |
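
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted over words or characters respectively. This is not the scoring script used for the table above. The function names are illustrative, and the sketch skips the Whisper normalization step applied to the reported results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (this sketch counts spaces as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of three, one substituted character out of twelve.
print(round(wer("le chat dort", "le chat sort"), 2))  # 33.33
print(round(cer("le chat dort", "le chat sort"), 2))  # 8.33
```

Because WER counts a word as wrong even when a single character differs, it is normally higher than CER on the same data, which matches the pattern in the table.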

Training Parameters

This model is mHuBERT-147 fine-tuned for ASR. The training parameters are available in config.json. Notably, we set hubert.final_dropout to 0.3, which we found very helpful for convergence. We also train in fp32, as we found fp16 training to be unstable.
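
To inspect these settings locally, something like the following sketch could be used. It searches a parsed config.json for a key by name rather than assuming where the key sits, since the exact nesting of final_dropout in the file is not reproduced here. The helper names `find_key` and `load_config` are hypothetical, and the example dictionary is a made-up stand-in, not the contents of the actual file.

```python
import json


def find_key(node, key):
    """Return every value stored under `key` anywhere in a parsed JSON structure."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))  # keep searching nested values
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


def load_config(path="config.json"):
    """Parse a JSON config file from disk (the path is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up stand-in for a parsed config; only the 0.3 value comes from the text above.
example = json.loads('{"model_type": "hubert", "final_dropout": 0.3}')
print(find_key(example, "final_dropout"))  # [0.3]
```

With the repository checked out, `find_key(load_config("config.json"), "final_dropout")` would list the value or values actually stored in the file.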

ASR Model Class

We use the mHubertForCTC class for our model, which is nearly identical to the existing HubertForCTC class. The key difference is that we add a few hidden layers at the end of the Transformer stack, just before the lm_head. The code is available in CTC_model.py.
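
As background on what happens downstream of a CTC head such as lm_head: at each audio frame the model scores every vocabulary entry plus a special blank symbol. The simplest (greedy) decoding keeps the best-scoring entry per frame, merges consecutive repeats, and drops blanks. The sketch below shows that collapse step in plain Python. The toy vocabulary, the blank index, and the function name are assumptions for illustration; it does not model the extra hidden layers, and how run_inference.py actually decodes is not described here.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Index of the highest-scoring vocabulary entry for each frame.
    best_ids = [max(range(len(scores)), key=scores.__getitem__)
                for scores in frame_scores]
    decoded = []
    previous = None
    for idx in best_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            decoded.append(vocab[idx])
        previous = idx
    return "".join(decoded)


# Toy vocabulary and per-frame scores, invented for this example.
vocab = ["<blank>", "o", "u", "i"]
frames = [
    [0.1, 0.7, 0.1, 0.1],    # "o"
    [0.1, 0.6, 0.2, 0.1],    # "o" again, merged with the previous frame
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # "u"
    [0.1, 0.1, 0.1, 0.7],    # "i"
]
print(ctc_greedy_decode(frames, vocab))  # oui
```

The blank symbol is what lets CTC output genuine double letters: two identical symbols separated by a blank frame are both kept, while two adjacent identical frames are merged into one.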

Running Inference

The run_inference.py file shows how to load the model for inference (load_asr_model) and how to produce a transcription for a file (run_asr_inference). Please follow the requirements file to avoid incorrect model loading.

Here is a simple example of the inference loop. Note that the audio sampling rate must be 16,000 Hz.

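```python
from inference_code.run_inference import load_asr_model, run_asr_inference

# Load the fine-tuned model and its processor
model, processor = load_asr_model()

# your_audio_file: the audio file to transcribe (16,000 Hz sampling rate)
prediction = run_asr_inference(model, processor, your_audio_file)
```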
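
Since the model expects 16,000 Hz audio, it can be worth checking a file before transcribing it. The sketch below does this for uncompressed PCM WAV files using Python's standard-library wave module. The helper name is hypothetical, other audio formats would need a different reader, and resampling is out of scope.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, as required by the model


def has_expected_sample_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the PCM WAV file at `path` is sampled at `expected` Hz."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate() == expected
```

A file that fails this check should be resampled to 16,000 Hz before being passed to run_asr_inference.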