---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---
# Wav2Vec2_XLS-R-300m_Nepali_ASR
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
## Model description
This model is Wav2Vec2 XLS-R (300 million parameters) fine-tuned for Nepali Automatic Speech Recognition. The reported results are on the OpenSLR test split.
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
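For quick transcription, the model can be loaded with the `transformers` ASR pipeline. A minimal sketch is below; the repository ID and audio file name are placeholders, and the audio is assumed to be mono speech (the pipeline resamples to the model's 16 kHz rate when given a file path).

```python
from transformers import pipeline

# Placeholder repository ID; substitute the actual model ID on the Hub.
MODEL_ID = "your-username/Wav2Vec2_XLS-R-300m_Nepali_ASR"

asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

# Transcribe a short (<= 5 s) Nepali clip; the file name is illustrative.
result = asr("sample_nepali_utterance.wav")
print(result["text"])
```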
## Intended uses & limitations
- Research on Nepali ASR
- Transcriptions on Nepali audio
- Further Fine-tuning
Limitations:
- The model is trained on the OpenSLR Nepali ASR dataset, which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals have been filtered out as well.
- The vocabulary does not contain all Nepali characters.
- The model might perform poorly on audio segments longer than 5 seconds; such audio needs to be segmented into roughly 5-second chunks (see the sketch after this list), which can increase processing time.
- It may struggle with background noise and overlapping speech.
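As a workaround for the 5-second limitation, longer recordings can be split into fixed-length windows and transcribed chunk by chunk. The sketch below is one naive way to do this, assuming `librosa` is installed and using the placeholder model ID from above; a silence- or VAD-based splitter would cut fewer words in half.

```python
import librosa
from transformers import pipeline

MODEL_ID = "your-username/Wav2Vec2_XLS-R-300m_Nepali_ASR"  # placeholder ID
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 5

asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

# Load and resample to the 16 kHz mono format the model was trained on.
speech, _ = librosa.load("long_nepali_recording.wav", sr=SAMPLE_RATE, mono=True)

# Naive fixed-length segmentation into 5-second windows.
chunk_len = SAMPLE_RATE * CHUNK_SECONDS
chunks = [speech[i:i + chunk_len] for i in range(0, len(speech), chunk_len)]

# Transcribe each window and join the pieces.
transcript = " ".join(
    asr({"raw": chunk, "sampling_rate": SAMPLE_RATE})["text"] for chunk in chunks
)
print(transcript)
```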
## Training and evaluation data
### Common Voice v17.0
- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.
- Initially, the model was trained on Common Voice v17.0 ne-NP, which consists of about 2 hours of voice data, of which about 1 hour has been manually validated.
- We first combined the `validated` and `other` splits since the dataset is very small, giving a total of 1337 utterances.
- We preprocessed the data by removing all punctuation and symbols (a rough sketch follows this list).
- We then used 80% of the total utterances for training and 10% for evaluation.
- We used the `test` split, consisting of 217 utterances, for testing. (Some of these utterances might also be present in the `train` split.)
- The model was trained for 30 epochs; the WER plateaued, fluctuating between 37% and 39%.
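The punctuation and symbol cleanup might look roughly like the following; the exact character set that was stripped is not documented, so the regex below is an assumption (the `sentence` column is Common Voice's transcript field).

```python
import re

# Assumed set of punctuation/symbols to strip; the card does not list the exact characters.
PUNCTUATION = re.compile(r"[,?.!\-;:\"'()\[\]{}।“”‘’…]")

def normalize_transcript(batch):
    # Drop punctuation and collapse repeated whitespace in the transcript.
    text = PUNCTUATION.sub("", batch["sentence"])
    batch["sentence"] = re.sub(r"\s+", " ", text).strip()
    return batch

# common_voice = common_voice.map(normalize_transcript)
```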
### OpenSLR Nepali ASR training data
- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has 157,000 utterances.
- First, numerals were removed, as those utterances were inconsistent with their transcriptions.
- Segments longer than 5 seconds were removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- Finally, we ended up with 136,083 utterances in the whole dataset. The preprocessed dataset has been uploaded [here](https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing (a rough reconstruction of the filtering and split is sketched below).
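A rough reconstruction of the filtering and the 80/10/10 split, using the `datasets` library, is sketched below. The column names (`audio`, `transcription`), the choice to drop whole utterances containing digits, and the split seed are all assumptions; the dataset uploaded to the Hub already reflects this filtering.

```python
from datasets import load_dataset, Audio

# Preprocessed corpus referenced in the card metadata.
ds = load_dataset("iamTangsang/OpenSLR54-Nepali-ASR", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

MAX_SECONDS = 5
DIGITS = "0123456789" + "०१२३४५६७८९"  # ASCII plus Devanagari numerals

def keep_example(example):
    # Drop utterances longer than 5 s and any whose transcript contains numerals.
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    has_digit = any(ch in example["transcription"] for ch in DIGITS)
    return duration <= MAX_SECONDS and not has_digit

ds = ds.filter(keep_example)

# 80% train, 10% evaluation, 10% test.
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, eval_ds, test_ds = split["train"], heldout["train"], heldout["test"]
```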
## Training procedure
### Training on Common Voice 17.0
The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
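Expressed as `TrainingArguments`, the Common Voice stage would look roughly like this (a sketch; `output_dir` is a placeholder, and the Adam betas/epsilon listed above are the optimizer defaults):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-nepali-cv17",  # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size of 32
    warmup_steps=400,
    lr_scheduler_type="linear",
    num_train_epochs=30,
    fp16=True,                       # Native AMP mixed precision
    seed=42,
)
```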
### Initial training on OpenSLR-54 for 16 epochs
The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP
### Further training on OpenSLR-54 for 3 more epochs
We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
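The final low-learning-rate stage amounts to reloading the checkpoint from the 16-epoch OpenSLR run and continuing training with the values above. A sketch, with a placeholder checkpoint path and the `Trainer` wiring (datasets, CTC data collator) elided:

```python
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "wav2vec2-xls-r-300m-nepali-openslr/checkpoint-stage1"  # placeholder

model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)
processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-nepali-openslr-stage2",  # placeholder
    learning_rate=2e-5,              # much lower rate for the final epochs
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    warmup_steps=700,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    fp16=True,
    seed=42,
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_ds,
#                   eval_dataset=eval_ds, data_collator=data_collator,
#                   tokenizer=processor)
# trainer.train()
```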
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1