---
library_name: transformers
base_model:
  - facebook/wav2vec2-xls-r-300m
tags:
  - ASR
  - Nepali ASR
  - OpenSLR Nepali
  - Nepali ASR Wav2Vec2
  - XLS-R
datasets:
  - iamTangsang/OpenSLR54-Nepali-ASR
  - mozilla-foundation/common_voice_17_0
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
  - ne
---

Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OpenSLR-54 Nepali ASR dataset and the Common Voice Corpus v17.0 (ne-NP).

Model description

The model is facebook/wav2vec2-xls-r-300m (about 300 million parameters) fine-tuned for Nepali Automatic Speech Recognition. The reported results are on the OpenSLR test split; a minimal inference example follows the results below.

  • WER on OpenSLR: 16.82%
  • CER on OpenSLR: 2.72%
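
A minimal inference sketch (not the authors' own script) is shown below; the Hub repository id and audio file name are assumptions based on the model title, so adjust them to your setup.

```python
# Minimal inference sketch; the repo id is assumed from the model title and
# may differ from the actual Hub id.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR"  # assumed repo id

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

# Load and resample the clip to 16 kHz, the rate expected by XLS-R models.
speech, _ = librosa.load("nepali_sample.wav", sr=16_000)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```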

Intended uses & limitations

  • Research on Nepali ASR
  • Transcription of Nepali audio
  • Further fine-tuning

Limitations:

  • The model is trained on the OpenSLR Nepali ASR dataset, which upon inspection was found to be quite noisy and inconsistent.
  • Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
  • Numerals were filtered out as well.
  • The vocabulary does not contain all Nepali characters.
  • The model may perform poorly on audio segments longer than 5 seconds; such audio needs to be split into roughly 5-second chunks, which adds processing time (see the chunked-inference sketch after this list).
  • It may struggle with background noise and overlapping speech.
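
For clips longer than about 5 seconds, one option is the chunking built into the transformers automatic-speech-recognition pipeline. The sketch below only illustrates that approach; the repository id and file name are placeholders.

```python
# Chunked inference sketch for long audio; repo id and file name are placeholders.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",  # assumed repo id
)

# chunk_length_s splits the audio into ~5 s windows; stride_length_s keeps a
# little overlapping context on each side so words at chunk borders survive.
result = asr("long_nepali_clip.wav", chunk_length_s=5, stride_length_s=(1, 1))
print(result["text"])
```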

Training and evaluation data

Common Voice v17.0

  • This model has been fine-tuned on OpenSLR-54 (the Nepali ASR training dataset) and the Common Voice Corpus v17.0.
  • Initially, the model was trained on the Common Voice v17.0 ne-NP subset, which consists of about 2 hours of voice data, of which roughly 1 hour has been manually validated.
  • Because the dataset is very small, we first combined the validated and other splits, giving a total of 1,337 utterances.
  • We preprocessed the data by removing all punctuation and symbols (a rough cleanup sketch follows this list).
  • We then used 80% of the utterances for training and 10% for evaluation.
  • For testing, we used the test split of 217 utterances. (Some of these may also appear in the training split.)
  • The model was trained for 30 epochs, after which the WER fluctuated between roughly 37% and 39%.
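
The exact cleanup rules are not documented here, so the sketch below only illustrates stripping punctuation and symbols from a transcript; the character set in the regular expression is an assumption, not the authors' actual preprocessing.

```python
# Illustrative transcript cleanup; the punctuation/symbol set is an assumption.
import re

# Common Latin and Devanagari punctuation/symbols, including the danda (।).
PUNCT = re.compile(r"[।॥,\.\?!\-;:\"'“”‘’\(\)\[\]{}%&/\\]")

def clean_transcript(text: str) -> str:
    text = PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("नमस्ते, संसार!"))  # -> नमस्ते संसार
```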

OpenSLR Nepali ASR training data

  • The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has about 157,000 utterances.
  • First, numerals were filtered out, because the corresponding utterances were inconsistent with their transcriptions.
  • Segments longer than 5 seconds were also removed because of resource limitations.
  • Less frequently used characters were removed to reduce the vocabulary size.
  • We ended up with 136,083 utterances in total; the processed dataset has been uploaded as iamTangsang/OpenSLR54-Nepali-ASR.
  • 80% was used for training, 10% for evaluation, and 10% for testing. (A filtering sketch follows this list.)
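
A sketch of the duration and numeral filtering described above, using the datasets library, is given below; the split name and column names ("audio", "transcription") are assumptions and may not match the uploaded dataset exactly.

```python
# Filtering sketch: drop clips > 5 s and transcripts containing numerals.
# Dataset split and column names are assumptions.
import re
from datasets import Audio, load_dataset

ds = load_dataset("iamTangsang/OpenSLR54-Nepali-ASR", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

DIGITS = re.compile(r"[0-9०-९]")  # Latin and Devanagari numerals

def keep(example):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration <= 5.0 and not DIGITS.search(example["transcription"])

filtered = ds.filter(keep)
print(len(ds), "->", len(filtered))
```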

Training procedure

Training on CommonVoice 17.0

The following hyperparameters were used during training:

  • learning_rate: 3e-04
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 400
  • num_epochs: 30
  • mixed_precision_training: Native AMP
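
As a rough reconstruction (not the authors' exact script), the hyperparameters above map onto a transformers TrainingArguments object as follows; output_dir is a placeholder, and the default optimizer (AdamW in transformers) already uses the listed betas and epsilon.

```python
# Rough reconstruction of the hyperparameters above; output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-nepali",  # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective train batch size of 32
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=30,
    seed=42,
    fp16=True,                       # Native AMP mixed-precision training
)
```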

Initial Training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:

  • learning_rate: 3e-04
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 16
  • mixed_precision_training: Native AMP

Further Training on OpenSLR-54 for 3 more epochs

We used the following:

  • learning_rate: 2e-5
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 700
  • num_epochs: 3
  • mixed_precision_training: Native AMP

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.0
  • Tokenizers 0.19.1