metadata

license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
  - wer
model-index:
  - name: wav2vec2-xlsr-53-ft-ccv-en-cy
    results: []
datasets:
  - techiaith/commonvoice_16_1_en_cy
language:
  - cy
  - en
pipeline_tag: automatic-speech-recognition

wav2vec2-xlsr-53-ft-cy-en-withlm

This model is a version of facebook/wav2vec2-large-xlsr-53 that has been fine-tuned on a custom bilingual dataset derived from the Welsh and English data releases of the Mozilla Foundation's Common Voice project. See techiaith/commonvoice_16_1_en_cy.

In addition, this model includes a single KenLM n-gram language model trained on balanced collections of Welsh and English texts from OSCAR. This avoids the need for language detection to decide whether to use a Welsh or an English n-gram model during CTC decoding.

Usage

The wav2vec2-xlsr-53-ft-cy-en-withlm model can be used directly as follows:

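# Wav2Vec2ProcessorWithLM requires the pyctcdecode and kenlm packages to be installed.
import librosa
import torch

from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm"

# Load the processor (feature extractor, tokenizer and n-gram decoder) and the acoustic model.
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Replace the placeholder with the path to your audio file; librosa resamples it to 16 kHz.
audio, rate = librosa.load("<path/to/audio_file>", sr=16_000)

inputs = processor(audio, sampling_rate=rate, return_tensors="pt", padding=True)

# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Beam-search CTC decoding with the bundled bilingual n-gram language model.
transcription = processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip()
print("Prediction:", transcription)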

Usage with a pipeline is even simpler:

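from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

def transcribe(audio):
    # `audio` may be a local file path or a URL pointing to an audio file.
    return transcriber(audio)["text"]

# Replace the placeholder with a real path or URL.
print(transcribe("<path/or/url/to/any/audiofile>"))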

Evaluation

On a balanced English+Welsh test set derived from Common Voice version 16.1, the WER of techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm is 23.79%.

However, when evaluated on language-specific test sets, the model performs considerably better on Welsh than on English.

Common Voice Test Set Language	WER (%)	CER (%)
EN+CY	23.79	9.68
EN	34.47	14.83
CY	12.34	3.55
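For readers less familiar with the two metrics in the table, the following is a minimal sketch of how a word error rate and a character error rate can be computed for a single utterance: the edit distance between reference and hypothesis, divided by the reference length. The function names and the example sentences are illustrative assumptions, and this is not the evaluation script used for the figures above; evaluation tools usually also aggregate errors over the whole test set rather than scoring one utterance at a time.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for one utterance (illustrative helper)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for one utterance (illustrative helper)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one substituted word out of four.
    print(wer("mae hi'n braf heddiw", "mae hi'n oer heddiw"))
```

Because CER counts character edits, a hypothesis that gets most letters of a word right is penalised less under CER than under WER, which is one reason the CER column is consistently lower than the WER column.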

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0003
train_batch_size: 32
eval_batch_size: 32
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 800
training_steps: 9000
mixed_precision_training: Native AMP
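Two of the values above follow from the others, and a short sketch may make the relationship clearer. The effective batch size is the per-device batch size multiplied by the gradient accumulation steps and the number of devices, and a linear scheduler with warmup ramps the learning rate up to its peak over the warmup steps before decaying it linearly to zero. The single-device count is an assumption (it is the value consistent with 32 × 2 = 64), and the function names are illustrative rather than taken from the training script.

```python
PEAK_LR = 3e-4          # learning_rate
WARMUP_STEPS = 800      # lr_scheduler_warmup_steps
TOTAL_STEPS = 9000      # training_steps
PER_DEVICE_BATCH = 32   # train_batch_size
GRAD_ACCUM = 2          # gradient_accumulation_steps
NUM_DEVICES = 1         # assumption: a single device, consistent with a total of 64


def effective_batch_size(per_device=PER_DEVICE_BATCH, accum=GRAD_ACCUM, devices=NUM_DEVICES):
    """Number of examples contributing to each optimizer update."""
    return per_device * accum * devices


def linear_lr(step):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Warmup phase: ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay phase: fall linearly from the peak to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in (0, 400, 800, 4900, 9000):
        print(step, linear_lr(step))
```

Under these settings the learning rate peaks at step 800 and has fallen to half its peak by step 4900, the midpoint of the decay phase.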

Training results

Training Loss	Epoch	Step	Validation Loss	WER
6.0574	0.25	500	2.0297	0.9991
1.224	0.5	1000	0.5368	0.4342
0.434	0.75	1500	0.4861	0.3891
0.3295	1.01	2000	0.4301	0.3411
0.2739	1.26	2500	0.3818	0.3053
0.2619	1.51	3000	0.3894	0.3060
0.2517	1.76	3500	0.3497	0.2802
0.2244	2.01	4000	0.3519	0.2792
0.1854	2.26	4500	0.3376	0.2718
0.1779	2.51	5000	0.3206	0.2520
0.1749	2.77	5500	0.3169	0.2535
0.1636	3.02	6000	0.3122	0.2465
0.137	3.27	6500	0.3054	0.2382
0.1311	3.52	7000	0.2956	0.2280
0.1261	3.77	7500	0.2898	0.2236
0.1187	4.02	8000	0.2847	0.2176
0.1011	4.27	8500	0.2763	0.2124
0.0981	4.52	9000	0.2754	0.2115
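The Epoch and Step columns can be combined with the total batch size of 64 to estimate the number of optimizer steps per epoch and, from that, the rough size of the training split. The sketch below uses a few rows copied from the table; because the logged epoch values are rounded to two decimals, the result is only an approximation and not a figure reported by the authors.

```python
# (step, epoch) pairs copied from the training results table.
LOG = [(500, 0.25), (2000, 1.01), (4000, 2.01), (9000, 4.52)]
EFFECTIVE_BATCH = 64  # total_train_batch_size from the hyperparameters


def steps_per_epoch(step, epoch):
    """Estimate optimizer steps per epoch from one logged row."""
    return step / epoch


# The last row gives the most stable estimate, since rounding error
# in the epoch value matters least when the epoch count is largest.
final_step, final_epoch = LOG[-1]
estimated_steps = steps_per_epoch(final_step, final_epoch)
estimated_examples = estimated_steps * EFFECTIVE_BATCH

if __name__ == "__main__":
    print("steps per epoch (approx.):", round(estimated_steps))
    print("training examples per epoch (approx.):", round(estimated_examples))
```

Note that the Wer column in this table is a fraction measured on the validation set during training (0.2115 at the final step), whereas the evaluation table reports percentages on the test set.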

Framework versions

Transformers 4.38.2
Pytorch 2.2.1+cu121
Datasets 2.18.0
Tokenizers 0.15.2