---
license: apache-2.0
base_model: facebook/wav2vec2-large-xlsr-53
metrics:
- wer
model-index:
- name: wav2vec2-xlsr-53-ft-ccv-en-cy
  results: []
datasets:
- techiaith/commonvoice_16_1_en_cy
language:
- cy
- en
pipeline_tag: automatic-speech-recognition
---

# wav2vec2-xlsr-53-ft-cy-en-withlm

This model is a version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
that has been fine-tuned on a custom bilingual dataset derived from the Welsh
and English data releases of the Mozilla Foundation's Common Voice project. See [techiaith/commonvoice_16_1_en_cy](https://huggingface.co/datasets/techiaith/commonvoice_16_1_en_cy).
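
For reference, the fine-tuning dataset can be inspected with the `datasets` library. This is only a minimal sketch; the split and column names are whatever the dataset repository publishes, so inspect the returned object rather than assuming them:

```python
from datasets import load_dataset

# Load the bilingual (Welsh + English) Common Voice derivative used for fine-tuning.
# Print the DatasetDict to see the available splits and columns.
ds = load_dataset("techiaith/commonvoice_16_1_en_cy")
print(ds)
```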

In addition, this model includes a single KenLM n-gram language model trained on balanced 
collections of Welsh and English texts from [OSCAR](https://huggingface.co/datasets/oscar). 
Using one bilingual n-gram model avoids any need for language detection to choose between separate Welsh and English n-gram models during CTC decoding. 
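
The published checkpoint already bundles this decoder, so nothing below is required for normal use. It is only a hedged sketch of how a single bilingual KenLM model could be attached to the processor with `pyctcdecode`; the `bilingual_cy_en.arpa` path is a placeholder, not a published file:

```python
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ProcessorWithLM,
)

# Load the tokenizer and feature extractor from the fine-tuned checkpoint.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# The decoder labels must follow the tokenizer's id order.
vocab = tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# "bilingual_cy_en.arpa" is a placeholder path for a KenLM model trained on
# mixed Welsh + English OSCAR text.
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="bilingual_cy_en.arpa")

# A single processor that performs CTC beam search with the bilingual LM,
# with no language detection step.
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    decoder=decoder,
)
```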


## Usage

The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:

```python
import torch
import librosa

from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Load the processor (feature extractor + tokenizer + KenLM-backed decoder) and the model.
processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# Load and resample the audio to the 16 kHz rate the model expects.
audio, rate = librosa.load("<path/to/audio_file>", sr=16000)

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run the acoustic model to obtain frame-level CTC logits.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Beam-search decode the logits with the bilingual n-gram language model.
print("Prediction: ", processor.batch_decode(logits.numpy(), beam_width=10).text[0].strip())

```

Usage with a pipeline is even simpler...

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")

# The pipeline accepts a local file path, a URL, or a raw waveform array.
def transcribe(audio):
    return transcriber(audio)["text"]

transcribe("<path/or/url/to/any/audiofile>")
```


## Evaluation


On a balanced English+Welsh test set derived from Common Voice version 16.1, `techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm` achieves a word error rate (WER) of **23.79%**.

However, when evaluated on language-specific test sets, the model performs markedly better on Welsh than on English.

| Common Voice Test Set Language | WER (%) | CER (%) |
| ------------------------------ | ------- | ------- |
| EN+CY                          | 23.79   | 9.68    |
| EN                             | 34.47   | 14.83   |
| CY                             | 12.34   | 3.55    |
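
The figures above could, in principle, be reproduced with the `evaluate` library. The sketch below only shows the metric computation given reference/hypothesis pairs, since the exact split and column names of the test sets are not restated here:

```python
import evaluate

# Load the word error rate and character error rate metrics.
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def score(references, hypotheses):
    """Return (WER %, CER %) for parallel lists of reference and predicted transcripts."""
    wer = 100 * wer_metric.compute(references=references, predictions=hypotheses)
    cer = 100 * cer_metric.compute(references=references, predictions=hypotheses)
    return wer, cer

# Toy example; a real evaluation would iterate over the EN+CY, EN-only or
# CY-only test subsets of techiaith/commonvoice_16_1_en_cy.
print(score(["mae hi'n braf heddiw"], ["mae hi'n braf heddiw"]))
```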


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 800
- training_steps: 9000
- mixed_precision_training: Native AMP
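
Expressed as Hugging Face `TrainingArguments`, these settings correspond roughly to the following. This is a sketch only, covering just the hyperparameters listed above; the output directory is a placeholder:

```python
from transformers import TrainingArguments

# Sketch mapping the listed hyperparameters onto TrainingArguments;
# "wav2vec2-xlsr-53-ft-cy-en" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-53-ft-cy-en",
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=2,   # effective train batch size of 64
    lr_scheduler_type="linear",
    warmup_steps=800,
    max_steps=9000,
    fp16=True,                       # native AMP mixed precision
)
```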

### Training results

| Training Loss | Epoch | Step | Validation Loss | Wer    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 6.0574        | 0.25  | 500  | 2.0297          | 0.9991 |
| 1.224         | 0.5   | 1000 | 0.5368          | 0.4342 |
| 0.434         | 0.75  | 1500 | 0.4861          | 0.3891 |
| 0.3295        | 1.01  | 2000 | 0.4301          | 0.3411 |
| 0.2739        | 1.26  | 2500 | 0.3818          | 0.3053 |
| 0.2619        | 1.51  | 3000 | 0.3894          | 0.3060 |
| 0.2517        | 1.76  | 3500 | 0.3497          | 0.2802 |
| 0.2244        | 2.01  | 4000 | 0.3519          | 0.2792 |
| 0.1854        | 2.26  | 4500 | 0.3376          | 0.2718 |
| 0.1779        | 2.51  | 5000 | 0.3206          | 0.2520 |
| 0.1749        | 2.77  | 5500 | 0.3169          | 0.2535 |
| 0.1636        | 3.02  | 6000 | 0.3122          | 0.2465 |
| 0.137         | 3.27  | 6500 | 0.3054          | 0.2382 |
| 0.1311        | 3.52  | 7000 | 0.2956          | 0.2280 |
| 0.1261        | 3.77  | 7500 | 0.2898          | 0.2236 |
| 0.1187        | 4.02  | 8000 | 0.2847          | 0.2176 |
| 0.1011        | 4.27  | 8500 | 0.2763          | 0.2124 |
| 0.0981        | 4.52  | 9000 | 0.2754          | 0.2115 |


### Framework versions

- Transformers 4.38.2
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2