---
language:
- ro
license: apache-2.0
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
- gigant/romanian_speech_synthesis_0_8_1
model-index:
- name: wav2vec2-ro-300m_01
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event
      type: speech-recognition-community-v2/dev_data
      args: ro
    metrics:
    - name: Dev WER (without LM)
      type: wer
      value: 46.99
    - name: Dev CER (without LM)
      type: cer
      value: 16.04
    - name: Dev WER (with LM)
      type: wer
      value: 38.63
    - name: Dev CER (with LM)
      type: cer
      value: 14.52
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice
      type: mozilla-foundation/common_voice_8_0
      args: ro
    metrics:
    - name: Test WER (without LM)
      type: wer
      value: 11.73
    - name: Test CER (without LM)
      type: cer
      value: 2.93
    - name: Test WER (with LM)
      type: wer
      value: 7.31
    - name: Test CER (with LM)
      type: cer
      value: 2.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: ro
    metrics:
    - name: Test WER
      type: wer
      value: 43.23
---
You can test this model online with the [**Space for Romanian Speech Recognition**](https://huggingface.co/spaces/gigant/romanian-speech-recognition).
The model ranked **TOP-1** on Romanian Speech Recognition during Hugging Face's Robust Speech Challenge:
* [**The 🤗 Speech Bench**](https://huggingface.co/spaces/huggingface/hf-speech-bench)
* [**Speech Challenge Leaderboard**](https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard)
# Romanian Wav2Vec2
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) dataset, with extra training data from the [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1) dataset.
Without the 5-gram Language Model optimization, it achieves the following results on the evaluation set (Common Voice 8.0, Romanian subset, test split):
- Loss: 0.1553
- Wer: 0.1174
- Cer: 0.0294
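For readers unfamiliar with these metrics, here is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly defined: the edit distance between reference and hypothesis, divided by the reference length. This is an illustration only, not the evaluation script used for this model. The function names are hypothetical, and whether spaces count as characters in CER depends on the implementation (here they do).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong word out of three, one wrong character out of twelve
print(wer("ana are mere", "ana are pere"))
print(cer("ana are mere", "ana are pere"))
```

Corpus-level scores are usually obtained by summing the errors over all utterances before dividing by the total reference length, rather than by averaging per-utterance scores.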
## Model description
The architecture is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) with a speech recognition CTC head and an added 5-gram language model (using [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) and [kenlm](https://github.com/kpu/kenlm)) trained on the [Romanian Corpora Parliament](https://huggingface.co/datasets/gigant/ro_corpora_parliament_processed) dataset. Both libraries must be installed for the language model-boosted decoder to work.
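To illustrate what the CTC head produces, the sketch below shows plain greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, and drop the blank token. The vocabulary and blank id are toy values chosen for the example, not this model's actual tokenizer. The language model-boosted decoder replaces this greedy step with a beam search rescored by the 5-gram model.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the prediction changes and is not the blank
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "d", 2: "a", 3: " "}

# Per-frame argmax ids, as they might come out of the CTC head
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]
print(ctc_greedy_decode(frames, vocab))  # da da
```

The blank token is what lets CTC output a doubled letter: `[2, 0, 2]` decodes to `"aa"`, whereas `[2, 2]` collapses to `"a"`.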
## Intended uses & limitations
The model is intended for speech recognition in Romanian from audio clips sampled at **16 kHz**. The predicted text is lowercased and does not contain any punctuation.
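Because the output is lowercase and unpunctuated, reference transcripts usually need the same treatment before predictions are compared with them. The following is a minimal sketch under assumptions: the exact normalization used when training this model is not documented here, the function name is hypothetical, and keeping hyphens (common in Romanian, as in "într-o") is a choice made for the example.

```python
import re


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation (keeping hyphens), and collapse whitespace."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so Romanian diacritics are preserved
    text = re.sub(r"[^\w\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("Bună ziua, ce faci?"))  # bună ziua ce faci
```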
## How to use
Make sure the dependencies for the language model-boosted version are installed. The following command installs the `kenlm` and `pyctcdecode` libraries:
```
pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```
With the `transformers` framework, you can load the model with the following code:
```
from transformers import AutoProcessor, AutoModelForCTC
processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")
model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")
```
Or, if you just want to test the model, you can load the automatic speech recognition pipeline from `transformers`:
```
from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")
```
## Example use with the `datasets` library
First, load your data.
We will use the [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1) dataset in this example.
```
from datasets import load_dataset
dataset = load_dataset("gigant/romanian_speech_synthesis_0_8_1")
```
You can listen to the samples with the `IPython.display` library:
```
from IPython.display import Audio
i = 0
sample = dataset["train"][i]
Audio(sample["audio"]["array"], rate = sample["audio"]["sampling_rate"])
```
The model is trained to work with audio sampled at 16 kHz, so if the sampling rate of the audio in the dataset is different, it has to be resampled.
In this example, the audio is sampled at 48 kHz. We can see this by checking `dataset["train"][0]["audio"]["sampling_rate"]`.
The following code resamples the audio using the `torchaudio` library:
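```
import torch
import torchaudio

i = 0
sample = dataset["train"][i]
audio = sample["audio"]["array"]
rate = sample["audio"]["sampling_rate"]

# Resample from the dataset's native rate to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(orig_freq=rate, new_freq=16_000)
audio_16 = resampler(torch.tensor(audio, dtype=torch.float32)).numpy()
```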
To listen to the resampled audio:
```
Audio(audio_16, rate=16000)
```
Now you can get the model prediction by running:
```
# The pipeline returns a dict; the transcription is stored under the "text" key
predicted_text = asr(audio_16)["text"]
ground_truth = dataset["train"][i]["sentence"]
print(f"Predicted text : {predicted_text}")
print(f"Ground truth : {ground_truth}")
```
## Training and evaluation data
Training data:
- [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0): train + validation + other splits
- [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1): train + test splits
Evaluation data:
- [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0): test split
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.003
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 3
- total_train_batch_size: 48
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 50.0
- mixed_precision_training: Native AMP
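The relationship between some of these values can be sketched in a few lines. The total batch size follows from the per-device batch size and gradient accumulation (assuming a single device, which is not stated above), and a linear scheduler with warmup ramps the learning rate up over the first 500 steps and then decays it linearly to zero. The total step count of 32,200 is an estimate (about 644 optimizer steps per epoch over 50 epochs, inferred from the training log), and the function name is hypothetical.

```python
TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 3
NUM_DEVICES = 1  # assumption: single-device training

total_train_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
print(total_train_batch_size)  # 48


def linear_warmup_decay(step, base_lr=0.003, warmup_steps=500, total_steps=32_200):
    """Learning rate at a given optimizer step for a linear schedule with warmup."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    # Linear decay from base_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 16_350, 32_200):
    print(step, linear_warmup_decay(step))
```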
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 2.9272 | 0.78 | 500 | 0.7603 | 0.7734 | 0.2355 |
| 0.6157 | 1.55 | 1000 | 0.4003 | 0.4866 | 0.1247 |
| 0.4452 | 2.33 | 1500 | 0.2960 | 0.3689 | 0.0910 |
| 0.3631 | 3.11 | 2000 | 0.2580 | 0.3205 | 0.0796 |
| 0.3153 | 3.88 | 2500 | 0.2465 | 0.2977 | 0.0747 |
| 0.2795 | 4.66 | 3000 | 0.2274 | 0.2789 | 0.0694 |
| 0.2615 | 5.43 | 3500 | 0.2277 | 0.2685 | 0.0675 |
| 0.2389 | 6.21 | 4000 | 0.2135 | 0.2518 | 0.0627 |
| 0.2229 | 6.99 | 4500 | 0.2054 | 0.2449 | 0.0614 |
| 0.2067 | 7.76 | 5000 | 0.2096 | 0.2378 | 0.0597 |
| 0.1977 | 8.54 | 5500 | 0.2042 | 0.2387 | 0.0600 |
| 0.1896 | 9.32 | 6000 | 0.2110 | 0.2383 | 0.0595 |
| 0.1801 | 10.09 | 6500 | 0.1909 | 0.2165 | 0.0548 |
| 0.174 | 10.87 | 7000 | 0.1883 | 0.2206 | 0.0559 |
| 0.1685 | 11.65 | 7500 | 0.1848 | 0.2097 | 0.0528 |
| 0.1591 | 12.42 | 8000 | 0.1851 | 0.2039 | 0.0514 |
| 0.1537 | 13.2 | 8500 | 0.1881 | 0.2065 | 0.0518 |
| 0.1504 | 13.97 | 9000 | 0.1840 | 0.1972 | 0.0499 |
| 0.145 | 14.75 | 9500 | 0.1845 | 0.2029 | 0.0517 |
| 0.1417 | 15.53 | 10000 | 0.1884 | 0.2003 | 0.0507 |
| 0.1364 | 16.3 | 10500 | 0.2010 | 0.2037 | 0.0517 |
| 0.1331 | 17.08 | 11000 | 0.1838 | 0.1923 | 0.0483 |
| 0.129 | 17.86 | 11500 | 0.1818 | 0.1922 | 0.0489 |
| 0.1198 | 18.63 | 12000 | 0.1760 | 0.1861 | 0.0465 |
| 0.1203 | 19.41 | 12500 | 0.1686 | 0.1839 | 0.0465 |
| 0.1225 | 20.19 | 13000 | 0.1828 | 0.1920 | 0.0479 |
| 0.1145 | 20.96 | 13500 | 0.1673 | 0.1784 | 0.0446 |
| 0.1053 | 21.74 | 14000 | 0.1802 | 0.1810 | 0.0456 |
| 0.1071 | 22.51 | 14500 | 0.1769 | 0.1775 | 0.0444 |
| 0.1053 | 23.29 | 15000 | 0.1920 | 0.1783 | 0.0457 |
| 0.1024 | 24.07 | 15500 | 0.1904 | 0.1775 | 0.0446 |
| 0.0987 | 24.84 | 16000 | 0.1793 | 0.1762 | 0.0446 |
| 0.0949 | 25.62 | 16500 | 0.1801 | 0.1766 | 0.0443 |
| 0.0942 | 26.4 | 17000 | 0.1731 | 0.1659 | 0.0423 |
| 0.0906 | 27.17 | 17500 | 0.1776 | 0.1698 | 0.0424 |
| 0.0861 | 27.95 | 18000 | 0.1716 | 0.1600 | 0.0406 |
| 0.0851 | 28.73 | 18500 | 0.1662 | 0.1630 | 0.0410 |
| 0.0844 | 29.5 | 19000 | 0.1671 | 0.1572 | 0.0393 |
| 0.0792 | 30.28 | 19500 | 0.1768 | 0.1599 | 0.0407 |
| 0.0798 | 31.06 | 20000 | 0.1732 | 0.1558 | 0.0394 |
| 0.0779 | 31.83 | 20500 | 0.1694 | 0.1544 | 0.0388 |
| 0.0718 | 32.61 | 21000 | 0.1709 | 0.1578 | 0.0399 |
| 0.0732 | 33.38 | 21500 | 0.1697 | 0.1523 | 0.0391 |
| 0.0708 | 34.16 | 22000 | 0.1616 | 0.1474 | 0.0375 |
| 0.0678 | 34.94 | 22500 | 0.1698 | 0.1474 | 0.0375 |
| 0.0642 | 35.71 | 23000 | 0.1681 | 0.1459 | 0.0369 |
| 0.0661 | 36.49 | 23500 | 0.1612 | 0.1411 | 0.0357 |
| 0.0629 | 37.27 | 24000 | 0.1662 | 0.1414 | 0.0355 |
| 0.0587 | 38.04 | 24500 | 0.1659 | 0.1408 | 0.0351 |
| 0.0581 | 38.82 | 25000 | 0.1612 | 0.1382 | 0.0352 |
| 0.0556 | 39.6 | 25500 | 0.1647 | 0.1376 | 0.0345 |
| 0.0543 | 40.37 | 26000 | 0.1658 | 0.1335 | 0.0337 |
| 0.052 | 41.15 | 26500 | 0.1716 | 0.1369 | 0.0343 |
| 0.0513 | 41.92 | 27000 | 0.1600 | 0.1317 | 0.0330 |
| 0.0491 | 42.7 | 27500 | 0.1671 | 0.1311 | 0.0328 |
| 0.0463 | 43.48 | 28000 | 0.1613 | 0.1289 | 0.0324 |
| 0.0468 | 44.25 | 28500 | 0.1599 | 0.1260 | 0.0315 |
| 0.0435 | 45.03 | 29000 | 0.1556 | 0.1232 | 0.0308 |
| 0.043 | 45.81 | 29500 | 0.1588 | 0.1240 | 0.0309 |
| 0.0421 | 46.58 | 30000 | 0.1567 | 0.1217 | 0.0308 |
| 0.04 | 47.36 | 30500 | 0.1533 | 0.1198 | 0.0302 |
| 0.0389 | 48.14 | 31000 | 0.1582 | 0.1185 | 0.0297 |
| 0.0387 | 48.91 | 31500 | 0.1576 | 0.1187 | 0.0297 |
| 0.0376 | 49.69 | 32000 | 0.1560 | 0.1182 | 0.0295 |
### Framework versions
- Transformers 4.16.2
- Pytorch 1.10.0+cu111
- Tokenizers 0.11.0
- pyctcdecode 0.3.0
- kenlm