File size: 3,667 Bytes
4e65560 9093ef4 02eba6b 4e65560 9093ef4 4e65560 d2cddb0 7a082e2 d2cddb0 a34d081 4e65560 2a52e61 4e65560 d2cddb0 9093ef4 d2cddb0 9093ef4 d2cddb0 4e65560 1904c22 4e65560 abda790 4f142cf 4e65560 5800c55 e8dfebd 7b46440 e8dfebd 7b46440 e8dfebd 4e65560 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 |
---
language:
- ia
license: apache-2.0
tags:
- automatic-speech-recognition
- generated_from_trainer
- hf-asr-leaderboard
- robust-speech-event
- mozilla-foundation/common_voice_8_0
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
model-index:
- name: wav2vec2-large-xls-r-300m-ia
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 8
type: mozilla-foundation/common_voice_8_0
args: ia
metrics:
- name: Test WER using LM
type: wer
value: 8.6074
- name: Test CER using LM
type: cer
value: 2.4147
---
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->
# wav2vec2-large-xls-r-300m-ia
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the common_voice dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1452
- Wer: 0.1253
## Training Procedure
Training is conducted in Google Colab, the training notebook provided in the repo
## Training and evaluation data
Language Model Created from texts from processed sentence in train + validation split of dataset (common voice 8.0 for Interlingua)
Evaluation is conducted in Notebook, you can see within the repo "notebook_evaluation_wav2vec2_ia.ipynb"
Test WER without LM
wer = 20.1776 %
cer = 4.7205 %
Test WER using
wer = 8.6074 %
cer = 2.4147 %
evaluation using eval.py
```
huggingface-cli login #login to huggingface for getting auth token to access the common voice v8
#running with LM
python eval.py --model_id ayameRushia/wav2vec2-large-xls-r-300m-ia --dataset mozilla-foundation/common_voice_8_0 --config ia --split test
# running without LM
python eval.py --model_id ayameRushia/wav2vec2-large-xls-r-300m-ia --dataset mozilla-foundation/common_voice_8_0 --config ia --split test --greedy
```
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 7.432 | 1.87 | 400 | 2.9636 | 1.0 |
| 2.6922 | 3.74 | 800 | 2.2111 | 0.9977 |
| 1.2581 | 5.61 | 1200 | 0.4864 | 0.4028 |
| 0.6232 | 7.48 | 1600 | 0.2807 | 0.2413 |
| 0.4479 | 9.35 | 2000 | 0.2219 | 0.1885 |
| 0.3654 | 11.21 | 2400 | 0.1886 | 0.1606 |
| 0.323 | 13.08 | 2800 | 0.1716 | 0.1444 |
| 0.2935 | 14.95 | 3200 | 0.1687 | 0.1443 |
| 0.2707 | 16.82 | 3600 | 0.1632 | 0.1382 |
| 0.2559 | 18.69 | 4000 | 0.1507 | 0.1337 |
| 0.2433 | 20.56 | 4400 | 0.1572 | 0.1358 |
| 0.2338 | 22.43 | 4800 | 0.1489 | 0.1305 |
| 0.2258 | 24.3 | 5200 | 0.1485 | 0.1278 |
| 0.2218 | 26.17 | 5600 | 0.1470 | 0.1272 |
| 0.2169 | 28.04 | 6000 | 0.1470 | 0.1270 |
| 0.2117 | 29.91 | 6400 | 0.1452 | 0.1253 |
### Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.0+cu111
- Datasets 1.18.3
- Tokenizers 0.11.0
|