Model description
This model is a fine-tuned version of facebook/wav2vec2-xls-r-1b on my collection of Public Japanese Voice datasets for research Common Voice 7.0, JUST (Japanese speech corpus of Saruwatari-lab., University of Tokyo), JSSS (Japanese speech corpus for summarization and simplification), CSS10 (A collection of single speaker speech datasets). You can find in preprocessing dataset in here VUMICHIEN/COMMON_VOICE_LARGE_JSUT_JSSS_CSS10.
Total training data:
~60 hours
Benchmark WER result:
COMMON VOICE 7.0 | COMMON VOICE 8.0 | |
---|---|---|
without LM | 10.96 | 10.91 |
with 4-grams LM | 7.98 | 7.88 |
Benchmark CER result:
COMMON VOICE 7.0 | COMMON VOICE 8.0 | |
---|---|---|
without LM | 4.28 | 4.22 |
with 4-grams LM | 3.42 | 3.35 |
Evaluation
Please use the eval.py file to run the evaluation:
pip install mecab-python3 unidic-lite pykakasi
python eval.py --model_id vumichien/wav2vec2-xls-r-1b-japanese --dataset mozilla-foundation/common_voice_7_0 --config ja --split test --chunk_length_s 5.0 --stride_length_s 1.0 --log_outputs
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 100.0
- mixed_precision_training: Native AMP
Training results
Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
---|---|---|---|---|---|
2.2896 | 3.37 | 1500 | 0.4748 | 0.4013 | 0.1767 |
1.1608 | 6.74 | 3000 | 0.3350 | 0.3159 | 0.1456 |
1.1042 | 10.11 | 4500 | 0.3119 | 0.2971 | 0.1400 |
1.0494 | 13.48 | 6000 | 0.2974 | 0.2867 | 0.1353 |
1.0061 | 16.85 | 7500 | 0.2802 | 0.2746 | 0.1300 |
0.9629 | 20.22 | 9000 | 0.2844 | 0.2776 | 0.1326 |
0.9267 | 23.59 | 10500 | 0.2577 | 0.2603 | 0.1255 |
0.8984 | 26.96 | 12000 | 0.2508 | 0.2531 | 0.1226 |
0.8729 | 30.34 | 13500 | 0.2629 | 0.2606 | 0.1254 |
0.8546 | 33.71 | 15000 | 0.2402 | 0.2447 | 0.1193 |
0.8304 | 37.08 | 16500 | 0.2532 | 0.2472 | 0.1209 |
0.8075 | 40.45 | 18000 | 0.2439 | 0.2469 | 0.1198 |
0.7827 | 43.82 | 19500 | 0.2387 | 0.2372 | 0.1167 |
0.7627 | 47.19 | 21000 | 0.2344 | 0.2331 | 0.1147 |
0.7402 | 50.56 | 22500 | 0.2314 | 0.2299 | 0.1135 |
0.718 | 53.93 | 24000 | 0.2257 | 0.2267 | 0.1114 |
0.7016 | 57.3 | 25500 | 0.2204 | 0.2184 | 0.1089 |
0.6804 | 60.67 | 27000 | 0.2227 | 0.2181 | 0.1085 |
0.6625 | 64.04 | 28500 | 0.2138 | 0.2112 | 0.1058 |
0.6465 | 67.42 | 30000 | 0.2141 | 0.2081 | 0.1044 |
0.6238 | 70.79 | 31500 | 0.2172 | 0.2082 | 0.1050 |
0.6062 | 74.16 | 33000 | 0.2174 | 0.2058 | 0.1043 |
0.588 | 77.53 | 34500 | 0.2156 | 0.2034 | 0.1027 |
0.5722 | 80.9 | 36000 | 0.2162 | 0.2032 | 0.1029 |
0.5585 | 84.27 | 37500 | 0.2156 | 0.2022 | 0.1021 |
0.5456 | 87.64 | 39000 | 0.2126 | 0.1993 | 0.1009 |
0.5325 | 91.01 | 40500 | 0.2121 | 0.1966 | 0.1003 |
0.5229 | 94.38 | 42000 | 0.2104 | 0.1941 | 0.0991 |
0.5134 | 97.75 | 43500 | 0.2108 | 0.1948 | 0.0992 |
Framework versions
- Transformers 4.16.0.dev0
- Pytorch 1.10.1+cu102
- Datasets 1.17.1.dev0
- Tokenizers 0.11.0
- Downloads last month
- 32
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.
Dataset used to train vumichien/wav2vec2-xls-r-1b-japanese
Evaluation results
- Test WER (with LM) on Common Voice 7.0self-reported7.980
- Test CER (with LM) on Common Voice 7.0self-reported3.420
- Test WER (with LM) on Common Voice 8.0self-reported7.880
- Test CER (with LM) on Common Voice 8.0self-reported3.350
- Test WER (with LM) on Robust Speech Event - Dev Dataself-reported28.070
- Test CER (with LM) on Robust Speech Event - Dev Dataself-reported16.270
- Test CER on Robust Speech Event - Test Dataself-reported19.890