---
base_model: facebook/w2v-bert-2.0
datasets:
- common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: Wer
      type: wer
      value: 0.0655
license: apache-2.0
---
🚨🚨🚨 **ATTENTION!** 🚨🚨🚨
**Use an updated model**: https://huggingface.co/Yehor/w2v-bert-uk-v2.1
---
# w2v-bert-uk `v1`
## Community
- **Discord**: https://discord.gg/yVAjkBgmt4
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk
## Google Colab
You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing
## Metrics
- AM (acoustic model only):
  - WER: 0.0727
  - CER: 0.0151
  - Accuracy: 92.73%
- AM + LM (acoustic model with language model):
  - WER: 0.0655
  - CER: 0.0139
  - Accuracy: 93.45%
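For reference, metrics like the WER and CER above can be computed with the `jiwer` package. The snippet below is a minimal sketch of the metric computation; the reference and hypothesis strings are made-up examples, not transcripts from the evaluation set:

```python
# pip install -U jiwer
import jiwer

# Hypothetical reference/hypothesis pair, for illustration only
reference = 'тестове речення для прикладу'
hypothesis = 'тестове речення для приклад'

print('WER:', jiwer.wer(reference, hypothesis))  # word error rate
print('CER:', jiwer.cer(reference, hypothesis))  # character error rate
```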
## Hyperparameters
This model was trained with the following hyperparameters on 2x RTX A4000 GPUs:
```
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
--custom_set ~/cv10/train.csv \
--custom_set_eval ~/cv10/test.csv \
--num_train_epochs 15 \
--tokenize_config . \
--w2v2_bert_model facebook/w2v-bert-2.0 \
--batch 4 \
--num_proc 5 \
--grad_accum 1 \
--learning_rate 3e-5 \
--logging_steps 20 \
--eval_step 500 \
--group_by_length \
--attention_dropout 0.0 \
--activation_dropout 0.05 \
--feat_proj_dropout 0.05 \
--feat_quantizer_dropout 0.0 \
--hidden_dropout 0.05 \
--layerdrop 0.0 \
--final_dropout 0.0 \
--mask_time_prob 0.0 \
--mask_time_length 10 \
--mask_feature_prob 0.0 \
--mask_feature_length 10
```
## Usage
```python
# pip install -U torch soundfile transformers
import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor
# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:1'  # or 'cpu'
sampling_rate = 16_000
# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)
paths = [
'sample1.wav',
]
# Read the audio files (expected to be 16 kHz mono WAV)
audio_inputs = []
for path in paths:
    audio_input, _ = sf.read(path)
    audio_inputs.append(audio_input)
# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)
with torch.no_grad():
    logits = asr_model(features).logits
predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)
# Log results
print('Predictions:')
print(predictions)
```
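The AM + LM numbers in the Metrics section come from decoding the acoustic model's CTC logits with an external language model. This card does not include LM decoding code; the sketch below (continuing from the snippet above) shows one way to do it with `pyctcdecode`, assuming you supply your own Ukrainian KenLM model. The `lm.binary` path is a placeholder, not a file from this repository:

```python
# pip install -U pyctcdecode kenlm
from pyctcdecode import build_ctcdecoder

# CTC vocabulary in id order, taken from the processor's tokenizer
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

# 'lm.binary' is a placeholder for a KenLM model you provide yourself;
# depending on the tokenizer, the word delimiter token (usually '|') may
# need to be mapped to a space before building the decoder.
decoder = build_ctcdecoder(labels, kenlm_model_path='lm.binary')

# Decode the logits produced by the snippet above (first file in the batch)
log_probs = torch.log_softmax(logits, dim=-1)[0].cpu().numpy()
print(decoder.decode(log_probs))
```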