---
language:
- ro
license: apache-2.0
tags:
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
- gigant/romanian_speech_synthesis_0_8_1
metrics:
- wer
pinned: true
base_model: openai/whisper-medium
model-index:
- name: Whisper Medium Romanian
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 ro
      type: mozilla-foundation/common_voice_11_0
      config: ro
      split: test
      args: ro
    metrics:
    - type: wer
      value: 4.73
      name: Wer
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs ro
      type: google/fleurs
      config: ro
      split: test
      args: ro
    metrics:
    - type: wer
      value: 19.64
      name: Wer
---
# Whisper Medium Romanian
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the Common Voice 11.0 dataset and the Romanian speech synthesis corpus.
It achieves the following results on the evaluation set:
- eval_loss: 0.06453
- eval_wer: 4.717
- epoch: 7.03
- step: 3500
## Model description
The architecture is the same as [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).
## Training and evaluation data
The model was trained on the Common Voice 11.0 dataset (`train+validation+other` splits) and the Romanian speech synthesis corpus, and was tested on the `test` split of the Common Voice 11.0 dataset.
## Usage
Inference with 🤗 Transformers:
```python
import torch
from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the fine-tuned model and its processor
processor = WhisperProcessor.from_pretrained("gigant/whisper-medium-romanian")
model = WhisperForConditionalGeneration.from_pretrained("gigant/whisper-medium-romanian")

# Stream one Romanian test sample from Common Voice 11.0
# (the dataset is gated: accept its terms on the Hub and log in first)
ds = load_dataset("mozilla-foundation/common_voice_11_0", "ro", split="test", streaming=True)
# Resample to the 16 kHz rate that Whisper expects
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]["array"]

# Force Romanian transcription rather than language detection or translation
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="ro", task="transcribe")

# Convert the raw waveform into log-Mel input features
input_features = processor(input_speech, sampling_rate=16_000, return_tensors="pt").input_features

# Generate token ids without tracking gradients, then decode them to text
with torch.no_grad():
    predicted_ids = model.generate(input_features, max_length=448)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
The code was adapted from [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP
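
To make the `linear` scheduler with warmup concrete, the following sketch computes the learning rate at a given step from the values listed above. It assumes the usual linear-warmup, linear-decay shape (ramp from 0 to the peak over the warmup steps, then decay to 0 at the final step); the function name `linear_schedule_lr` is hypothetical and this is not the training code itself.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 1e-5,
                       warmup_steps: int = 500,
                       total_steps: int = 5000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the learning rate at a few points of the schedule
for step in (0, 250, 500, 3500, 5000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.3e}")
```

Under these assumptions, the learning rate at step 3500 (the checkpoint reported above) would be roughly a third of the peak value.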
### Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2