---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- ivrit-ai/whisper-training
---
# Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
See the [whisper-large-v2 model card](https://huggingface.co/openai/whisper-large-v2) for more details.
**whisper-v2-d3-e3** is a version of whisper-large-v2, fine-tuned by [ivrit.ai](https://www.ivrit.ai) to improve Hebrew ASR using crowd-sourced labeling.
## Model details
This model comes as a single checkpoint, whisper-v2-d3-e3.
It is a 1550M-parameter multilingual ASR model.
# Usage
To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SAMPLING_RATE = 16000

has_cuda = torch.cuda.is_available()
model_path = 'ivrit-ai/whisper-v2-d3-e3'

model = WhisperForConditionalGeneration.from_pretrained(model_path)
if has_cuda:
    model.to('cuda:0')

processor = WhisperProcessor.from_pretrained(model_path)

# `entry` is assumed to be a row of an existing dataset with an 'audio' column.
# Alternatively, the audio can be loaded from an audio file.
audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)

# Convert the 16 kHz waveform into the model's input features.
input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
if has_cuda:
    input_features = input_features.to('cuda:0')

# Decode with beam search, forcing Hebrew as the output language.
predicted_ids = model.generate(input_features, language='he', num_beams=5)
transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(f'Transcript: {transcript[0]}')
```
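The usage snippet notes that audio can also be loaded from a file. As a minimal sketch of that path using only the Python standard library, the helper below reads a WAV file into a list of floats. The function name `load_wav_16k_mono` is hypothetical, and the sketch assumes the file is already mono, 16-bit PCM, and sampled at 16 kHz; it does not resample, so other files still need a tool such as `librosa.resample` first.

```python
import array
import sys
import wave

SAMPLING_RATE = 16000  # same rate as in the usage snippet


def load_wav_16k_mono(path):
    """Read a mono, 16-bit PCM, 16 kHz WAV file as floats in [-1.0, 1.0).

    Hypothetical helper for illustration; it rejects any other format
    instead of converting it.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != SAMPLING_RATE:
            raise ValueError(f"expected {SAMPLING_RATE} Hz audio; resample first")
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Scale integer samples to floating point.
    return [sample / 32768.0 for sample in samples]
```

The returned list can stand in for `audio_resample` in the snippet above, since the processor is called with an explicit `sampling_rate=SAMPLING_RATE`.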
## Evaluation
You can use the reference script [evaluate_model.py](https://github.com/yairl/ivrit.ai/blob/master/evaluate_model.py) on GitHub to evaluate the model's quality.
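For a quick sanity check without the reference script, word error rate (WER) is a common ASR quality metric. The sketch below is an illustrative, dependency-free implementation. It is an assumption that WER is the metric of interest here; the function `word_error_rate` is not taken from `evaluate_model.py`, and it applies no text normalization (punctuation, diacritics) before comparing.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# One of two reference words is missing from the hypothesis.
print(word_error_rate("shalom olam", "shalom"))  # 0.5
```

In practice, both strings are usually normalized the same way before scoring, since WER is sensitive to punctuation and spelling variants.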
## Long-Form Transcription
The Whisper model is designed to work on audio samples of up to 30 seconds in duration. However, by using a chunking
algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers
[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:
```python
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="ivrit-ai/whisper-v2-d3-e3",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
>>> prediction
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
>>> prediction
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
```
Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
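To give a feel for what chunking does, the sketch below computes overlapping windows over a long recording. It is a simplified illustration, not the pipeline's implementation: the function `chunk_windows` and its `stride_s` parameter are hypothetical, and the 5-second overlap is chosen to mirror the pipeline's documented `stride_length_s` default of `chunk_length_s / 6`. The real pipeline works on samples and also merges the overlapping predictions, which this sketch does not attempt.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows in seconds covering `total_s` of audio.

    Each window is at most `chunk_length_s` long, and consecutive windows
    overlap by 2 * stride_s so context is shared across boundaries.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must exceed twice the stride")

    step = chunk_length_s - 2 * stride_s  # how far each window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording is covered by three overlapping 30-second windows.
print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

The overlap is what lets a word that straddles a boundary appear whole in at least one window.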
### BibTeX entry and citation info
**ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development**
```bibtex
@misc{marmor2023ivritai,
title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development},
author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
year={2023},
eprint={2307.08720},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
```
**Whisper: Robust Speech Recognition via Large-Scale Weak Supervision**
```bibtex
@misc{radford2022whisper,
doi = {10.48550/ARXIV.2212.04356},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```