---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- ivrit-ai/whisper-training
---
+
117
+ # Whisper
118
+
119
+ Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
120
+ More details about it are available [here](https://huggingface.co/openai/whisper-large-v2).
121
+
122
+ **whisper-v2-d3-e3** is a version of whisper-large-v2, fine-tuned by [ivrit.ai](https://www.ivrit.ai) to improve Hebrew ASR using crowd-sourced labeling.
123
+
124
+ ## Model details
125
+
126
+ This model comes as a single checkpoint, whisper-v2-d3-e3.
127
+ It is a 1550M parameters multi-lingual ASR solution.
128
+
129
+ # Usage
130
+
131
+ To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).
132
+
133
+ ```python
134
+ import torch
135
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
136
+
137
+ SAMPLING_RATE = 16000
138
+
139
+ has_cuda = torch.cuda.is_available()
140
+ model_path = 'ivrit-ai/whisper-v2-d3-e3'
141
+
142
+ model = WhisperForConditionalGeneration.from_pretrained(model_path)
143
+ if has_cuda:
144
+ model.to('cuda:0')
145
+
146
+ processor = WhisperProcessor.from_pretrained(model_path)
147
+
148
+ # audio_resample based on entry being part of an existing dataset.
149
+ # Alternatively, this can be loaded from an audio file.
150
+ audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)
151
+
152
+ input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
153
+ if has_cuda:
154
+ input_features = input_features.to('cuda:0')
155
+
156
+ predicted_ids = model.generate(input_features, language='he', num_beams=5)
157
+ transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)
158
+
159
+ print(f'Transcript: {transcription[0]}')
160
+ ```
161
+
162
+ ## Evaluation
163
+
164
+ You can use the [evaluate_model.py](https://github.com/yairl/ivrit.ai/blob/master/evaluate_model.py) reference on GitHub to evalute the model's quality.
165
+
166
+ ## Long-Form Transcription
167
+
168
+ The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
169
+ algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
170
+ [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
171
+ method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
172
+ can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
173
+
174
+ ```python
175
+ >>> import torch
176
+ >>> from transformers import pipeline
177
+ >>> from datasets import load_dataset
178
+
179
+ >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
180
+
181
+ >>> pipe = pipeline(
182
+ >>> "automatic-speech-recognition",
183
+ >>> model="ivrit-ai/whisper-v2-d3-e3",
184
+ >>> chunk_length_s=30,
185
+ >>> device=device,
186
+ >>> )
187
+
188
+ >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
189
+ >>> sample = ds[0]["audio"]
190
+
191
+ >>> prediction = pipe(sample.copy(), batch_size=8)["text"]
192
+ " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
193
+
194
+ >>> # we can also return timestamps for the predictions
195
+ >>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
196
+ [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
197
+ 'timestamp': (0.0, 5.44)}]
198
+ ```

Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.

### BibTeX entry and citation info

**ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development**
```bibtex
@misc{marmor2023ivritai,
  title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development},
  author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
  year={2023},
  eprint={2307.08720},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```

**Whisper: Robust Speech Recognition via Large-Scale Weak Supervision**
```bibtex
@misc{radford2022whisper,
  doi={10.48550/ARXIV.2212.04356},
  url={https://arxiv.org/abs/2212.04356},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher={arXiv},
  year={2022},
  copyright={arXiv.org perpetual, non-exclusive license}
}
```