Commit c11facb by sanchit-gandhi
1 Parent(s): e9ee6fb

Update README.md

Files changed (1)
  1. README.md +158 -90
README.md CHANGED
@@ -112,172 +112,238 @@ pipeline_tag: automatic-speech-recognition
112
  license: apache-2.0
113
  ---
114
 
115
- # Whisper
116
 
117
- [OpenAI's Whisper](https://openai.com/blog/whisper/)
 
 
118
 
119
- The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
 
120
 
121
- **Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copy pasted from the original model card.
 
122
 
 
 
123
 
124
- ## Intro
125
 
126
- The first paragraphs of the abstract read as follows :
 
127
 
128
- > We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning.
129
- > When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
 
 
130
 
131
- The original code repository can be found [here](https://github.com/openai/whisper).
 
 
 
 
132
 
133
- ## Model details
 
 
 
 
 
 
 
134
 
135
- The Whisper models are trained for speech recognition and translation tasks, capable of transcribing speech audio into the text in the language it is spoken (ASR) as well as translated into English (speech translation). Researchers at OpenAI developed the models to study the robustness of speech processing systems trained under large-scale weak supervision. There are 9 models of different sizes and capabilities, summarised in the following table.
136
-
137
- | Size | Parameters | English-only model | Multilingual model |
138
- |:------:|:----------:|:------------------:|:------------------:|
139
- | tiny | 39 M | ✓ | ✓ |
140
- | base | 74 M | ✓ | ✓ |
141
- | small | 244 M | ✓ | ✓ |
142
- | medium | 769 M | ✓ | ✓ |
143
- | large | 1550 M | | ✓ |
144
-
145
-
146
-
147
- ## Model description
148
 
149
- Whisper is an auto-regressive automatic speech recognition encoder-decoder model that was trained on 680 000 hours of 16kHz sampled multilingual audio. It was fully trained in a supervised manner, with multiple tasks :
150
 
151
- - English transcription
152
- - Any-to-English speech translation
153
- - Non-English transcription
154
- - No speech prediction
155
 
156
- To each task corresponds a sequence of tokens that are given to the decoder as *context tokens*. The beginning of a transcription always starts with `<|startoftranscript|>` which is why the `decoder_start_token` is always set to `tokenizer.encode("<|startoftranscript|>")`. The following token should be the language token, which is automatically detected in the original code. Finally, the task is define using either `<|transcribe|>` or `<|translate|>`. In addition, a `<|notimestamps|>` token is added if the task does not include timestamp prediction.
 
 
 
 
 
157
 
 
 
 
 
 
158
 
159
- # Usage
 
 
160
 
161
- To transcribe or translate audio files, the model has to be used along a `WhisperProcessor`. The `WhisperProcessor.get_decoder_prompt_ids` function is used to get a list of `( idx, token )` tuples, which can either be set in the config, or directly passed to the generate function, as `forced_decoder_ids`.
162
 
 
 
 
163
 
164
- ## Transcription
165
- In the following example, the english only model is used. We set the `decoder_input_ids` accordingly.
166
 
 
167
 
168
- ### English to english
169
- The "<|en|>" token is used to specify that the speech is in english and should be transcribed to english
 
170
 
171
  ```python
172
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
173
  >>> from datasets import load_dataset
174
- >>> import torch
175
 
176
  >>> # load model and processor
177
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
178
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
 
179
 
180
- >>> # load dummy dataset and read soundfiles
181
  >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
182
- >>> input_features = processor(ds[0]["audio"]["array"], return_tensors="pt").input_features
 
183
 
184
- >>> # Generate logits
185
- >>> logits = model(input_features, decoder_input_ids = torch.tensor([[50258]])).logits
186
- >>> # take argmax and decode
187
- >>> predicted_ids = torch.argmax(logits, dim=-1)
188
- >>> transcription = processor.batch_decode(predicted_ids)
189
- ['<|en|>']
 
 
190
  ```
 
191
 
192
  ### French to French
193
- In order to obtain the full transcription, the `generate()` function is used. The following example demonstrates a french to french
194
- transcription.
195
 
196
  ```python
197
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
198
- >>> from datasets import load_dataset
199
- >>> import torch
200
 
201
  >>> # load model and processor
202
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
203
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
 
204
 
205
- >>> # load dummy dataset and read soundfiles
206
  >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
207
- >>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16_000))
208
- >>> input_speech = next(iter(ds))["audio"]["array"]
209
- >>> model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language = "fr", task = "transcribe")
210
- >>> input_features = processor(input_speech, return_tensors="pt").input_features
211
- >>> predicted_ids = model.generate(input_features)
 
 
212
  >>> transcription = processor.batch_decode(predicted_ids)
213
  ['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
214
 
215
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens = True)
216
  [' Un vrai travail intéressant va enfin être mené sur ce sujet.']
217
  ```
218
 
219
  ## Translation
220
- The `"<|translate|>"` token is used as the first decoder input token to specify the translation task.
221
 
222
  ### French to English
223
 
224
  ```python
225
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
226
- >>> from datasets import load_dataset
227
- >>> import torch
228
 
229
  >>> # load model and processor
230
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
231
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
 
232
 
233
- >>> # load dummy dataset and read soundfiles
234
  >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
235
- >>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16_000))
236
- >>> input_speech = next(iter(ds))["audio"]["array"]
237
- >>> # tokenize
238
- >>> input_features = processor(input_speech, return_tensors="pt").input_features
239
- >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language = "fr", task = "translate")
240
-
241
- >>> predicted_ids = model.generate(input_features, forced_decoder_ids = forced_decoder_ids)
242
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens = True)
243
- [' A real interesting work will be done on this subject.']
244
  ```
245
 
246
  ## Evaluation
247
 
248
- This code snippet shows how to evaluate **openai/whisper-large-v2** on LibriSpeech's "clean" and "other" test data.
249
 
250
  ```python
251
  >>> from datasets import load_dataset
252
  >>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
253
- >>> import soundfile as sf
254
  >>> import torch
255
- >>> from jiwer import wer
256
 
 
257
 
258
- >>> librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
259
-
260
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to("cuda")
261
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
 
262
 
263
  >>> def map_to_pred(batch):
264
- >>> input_features = processor(batch["audio"]["array"], return_tensors="pt").input_features
265
-
 
 
266
  >>> with torch.no_grad():
267
- >>> logits = model(input_features.to("cuda")).logits
268
-
269
- >>> predicted_ids = torch.argmax(logits, dim=-1)
270
- >>> transcription = processor.batch_decode(predicted_ids, normalize = True)
271
- >>> batch['text'] = processor.tokenizer._normalize(batch['text'])
272
- >>> batch["transcription"] = transcription
273
  >>> return batch
274
 
275
- >>> result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
276
 
277
- >>> print("WER:", wer(result["text"], result["transcription"]))
278
- 0.030003583080317572
 
279
  ```
280

281
 
282
  ### Evaluated Use
283
 
@@ -314,12 +380,14 @@ There are also potential dual use concerns that come with releasing Whisper. Whi
314
 
315
 
316
  ### BibTeX entry and citation info
317
- *Since no official citation was provided, we use the following in the mean time*
318
  ```bibtex
319
  @misc{radford2022whisper,
320
- title={Robust Speech Recognition via Large-Scale Weak Supervision.},
321
- author={Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever},
322
- year={2022},
323
- url={https://cdn.openai.com/papers/whisper.pdf},
 
 
 
324
  }
325
  ```
 
112
  license: apache-2.0
113
  ---
114
 
115
+ # Whisper
116
 
117
+ Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
118
+ of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
119
+ for fine-tuning.
120
 
121
+ Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
122
+ by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
123
 
124
+ Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization
125
+ for improved performance.
126
 
127
+ **Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
128
+ copied and pasted from the original model card.
129
 
130
+ ## Model details
131
 
132
+ Whisper is a Transformer based encoder-decoder model, also referred to as a _sequence-to-sequence_ model.
133
+ It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.
134
 
135
+ The models were trained on either English-only data or multilingual data. The English-only models were trained
136
+ on the task of speech recognition. The multilingual models were trained on both speech recognition and speech
137
+ translation. For speech recognition, the model predicts transcriptions in the *same* language as the audio.
138
+ For speech translation, the model predicts transcriptions in a *different* language from the audio.
139
 
140
+ Whisper checkpoints come in five configurations of varying model sizes.
141
+ The smallest four are trained on either English-only or multilingual data.
142
+ The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints
143
+ are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
144
+ checkpoints are summarised in the following table with links to the models on the Hub:
145
 
146
+ | Size | Parameters | English-only | Multilingual |
147
+ |----------|------------|------------------------------------------------------|-----------------------------------------------------|
148
+ | tiny | 39 M | [✓](https://huggingface.co/openai/whisper-tiny.en) | [✓](https://huggingface.co/openai/whisper-tiny) |
149
+ | base | 74 M | [✓](https://huggingface.co/openai/whisper-base.en) | [✓](https://huggingface.co/openai/whisper-base) |
150
+ | small | 244 M | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
151
+ | medium | 769 M | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
152
+ | large | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large) |
153
+ | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
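+
+ For example, any of the checkpoints in the table can be loaded by name. As a minimal sketch, here are the multilingual and English-only tiny checkpoints:
+
+ ```python
+ >>> from transformers import WhisperForConditionalGeneration
+
+ >>> # multilingual checkpoint (speech recognition and speech translation)
+ >>> multilingual_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
+ >>> # English-only checkpoint (speech recognition only)
+ >>> english_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
+ ```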
154
 
155
+ # Usage
156
 
157
+ To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).
158
 
159
+ The `WhisperProcessor` is used to:
160
+ 1. Pre-process the audio inputs (converting them to log-Mel spectrograms for the model)
161
+ 2. Post-process the model outputs (converting them from tokens to text)
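+
+ A minimal sketch of these two steps, using the dummy LibriSpeech sample from the examples below:
+
+ ```python
+ >>> from transformers import WhisperProcessor
+ >>> from datasets import load_dataset
+
+ >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
+ >>> sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
+
+ >>> # 1. pre-process: raw audio array -> log-Mel spectrogram input features
+ >>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
+
+ >>> # 2. post-process: predicted token ids -> text (the full examples below obtain `predicted_ids` from `model.generate`)
+ >>> # transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+ ```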
 
162
 
163
+ The model is informed of which task to perform (transcription or translation) by passing the appropriate "context tokens". These context tokens
164
+ are a sequence of tokens that are given to the decoder at the start of the decoding process, and take the following order:
165
+ 1. The transcription always starts with the `<|startoftranscript|>` token
166
+ 2. The second token is the language token (e.g. `<|en|>` for English)
167
+ 3. The third token is the "task token". It can take one of two values: `<|transcribe|>` for speech recognition or `<|translate|>` for speech translation
168
+ 4. In addition, a `<|notimestamps|>` token is added if the model should not include timestamp prediction
169
 
170
+ Thus, a typical sequence of context tokens might look as follows:
171
+ ```
172
+ <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
173
+ ```
174
+ This sequence tells the model to decode in English, under the task of speech recognition, and not to predict timestamps.
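+
+ As a minimal sketch (assuming the `openai/whisper-large-v2` processor and model are loaded as `processor` and `model`, as in the
+ examples below), these context tokens are ordinary entries in the tokenizer vocabulary, and the first of them is the model's
+ decoder start token:
+
+ ```python
+ >>> # map the context tokens above to their token ids
+ >>> context_token_ids = processor.tokenizer.convert_tokens_to_ids(["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"])
+
+ >>> # `<|startoftranscript|>` is the decoder start token used by `generate`
+ >>> model.config.decoder_start_token_id == context_token_ids[0]
+ True
+ ```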
175
 
176
+ These tokens can either be forced or un-forced. If they are forced, the model is made to predict each token at
177
+ each position. This allows one to control the output language and task for the Whisper model. If they are un-forced,
178
+ the Whisper model will automatically predict the output language and task itself.
179
 
180
+ The context tokens can be set accordingly:
181
 
182
+ ```python
183
+ model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
184
+ ```
185
 
186
+ This forces the model to predict in English under the task of speech recognition.
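+
+ As a minimal sketch (assuming a `WhisperProcessor` loaded as `processor`, as in the examples below), the prompt ids returned by
+ `get_decoder_prompt_ids` are `(position, token id)` pairs that decode back to the context tokens described above:
+
+ ```python
+ >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
+ >>> # decode the forced token ids to verify they correspond to the context tokens
+ >>> processor.tokenizer.decode([token_id for _, token_id in forced_decoder_ids])
+ '<|en|><|transcribe|><|notimestamps|>'
+ ```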
 
187
 
188
+ ## Transcription
189
 
190
+ ### English to English
191
+ In this example, the context tokens are 'unforced', meaning the model automatically predicts the output language
192
+ (English) and task (transcribe).
193
 
194
  ```python
195
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
196
  >>> from datasets import load_dataset
 
197
 
198
  >>> # load model and processor
199
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
200
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
201
+ >>> model.config.forced_decoder_ids = None
202
 
203
+ >>> # load dummy dataset and read audio files
204
  >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
205
+ >>> sample = ds[0]["audio"]
206
+ >>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
207
 
208
+ >>> # generate token ids
209
+ >>> predicted_ids = model.generate(input_features)
210
+ >>> # decode token ids to text
211
+ >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
212
+ ['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']
213
+
214
+ >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
215
+ [' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
216
  ```
217
+ The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
218
 
219
  ### French to French
220
+ The following example demonstrates French to French transcription by setting the decoder ids appropriately.
 
221
 
222
  ```python
223
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
224
+ >>> from datasets import Audio, load_dataset
 
225
 
226
  >>> # load model and processor
227
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
228
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
229
+ >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
230
 
231
+ >>> # load streaming dataset and read first audio sample
232
  >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
233
+ >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
234
+ >>> input_speech = next(iter(ds))["audio"]
235
+ >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
236
+
237
+ >>> # generate token ids
238
+ >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
239
+ >>> # decode token ids to text
240
  >>> transcription = processor.batch_decode(predicted_ids)
241
  ['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
242
 
243
+ >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
244
  [' Un vrai travail intéressant va enfin être mené sur ce sujet.']
245
  ```
246
 
247
  ## Translation
248
+ Setting the task to "translate" forces the Whisper model to perform speech translation.
249
 
250
  ### French to English
251
 
252
  ```python
253
  >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
254
+ >>> from datasets import Audio, load_dataset
 
255
 
256
  >>> # load model and processor
257
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
258
  >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
259
+ >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
260
 
261
+ >>> # load streaming dataset and read first audio sample
262
  >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
263
+ >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
264
+ >>> input_speech = next(iter(ds))["audio"]
265
+ >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
266
+
267
+ >>> # generate token ids
268
+ >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
269
+ >>> # decode token ids to text
270
+ >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
271
+ [' A very interesting work, we will finally be given on this subject.']
272
  ```
273
 
274
  ## Evaluation
275
 
276
+ This code snippet shows how to evaluate Whisper Large V2 on [LibriSpeech test-clean](https://huggingface.co/datasets/librispeech_asr):
277
 
278
  ```python
279
  >>> from datasets import load_dataset
280
  >>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
 
281
  >>> import torch
282
+ >>> from evaluate import load
283
 
284
+ >>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
285
 
 
 
 
286
  >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
287
+ >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to("cuda")
288
 
289
  >>> def map_to_pred(batch):
290
+ >>> audio = batch["audio"]
291
+ >>> input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
292
+ >>> batch["reference"] = processor.tokenizer._normalize(batch['text'])
293
+ >>>
294
  >>> with torch.no_grad():
295
+ >>> predicted_ids = model.generate(input_features.to("cuda"))[0]
296
+ >>> transcription = processor.decode(predicted_ids)
297
+ >>> batch["prediction"] = processor.tokenizer._normalize(transcription)
 
 
 
298
  >>> return batch
299
 
300
+ >>> result = librispeech_test_clean.map(map_to_pred)
301
 
302
+ >>> wer = load("wer")
303
+ >>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
304
+ 3.0003583080317572
305
  ```
306
 
307
+ ## Long-Form Transcription
308
+
309
+ The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
310
+ algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible with the Transformers
311
+ [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
312
+ function. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. The pipeline can also be extended to
313
+ predict utterance-level timestamps by passing `return_timestamps=True`:
314
+
315
+ ```python
316
+ >>> import torch
317
+ >>> from transformers import pipeline
318
+ >>> from datasets import load_dataset
319
+
320
+ >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
321
+
322
+ >>> pipe = pipeline(
323
+ >>> "automatic-speech-recognition",
324
+ >>> model="openai/whisper-large-v2",
325
+ >>> chunk_length_s=30,
326
+ >>> device=device,
327
+ >>> )
328
+
329
+ >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
330
+ >>> sample = ds[0]["audio"]
331
+
332
+ >>> prediction = pipe(sample)["text"]
333
+ " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
334
+
335
+ >>> # we can also return timestamps for the predictions
336
+ >>> prediction = pipe(sample, return_timestamps=True)["chunks"]
337
+ [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
338
+ 'timestamp': (0.0, 5.44)}]
339
+ ```
340
+
341
+ ## Fine-Tuning
342
+
343
+ The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
344
+ its predictive capabilities can be improved further for certain languages and tasks through *fine-tuning*. The blog
345
+ post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) provides a step-by-step
346
+ guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.
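+
+ As a minimal sketch of the supervised objective used for fine-tuning (assuming the dummy LibriSpeech sample from the examples
+ above; a complete training recipe is given in the blog post):
+
+ ```python
+ >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
+ >>> from datasets import load_dataset
+
+ >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
+ >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
+
+ >>> sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]
+
+ >>> # input features from the audio, labels from the reference transcription
+ >>> input_features = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt").input_features
+ >>> labels = processor.tokenizer(sample["text"], return_tensors="pt").input_ids
+
+ >>> # the forward pass returns the cross-entropy loss that fine-tuning minimises
+ >>> loss = model(input_features, labels=labels).loss
+ ```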
347
 
348
  ### Evaluated Use
349
 
 
380
 
381
 
382
  ### BibTeX entry and citation info
 
383
  ```bibtex
384
  @misc{radford2022whisper,
385
+ doi = {10.48550/ARXIV.2212.04356},
386
+ url = {https://arxiv.org/abs/2212.04356},
387
+ author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
388
+ title = {Robust Speech Recognition via Large-Scale Weak Supervision},
389
+ publisher = {arXiv},
390
+ year = {2022},
391
+ copyright = {arXiv.org perpetual, non-exclusive license}
392
  }
393
  ```