+---
+language:
+- en
+- zh
+- de
+- es
+- ru
+- ko
+- fr
+- ja
+- pt
+- tr
+- pl
+- ca
+- nl
+- ar
+- sv
+- it
+- id
+- hi
+- fi
+- vi
+- he
+- uk
+- el
+- ms
+- cs
+- ro
+- da
+- hu
+- ta
+- 'no'
+- th
+- ur
+- hr
+- bg
+- lt
+- la
+- mi
+- ml
+- cy
+- sk
+- te
+- fa
+- lv
+- bn
+- sr
+- az
+- sl
+- kn
+- et
+- mk
+- br
+- eu
+- is
+- hy
+- ne
+- mn
+- bs
+- kk
+- sq
+- sw
+- gl
+- mr
+- pa
+- si
+- km
+- sn
+- yo
+- so
+- af
+- oc
+- ka
+- be
+- tg
+- sd
+- gu
+- am
+- yi
+- lo
+- uz
+- fo
+- ht
+- ps
+- tk
+- nn
+- mt
+- sa
+- lb
+- my
+- bo
+- tl
+- mg
+- as
+- tt
+- haw
+- ln
+- ha
+- ba
+- jw
+- su
+tags:
+- audio
+- automatic-speech-recognition
+- hf-asr-leaderboard
+widget:
+- example_title: Librispeech sample 1
+  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+- example_title: Librispeech sample 2
+  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+pipeline_tag: automatic-speech-recognition
+license: apache-2.0
+datasets:
+- ivrit-ai/whisper-training
+---
+# Whisper
+Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
+More details about it are available [here](https://huggingface.co/openai/whisper-large-v2).
+**whisper-v2-d3-e3** is a version of whisper-large-v2, fine-tuned by [ivrit.ai](https://www.ivrit.ai) to improve Hebrew ASR using crowd-sourced labeling.
+## Model details
+This model comes as a single checkpoint, whisper-v2-d3-e3.
+It is a 1550M-parameter multilingual ASR model.
+# Usage
+To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).
+```python
+import librosa
+import torch
+from transformers import WhisperProcessor, WhisperForConditionalGeneration
+SAMPLING_RATE = 16000  # Whisper expects 16 kHz audio
+device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
+# Load the fine-tuned checkpoint and its matching processor.
+model_path = 'ivrit-ai/whisper-v2-d3-e3'
+model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)
+processor = WhisperProcessor.from_pretrained(model_path)
+# audio_resample is based on entry being part of an existing dataset.
+# Alternatively, this can be loaded from an audio file.
+audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)
+# Convert the waveform to input features and move them to the model's device.
+input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
+input_features = input_features.to(device)
+# Generate Hebrew output with beam search, then decode the token IDs to text.
+predicted_ids = model.generate(input_features, language='he', num_beams=5)
+transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)
+print(f'Transcript: {transcript[0]}')
+```
+## Evaluation
+You can use the [evaluate_model.py](https://github.com/yairl/ivrit.ai/blob/master/evaluate_model.py) reference script on GitHub to evaluate the model's quality.
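Word error rate (WER) is a common quality metric for ASR output. The sketch below is a minimal, dependency-free illustration of how WER can be computed for one reference/hypothesis pair. It is not taken from `evaluate_model.py`, which may normalize text and aggregate scores differently; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words.
reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # WER: 0.25
```

In practice the reference and hypothesis would usually be normalized (punctuation, casing, and for Hebrew possibly diacritics) before scoring; that step is left out here.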
+## Long-Form Transcription
+The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
+algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers
+[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:
+```python
+>>> import torch
+>>> from transformers import pipeline
+>>> from datasets import load_dataset
+>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
+>>> pipe = pipeline(
+...   "automatic-speech-recognition",
+...   model="ivrit-ai/whisper-v2-d3-e3",
+...   chunk_length_s=30,
+...   device=device,
+... )
+>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+>>> sample = ds[0]["audio"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
+" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
+>>> # we can also return timestamps for the predictions
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
+[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
+  'timestamp': (0.0, 5.44)}]
+```
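With `return_timestamps=True`, each element of `"chunks"` carries a `text` string and a `(start, end)` tuple in seconds, as in the output above. A minimal sketch of turning such a list into readable lines follows. `format_seconds` and `format_chunks` are hypothetical helpers, not part of Transformers, and the handling of a missing (`None`) timestamp is an assumption made for robustness.

```python
from typing import Optional


def format_seconds(seconds: Optional[float]) -> str:
    """Format a time in seconds as MM:SS.ss, or a placeholder if it is missing."""
    if seconds is None:
        return "??:??.??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunks(chunks: list[dict]) -> list[str]:
    """Render each timestamped chunk as '[start -> end] text'."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_seconds(start)} -> {format_seconds(end)}] {text}")
    return lines


# Same structure as the "chunks" output shown in the example above.
example_chunks = [
    {"text": " Mr. Quilter is the apostle of the middle classes.", "timestamp": (0.0, 5.44)},
]
for line in format_chunks(example_chunks):
    print(line)
```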
+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
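To make the chunking idea concrete, the sketch below computes overlapping window boundaries for a long recording. It is an illustration only and not the pipeline's actual implementation: `chunk_windows` is a hypothetical helper, and the 5-second overlap on each side of a window is an assumed value chosen for the example.

```python
def chunk_windows(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) windows in seconds that cover [0, total_s].

    Consecutive windows overlap by 2 * stride_s so that text near a window
    edge is also seen with more context in the neighboring window.
    """
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than 2 * stride_s")

    step = chunk_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording split into 30-second windows.
print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Each window would be transcribed separately and the overlapping regions reconciled afterwards; how that merge is done is described in the blog post and is not reproduced here.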
+### BibTeX entry and citation info
+**ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development**
+```bibtex
+@misc{marmor2023ivritai,
+      title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development},
+      author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
+      year={2023},
+      eprint={2307.08720},
+      archivePrefix={arXiv},
+      primaryClass={eess.AS}
+}
+```
+**Whisper: Robust Speech Recognition via Large-Scale Weak Supervision**
+```bibtex
+@misc{radford2022whisper,
+  doi = {10.48550/ARXIV.2212.04356},
+  url = {https://arxiv.org/abs/2212.04356},
+  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```