	---
	language:
	- en
	- zh
	- de
	- es
	- ru
	- ko
	- fr
	- ja
	- pt
	- tr
	- pl
	- ca
	- nl
	- ar
	- sv
	- it
	- id
	- hi
	- fi
	- vi
	- he
	- uk
	- el
	- ms
	- cs
	- ro
	- da
	- hu
	- ta
	- 'no'
	- th
	- ur
	- hr
	- bg
	- lt
	- la
	- mi
	- ml
	- cy
	- sk
	- te
	- fa
	- lv
	- bn
	- sr
	- az
	- sl
	- kn
	- et
	- mk
	- br
	- eu
	- is
	- hy
	- ne
	- mn
	- bs
	- kk
	- sq
	- sw
	- gl
	- mr
	- pa
	- si
	- km
	- sn
	- yo
	- so
	- af
	- oc
	- ka
	- be
	- tg
	- sd
	- gu
	- am
	- yi
	- lo
	- uz
	- fo
	- ht
	- ps
	- tk
	- nn
	- mt
	- sa
	- lb
	- my
	- bo
	- tl
	- mg
	- as
	- tt
	- haw
	- ln
	- ha
	- ba
	- jw
	- su
	tags:
	- audio
	- automatic-speech-recognition
	- hf-asr-leaderboard
	widget:
	- example_title: Librispeech sample 1
	  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
	- example_title: Librispeech sample 2
	  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
	pipeline_tag: automatic-speech-recognition
	license: apache-2.0
	datasets:
	- ivrit-ai/whisper-training
	---

	# Whisper

	Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
	See the [whisper-large-v2 model card](https://huggingface.co/openai/whisper-large-v2) for more details.

	whisper-v2-d3-e3 is a version of whisper-large-v2, fine-tuned by [ivrit.ai](https://www.ivrit.ai) to improve Hebrew ASR using crowd-sourced labeling.

	## Model details

	This model comes as a single checkpoint, whisper-v2-d3-e3.
	It is a 1550M-parameter multilingual ASR model.
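
As a rough guide to hardware requirements, the parameter count can be turned into an estimate of the memory needed for the weights alone. The sketch below is only back-of-the-envelope arithmetic: `weight_memory_gb` is a hypothetical helper, the byte sizes are the standard widths of each dtype, and activations, beam search state and framework overhead are not counted.

```python
# Bytes used by one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float32"):
    """Estimate the size of the model weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 1550M parameters, as stated in the model details.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(1_550_000_000, dtype):.2f} GB")
```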

	# Usage

	To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).

	```python
	import librosa
	import torch
	from transformers import WhisperProcessor, WhisperForConditionalGeneration

	SAMPLING_RATE = 16000

	has_cuda = torch.cuda.is_available()
	model_path = 'ivrit-ai/whisper-v2-d3-e3'

	model = WhisperForConditionalGeneration.from_pretrained(model_path)
	if has_cuda:
	    model.to('cuda:0')

	processor = WhisperProcessor.from_pretrained(model_path)

	# `entry` is assumed to be a row of an existing dataset with an 'audio' column.
	# Alternatively, the audio can be loaded from an audio file.
	# Whisper expects 16 kHz input, so resample to SAMPLING_RATE first.
	audio_resample = librosa.resample(entry['audio']['array'], orig_sr=entry['audio']['sampling_rate'], target_sr=SAMPLING_RATE)

	# Convert the waveform to log-mel input features.
	input_features = processor(audio_resample, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_features
	if has_cuda:
	    input_features = input_features.to('cuda:0')

	# Generate token ids with Hebrew as the target language, then decode them to text.
	predicted_ids = model.generate(input_features, language='he', num_beams=5)
	transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)

	print(f'Transcript: {transcript[0]}')
	```
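
The snippet above takes its audio from a dataset entry. When the audio lives in a file instead, it first has to be read into an array of samples together with its sampling rate. The following is a minimal sketch using only the Python standard library, assuming an uncompressed 16-bit PCM WAV file; `read_wav_mono` is a hypothetical helper, and other formats would need a dedicated audio library.

```python
import array
import sys
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM WAV file.

    Returns (samples, sampling_rate), where samples is a list of floats
    in [-1.0, 1.0) and multi-channel audio is averaged down to mono.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        num_channels = wav.getnchannels()
        sampling_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Channels are interleaved; average each frame and scale to [-1.0, 1.0).
    samples = [
        sum(pcm[i:i + num_channels]) / (num_channels * 32768.0)
        for i in range(0, len(pcm), num_channels)
    ]
    return samples, sampling_rate
```

The returned list can be wrapped with `numpy.asarray(samples, dtype="float32")` and passed to `librosa.resample` with `orig_sr` set to the returned sampling rate, in place of the dataset entry.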

	## Evaluation

	You can use the [evaluate_model.py](https://github.com/yairl/ivrit.ai/blob/master/evaluate_model.py) reference script on GitHub to evaluate the model's quality.
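
ASR quality is commonly reported as word error rate (WER): the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below illustrates that metric in plain Python. It is an assumption that WER is the metric of interest here, `word_error_rate` is a hypothetical helper rather than part of the reference script, and no text normalization is applied.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion over four reference words.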

	## Long-Form Transcription

	The Whisper model is designed to work on audio samples of up to 30s in duration. However, by using a chunking
	algorithm, it can transcribe audio of arbitrary length. This is possible through the Transformers
	[`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
	method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
	can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

	```python
	>>> import torch
	>>> from transformers import pipeline
	>>> from datasets import load_dataset

	>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

	>>> pipe = pipeline(
	...     "automatic-speech-recognition",
	...     model="ivrit-ai/whisper-v2-d3-e3",
	...     chunk_length_s=30,
	...     device=device,
	... )

	>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	>>> sample = ds[0]["audio"]

	>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
	" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

	>>> # we can also return timestamps for the predictions
	>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
	[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
	  'timestamp': (0.0, 5.44)}]
	```
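
The `chunks` list returned with `return_timestamps=True` is convenient for producing subtitles. As an illustration, the sketch below converts such a list into SubRip (SRT) text. `chunks_to_srt` and `format_srt_time` are hypothetical helpers, and treating a missing end timestamp as equal to the start is an assumption made to keep the output well-formed.

```python
def format_srt_time(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Convert [{'text': ..., 'timestamp': (start, end)}, ...] to SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start  # assumption: fall back to a zero-length cue
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

Passing the timestamped prediction to `chunks_to_srt` and writing the result to a `.srt` file gives a subtitle track that most media players accept.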

	Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
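
To make the chunking idea concrete, the sketch below computes the sample ranges of overlapping windows for a long recording. It is a simplified illustration and not the pipeline's actual implementation: `chunk_bounds` is a hypothetical helper, and the 5s overlap on each side is an assumed value chosen for the example.

```python
def chunk_bounds(num_samples, sampling_rate=16000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping chunks.

    Consecutive chunks advance by chunk length minus twice the stride,
    so each chunk shares `stride_length_s` of audio context with its
    neighbours on both sides.
    """
    if num_samples <= 0:
        return []
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must be more than twice the stride")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break  # the last chunk reaches the end of the audio
        start += step
    return bounds


# A 70s recording at 16 kHz is covered by three overlapping 30s windows.
for start, end in chunk_bounds(70 * 16000):
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

Each window is transcribed independently, which is what allows batched inference; the overlapping regions give the model context at the window edges before the partial transcripts are merged.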



	### BibTeX entry and citation info

	ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development
	```bibtex
	@misc{marmor2023ivritai,
	  title = {ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development},
	  author = {Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
	  year = {2023},
	  eprint = {2307.08720},
	  archivePrefix = {arXiv},
	  primaryClass = {eess.AS}
	}
	```

	Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
	```bibtex
	@misc{radford2022whisper,
	  doi = {10.48550/ARXIV.2212.04356},
	  url = {https://arxiv.org/abs/2212.04356},
	  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
	  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
	  publisher = {arXiv},
	  year = {2022},
	  copyright = {arXiv.org perpetual, non-exclusive license}
	}
	```