Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Abstract
As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
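As a concrete illustration of the speculative-decoding pairing described in the abstract, the sketch below loads Whisper as the main model and Distil-Whisper as the draft (assistant) model through the Hugging Face Transformers pipeline. The checkpoint names (openai/whisper-large-v2, distil-whisper/distil-large-v2), the assistant_model generation argument, and the sample audio path are assumptions drawn from the released code and library, not details stated in the abstract itself.

```python
# Minimal sketch: Whisper verifies tokens proposed by Distil-Whisper, so the
# transcription matches Whisper's own output while decoding runs faster.
# Model IDs, the assistant_model kwarg, and the audio path are assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: the original Whisper checkpoint whose outputs we want to preserve.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

# Draft model: the distilled variant proposes candidate tokens for verification.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample_audio.wav")  # hypothetical path to a local audio file
print(result["text"])
```

Because the larger model accepts or rejects every proposed token, this setup trades extra memory for lower latency without changing the transcription Whisper alone would produce.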
Community
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data (2023)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models (2023)
- CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders (2023)
- Massive End-to-end Models for Short Search Queries (2023)
- Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition (2023)
Models citing this paper: 54
Datasets citing this paper: 0