Distil-Whisper Acknowledgements
Hey @rsonavane ! We've been working extensively on distilling Whisper and just came across your model. Super cool to see that you tried a technique similar to ours (KL + CE loss for decoder distillation) and got some good results! We'd like to include this model in the acknowledgements as a nod that you too had experimented with this early on. Also keen to hear whether you experimented any further with this, e.g. by scaling it up to larger datasets or higher model compression. Would love to share notes!
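For anyone following the thread, here is a minimal sketch of what a combined KL + CE decoder distillation objective can look like in PyTorch; the loss weights and temperature are illustrative assumptions, not values from either project:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      kl_weight=0.8, ce_weight=1.0, temperature=2.0):
    """Combine cross-entropy on ground-truth tokens with KL to the teacher.

    Weights and temperature are placeholder hyperparameters for illustration.
    """
    # Cross-entropy of the student against the ground-truth transcript tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between temperature-softened teacher and student distributions,
    # rescaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce_weight * ce + kl_weight * kl
```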
For this experiment, I only considered the Common Voice and LibriSpeech datasets. My goal was to create a model with near whisper-large-v2 accuracy while achieving wav2vec2 ONNX inference speed. Though it started as a fun experiment, it helped me understand the effect of decoder pruning. I benchmarked latency, WER, and CER on an A100 for short-form transcription audio while varying the number of decoder layers to prune. Having read your paper and the experiments you present, I'd say your conclusions are spot-on.
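In case it helps anyone reproducing this kind of pruning experiment, here is a minimal sketch of building a shallower student decoder from whisper-large-v2 using the Hugging Face transformers API; the choice of which teacher layers to keep (first and last here) and the two-layer depth are illustrative assumptions, not the exact recipe from either project:

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Load the teacher and a student config that is identical except for decoder depth.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v2")
student_config.decoder_layers = 2  # illustrative target depth
student = WhisperForConditionalGeneration(student_config)

# Copy the full encoder, then map selected teacher decoder layers onto the
# student's decoder layers before distillation fine-tuning.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
layer_map = [0, teacher.config.decoder_layers - 1]  # keep first and last layers
for student_idx, teacher_idx in enumerate(layer_map):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )
```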
Thanks @sanchit-gandhi for the note of acknowledgment.