Multilingual transcription: how to fine-tune without specifying a language?
A very important use case for me is transcription (of, say, a subset of Indian languages plus English). openai/whisper-large preserves the spoken language through transcription.
I have been following the excellent tutorial: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb
and a few other similar approaches (e.g. https://wandb.ai/parambharat/whisper_finetuning/reports/Fine-tuning-Whisper-for-low-resource-Dravidian-languages--VmlldzozMTYyNTg0). One of my (newbie) observations is that the target language is specified during training, and the emitted tokens are always in this language (so, e.g., Kannada or Marathi audio would be transcribed in Hindi).
I tried changing fine_tune_whisper.ipynb by omitting language="Hindi" when the WhisperTokenizer and WhisperProcessor are initialized, but the demo inference still emits Hindi transcriptions.
So my question: how does one fine-tune using a dataset for a specific language (say Kannada) and still get transcriptions for other languages (say Hindi)?
Hey @sanjaymk908! What happens when we fine-tune Whisper for one language is that it becomes more biased towards that language. We also risk 'catastrophic forgetting' by fine-tuning on one language - the model might forget how to transcribe in other languages. If you want to preserve performance across lots of languages, it's best to use the pre-trained model.
If your fine-tuning language is similar to other languages of interest (e.g. fine-tune on Hindi, and care about Hindi and Urdu), then you'll probably see a benefit for both through fine-tuning. In this case, you can follow the fine-tuning tutorial and set the tokenizer language as required. At inference time, you should set the forced decoder ids to `None`, i.e. replace this line:
https://huggingface.co/spaces/whisper-event/whisper-demo/blob/5d4e526c32efcf0bdf726d84160c776d0374fd0b/app.py#L19
with:

```python
pipe.model.config.forced_decoder_ids = None
```

The model will then transcribe in the most likely language.
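To see why this works, here is a toy sketch of what `forced_decoder_ids` does during greedy decoding (illustrative only, not the actual transformers implementation; the dummy `next_token` function and the token strings are assumptions for the example). Whisper normally forces the language and task tokens at fixed early positions of the output; removing the constraint lets the model's own language prediction survive.

```python
# Toy sketch: forced_decoder_ids pins specific tokens at specific
# decoder positions, overriding whatever the model would predict.
# (Not the real transformers code -- an illustration of the mechanism.)

def greedy_decode(next_token, steps, forced_decoder_ids=None):
    """Decode `steps` tokens; at positions listed in forced_decoder_ids,
    override the model's prediction with the forced token."""
    forced = dict(forced_decoder_ids or [])
    tokens = []
    for position in range(1, steps + 1):  # positions are 1-indexed
        predicted = next_token(tokens)
        tokens.append(forced.get(position, predicted))
    return tokens

# Dummy "model": always detects Kannada first, then emits text tokens.
def next_token(tokens):
    return "<|kn|>" if not tokens else "text"

# Forced ids pin the language to Hindi, as after Hindi fine-tuning:
print(greedy_decode(next_token, 3, [(1, "<|hi|>"), (2, "<|transcribe|>")]))
# -> ['<|hi|>', '<|transcribe|>', 'text']

# With forced_decoder_ids=None, the model's own prediction is kept:
print(greedy_decode(next_token, 3, None))
# -> ['<|kn|>', 'text', 'text']
```

The real attribute lives on the model config, so `pipe.model.config.forced_decoder_ids = None` disables the override for every subsequent `generate` call.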
Hi @sanchit-gandhi! If I want to train it on the English language only, which parts of the code should I change from your Hindi model?
I followed your tutorial with my custom dataset, but I get too high a WER. There's something I'm missing, and I don't understand where I'm going wrong.
Hey @scanne! In this case, you should load the model from one of the English-only checkpoints (detailed in the intro of the blog post: https://huggingface.co/blog/fine-tune-whisper#introduction), e.g. `openai/whisper-small.en` instead of `openai/whisper-small`, and you should also omit both the language and task arguments from the tokenizer and processor, i.e.:
```python
from transformers import WhisperTokenizer

# English-only checkpoint: no language= or task= arguments
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")
```
and:
```python
from transformers import WhisperProcessor

# English-only checkpoint: no language= or task= arguments
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
```
Then just make sure your dataset is properly formatted (see https://huggingface.co/docs/datasets/audio_dataset). Otherwise, there are no further changes required to fine-tune on English (I've done this previously and it works well: https://huggingface.co/sanchit-gandhi/whisper-large-v2-ft-ls-960h/tree/main).
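As a rough illustration of the expected row format, here is a small sanity check you could run over your examples before training. This is a sketch under the assumption that each row carries a decoded `audio` dict (as produced by the datasets `Audio` feature) and a `sentence` transcript, as in the Common Voice rows from the tutorial; the helper name `check_example` is made up.

```python
# Sketch of a per-row format check for a Whisper fine-tuning dataset.
# Assumes rows shaped like those from datasets' Audio feature:
#   {"audio": {"array": [...], "sampling_rate": 16000}, "sentence": "..."}
# The helper name `check_example` is hypothetical, not a library API.

TARGET_SAMPLING_RATE = 16_000  # Whisper's feature extractor expects 16 kHz

def check_example(example):
    """Return a list of problems with one dataset row (empty if OK)."""
    problems = []
    audio = example.get("audio")
    if not isinstance(audio, dict) or "array" not in audio:
        problems.append("missing decoded audio ('audio' dict with 'array')")
    elif audio.get("sampling_rate") != TARGET_SAMPLING_RATE:
        problems.append(
            f"sampling rate {audio.get('sampling_rate')} != {TARGET_SAMPLING_RATE}; "
            "cast with: dataset.cast_column('audio', Audio(sampling_rate=16000))"
        )
    if not example.get("sentence"):
        problems.append("missing transcript in 'sentence'")
    return problems

good = {"audio": {"array": [0.0, 0.1], "sampling_rate": 16_000}, "sentence": "hello"}
bad = {"audio": {"array": [0.0], "sampling_rate": 48_000}, "sentence": ""}
print(check_example(good))       # []
print(check_example(bad))        # two problems: sampling rate and transcript
```

A wrong sampling rate is the most common cause of unexpectedly high WER here, since the feature extractor will happily consume audio at any rate you feed it.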