Performance drops and hallucination increases after fine-tuning Whisper-small
Hello,
I'm currently working on fine-tuning the Whisper-small model for my specific use case. My dataset consists of Hinglish (a mix of Hindi and English) audio samples paired with their corresponding English text, and I aim to generate outputs in English only, so I am fine-tuning the model for the translation task. The custom dataset comprises approximately 200 hours of audio clips, each lasting between 15 and 30 seconds.
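For context, my data preparation follows the blog's recipe: the audio is converted to log-mel input features and the English target text is tokenized as labels. A minimal sketch of that step (the column names `"audio"` and `"sentence"` are just placeholders for my dataset's columns):

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

# The feature extractor turns raw audio into log-mel spectrogram features;
# the tokenizer is configured so the labels carry the translate-to-English prompt.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Hindi", task="translate"
)

def prepare_dataset(batch):
    audio = batch["audio"]  # expects 16 kHz audio after casting with datasets' Audio feature
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Labels are the English translation text, not the Hinglish transcript.
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
```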
During training, I've noticed that both the training loss and validation loss decrease consistently, and the BLEU score on the validation set improves. However, when I run inference on the test set, the model's performance drops.
At checkpoint-3000 (3,000 steps), my model's performance was marginally better (around 1%) than the base Whisper-small model. Yet, after fine-tuning for additional steps, the model's performance on the test set declined. The model seemed to hallucinate heavily (repeating words and sentences), and overall performance decreased as the number of iterations increased.
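For reference, this is roughly how I run inference on a test clip from a given checkpoint (the checkpoint path is illustrative, and I'm using a public dummy audio sample here as a stand-in for one of my test files):

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Illustrative checkpoint path; swap in checkpoint-3000, checkpoint-18000, etc.
checkpoint = "./whisper-small-hinglish/checkpoint-3000"
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="translate"
)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
model.eval()

# Stand-in 16 kHz audio clip from a public dummy dataset, just to make the sketch runnable.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]

inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
)
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features, max_new_tokens=225)

# Repeated words/phrases in this decoded string are what I refer to as hallucination.
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```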
Checkpoint-18000 has the worst performance, because its outputs contain the most hallucination.
What could be the possible cause of this, and how can I improve it?
Following the blog by @sanchit-gandhi (https://huggingface.co/blog/fine-tune-whisper), I made some adjustments to the fine-tuning script (a rough sketch of these changes follows the list):
1) Set the language to "Hindi" and the task to "translate".
2) Set model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language="hi", task="translate").
3) Changed the evaluation metric to BLEU.
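Roughly, those changes look like this. This is only a minimal sketch based on the blog: the choice of `sacrebleu` for BLEU is mine, and it assumes `predict_with_generate=True` in the training arguments so that `pred.predictions` are token IDs:

```python
import evaluate
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# 1) Language and task for the tokenizer prompts.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="translate"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# 2) Force the <|hi|> and <|translate|> tokens at the start of generation.
model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language="hi", task="translate"
)

# 3) BLEU instead of WER (here via sacrebleu), since the targets are English translations.
bleu = evaluate.load("sacrebleu")

def compute_metrics(pred):
    label_ids = pred.label_ids
    # -100 marks ignored label positions; restore the pad token before decoding.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    result = bleu.compute(predictions=pred_str, references=[[ref] for ref in label_str])
    return {"bleu": result["score"]}
```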
Sounds like overfitting to me, but it could be anything. Check this out: "Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models" https://arxiv.org/abs/2205.10770
@alfonsofr Although it sounds like overfitting, I tried the earlier checkpoints at 500, 1000, 2000 steps, etc., and the results were all worse than the base model; only checkpoint-3000 showed a slight improvement (around 1%). Considering the amount of data I have (more than 200 hours), it's hard to accept that this is overfitting. My instinct could be wrong, though, as it's hard to figure out where things are breaking.
The major problem I am facing is hallucination, basically repetition of words/sentences, which is degrading the model's performance.
I'll definitely look at the paper you mentioned and see if it offers anything that can help me fine-tune the model.
I ran into a similar issue: the training loss decreases normally, while the WER on both the training and validation sets is nearly 90%, so this can't be overfitting.
Did you find the cause? I'm experiencing similar issues with whisper-small on the Common Voice dataset for Dutch.