Inference on fine-tuned whisper-large-v3 is not working, but works on the pre-trained models and on fine-tuned whisper-medium

#169
by ivabojic

Hello,

I'm using this function for inference:

import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


def eval(model_name, input_file):
    # Load the first 3 rows of the test set (one audio path per row).
    test_dataset = Dataset.from_pandas(pd.read_excel(input_file).head(3))

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # Load the pre-trained or fine-tuned Whisper checkpoint.
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_name, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_name)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )

    # Transcribe each audio file and print the result.
    for row in test_dataset:
        audio_path = row['Path']

        result = pipe(audio_path, generate_kwargs={"language": "english"})
        print(result["text"])

When I use:

  1. model_name ="openai/whisper-medium"

I get this output: "Okay, so what was your motivation to join the xxx study? Like, apart from what they told you that you should join?"

  1. model_name =".../fine_tune/wmau_none_P16/2_3200" # fine-tuned medium

I get this output: "okay so what was your motivation to join the athletes study like it apart from what they told you that you should join"

  1. model_name ="openai/whisper-large-v3"

I get this output: "Okay. So what was your motivation to join the xxx study? Like apart from what they told you that you should join?"

  1. model_name = ".../fine_tune/wl3au_none_P16/2_128000" # fine-tuned largev3

I get this output: "so what was your motivation to join the the the the the the"

This is the trend I observe across my entire test set: the output generated by the fine-tuned Whisper large-v3 is far too short. Even though each audio file is shorter than 30 seconds, the generated text is only a few words. The same code works fine for the fine-tuned medium model as well as for both pre-trained models. Any thoughts on why this is happening?
