How to fine tune the model

#6
by tahercoolguy - opened

Can we really fine-tune the model with our datasets?

Another tutorial on finetuning.

https://huggingface.co/blog/fine-tune-whisper

Can you please show an example of how to create a new tokenizer and then fine-tune Whisper with it?

Somebody please guide me.
I have a set of a thousand sentences. I want to use the existing Whisper model and just rescore it to fit my scenario, since the output will always be one of those sentences.
Is it possible to do this with Whisper?

Hey @EranML

My advice would be to follow this blog-post for fine-tuning the model: https://huggingface.co/blog/fine-tune-whisper

The Whisper model is pre-trained on 96 languages. This means that the pre-trained tokenizer already has a vast vocabulary encompassing many thousands of words! I would recommend that you leverage this pre-trained tokenizer directly rather than training a new one. Why? Because then we can also leverage all of the pre-trained Whisper weights directly! If we build a new tokenizer, we have to randomly initialise some of the Whisper weights to work with our new tokenizer, meaning we lose some of the knowledge from pre-training. If we use the pre-trained one, we can use all of the weights (and so all of the knowledge!). The Whisper model quickly learns which bit of the pre-trained tokenizer to use when fine-tuning.

So I’d recommend you keep the pre-trained tokenizer, and simply set the correct language when you instantiate the processor in this line: https://huggingface.co/blog/fine-tune-whisper#combine-to-create-a-whisperprocessor

Yes there’s a bit of redundancy in the tokenizer, but our overall performance should be better!

What language are you fine-tuning on? It's probably quite likely that all the characters you need are already in the pre-trained Whisper tokenizer!
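For reference, setting the language on the pre-trained processor is a one-liner (a minimal sketch assuming the small checkpoint and Hindi; swap in your own language):

from transformers import WhisperProcessor

# keep the pre-trained tokenizer, just tell it which language and task to use
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)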

Hey @kundanashish ! To clarify, you want to improve the Whisper model's performance on your set of 1000 sentences, but don't care about how it performs on any others? You can simply fine-tune it on these sentences using this blog-post: https://huggingface.co/blog/fine-tune-whisper

You might first need to convert your audio-text dataset into a HF dataset format: https://huggingface.co/docs/datasets/audio_dataset

Hi @sanchit-gandhi
Thanks for your response.
Yes, your understanding is correct.
Actually, I only have text. I want to keep the existing acoustic model as it is and only do fine-tuning at the language-model layer.

Hey @kundanashish ! Sorry for the late reply here. I would strongly advise against fine-tuning only the language model (decoder) of the Whisper model on text-only data. My worry here is that we will completely break the model and lose all its pre-trained capabilities if we do this.

Whisper is an encoder-decoder architecture. The encoder transforms the audio inputs into a set of hidden state representations, extracting important features from the speech audio. The decoder auto-regressively predicts text tokens, conditional on previously predicted tokens and the encoder hidden states (see https://huggingface.co/blog/encoder-decoder#encoder-decoder). If we omit the encoder hidden states, we completely change the functionality of the Whisper model: the decoder now only predicts tokens conditional on the previously predicted tokens, not the encoder hidden states. This will change the weights such that the model only uses the previous tokens and not the encoder hidden representations. Thus, the model goes from being purposed for speech recognition (speech to text) to causal language modelling (text to text). When we use this fine-tuned model at inference time, this time with the audio inputs, the weights will be messed up for speech recognition and the model will likely fail.

I would recommend either:

  1. Fine-tuning the model on audio-transcription pairs (i.e. get the audio for your text sentences and train on audio + text) according to the blog post
  2. Using the zero-shot model (no fine-tuning) to generate Whisper predictions. Take the prediction from the Whisper model, and find the sentence in your corpus of 1000 sentences that is most similar to this prediction. Use this nearest sentence as your output.
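For option 2, the matching step could be a simple fuzzy string match against your sentence list. Here's a rough sketch using only the Python standard library (candidate_sentences stands in for your list of 1000 sentences):

import difflib

def nearest_sentence(prediction, candidate_sentences):
    # return the candidate sentence most similar to the Whisper prediction
    matches = difflib.get_close_matches(prediction, candidate_sentences, n=1, cutoff=0.0)
    return matches[0] if matches else prediction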

@sanchit-gandhi in the blog post you've mentioned, what are some of the parameters you've played around with for getting best results? Further, any tips on how to go about tweaking said params?

Hey @jungledude23 !

In my experience, the most important three are:

  1. Batch size
  2. Learning rate
  3. Dropout

1. Batch size

One thing I've noticed a lot looking at training logs is noisy training loss curves. This generally gives noisy parameter updates, which can throw your model off and delay it reaching a local optimum. A noisy training loss can be combated by increasing your batch size. A larger batch size means more training samples per update, and is thus closer to a 'true' gradient update that you'd get using all the data at once. You can find recommended batch size configurations here https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#recommended-training-configurations

2. Learning rate

  • As a rule of thumb, a learning rate 40x smaller than the pre-training learning rate works well (see page 28 of the Whisper paper)
  • Monitor the training loss over the first 1000 steps. If it decays quickly and smoothly, you've got a good learning rate. If it bounces around and is very noisy, the learning rate is too high and you should reduce it by at least a factor of 10

3. Dropout

More details regarding learning rate: The learning rate is indeed a very important parameter to get good fine-tuning performance, and one that we have to experiment with to get right. My recommendation would be to monitor the training loss for the first 500-1000 training steps of your fine-tuning run to gauge whether you've set the learning rate appropriately. Each case is different, but I've tried to give a setting that works best for most!

In practice, using a lower learning rate for fine-tuning vs pre-training gives superior results. These are the observations that I made when fine-tuning the Whisper model for the ESB paper (https://arxiv.org/abs/2210.13352) and from my extensive testing for multilingual fine-tuning prior to the event. Generally, I found that a learning rate of 1e-5 worked well for the small and medium checkpoints across most languages. This is the highest learning rate that you can get away with without the gradient updates becoming noisy. Selecting a higher learning rate means that you perform larger parameter updates, and so should be able to push the parameters into a more optimal range faster. But if you go too high, you risk the gradient updates becoming unstable, giving a noisy training loss curve and noisy parameter updates. This is when you'll get worse performance.

I asked the Whisper author Jong Wook Kim about his suggestions for fine-tuning. His recommendation was to select a learning rate about 40x smaller than pre-training, and linearly decay it to 0 over the course of training. For the small checkpoint, this would be 5e-4 / 40 = 1.25e-5, near enough 1e-5! So my empirical observations align with his 🙂

You can use this as a rule of thumb for selecting the learning rate!
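As a concrete illustration, the relevant training arguments for the small checkpoint might look roughly like this (a sketch; the output directory is hypothetical and the batch size should be adjusted to your hardware):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # hypothetical output directory
    per_device_train_batch_size=64,          # larger batches give smoother loss curves
    learning_rate=1e-5,                      # ~40x smaller than pre-training
    lr_scheduler_type="linear",              # linearly decay to 0
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    evaluation_strategy="steps",
    predict_with_generate=True,
)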

Hi @sanchit-gandhi
Thanks for the response.
Can't I just fine-tune using text data, or is audio mandatory?

Please pardon me if I am sounding silly, I am a newbie in this field.

Hi @kundanashish ,

You must have both your X and y values if you wish to fine-tune on your specific task.

Maybe an easier example would be training an image classifier to classify images of cats and dogs.
You have the images (your X values) and their labels like "cat" or "dog" (your y values).

Now imagine you want to train this model without any images.

This is akin to trying to tune whisper without audio and just text.

Love the analogy @Kristopher ! 🙌 Indeed we need (text, audio) pairs for fine-tuning to work.

Have you considered option two from this list @kundanashish ? https://huggingface.co/spaces/openai/whisper/discussions/6#63c142a294b28327f0e6bebd

It could work using the pre-trained Whisper model to generate predictions for the transcriptions, and then picking the sentence in your set of 1000 sentences that is most similar to this prediction? What do you think?

Hi Sanchit, I have my mapping.csv that has audio, sentence -- The audio field is the path to the audio.
When I try to train following your https://huggingface.co/blog/fine-tune-whisper tutorial, I get the following:

The following columns in the training set don't have a corresponding argument in `WhisperForConditionalGeneration.forward` and have been ignored: audio, sentence. If audio, sentence are not expected by `WhisperForConditionalGeneration.forward`,  you can safely ignore this message.

In your tutorial, the first element of the dataset has more info:

{'audio': {'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/607848c7e74a89a3b5225c0fa5ffb9470e39b7f11112db614962076a847f3abf/cv-corpus-11.0-2022-09-21/hi/clips/common_voice_hi_25998259.mp3', 
           'array': array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 9.6724887e-07,
       1.5334779e-06, 1.0415988e-06], dtype=float32), 
           'sampling_rate': 48000},
 'sentence': 'खीर की मिठास पर गरमाई बिहार की सियासत, कुशवाहा ने दी सफाई'}

Is there a specific format for my mapping.csv? Thanks for the great work as usual!

Hey @asennoussi !

The warning message suggests to me that something is going wrong in the data pre-processing stage. We shouldn't have features like audio and sentence forwarded to our data collator.

Data pre-processing

Currently, our data pre-processing function looks as follows:

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

This assumes that our audio dataset has a column called audio which is the loaded audio array.

If we're working with a .csv file containing pairs of (path to audio, text), we first need to load the audio samples before pre-processing them with our processor class. We can do this in one of two ways:

  1. Load your dataset as a HF dataset
  2. Load each audio sample on the fly

For 1, you can follow this guide: https://huggingface.co/docs/datasets/audio_dataset#create-an-audio-dataset. Once you've created your audio dataset and pushed it to the Hub, you can simply load it using the load_dataset function and follow the blog post from start to finish!
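For example, here's a minimal sketch of loading a local mapping.csv as an audio dataset (assuming it has an audio column containing file paths and a sentence column):

from datasets import load_dataset, Audio

dataset = load_dataset("csv", data_files={"train": "mapping.csv"})

# cast the path column to an Audio feature so files are decoded and resampled on access
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

print(dataset["train"][0]["audio"])  # {'path': ..., 'array': ..., 'sampling_rate': 16000}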

For 2, we can make a few modifications to the prepare_dataset function to first load our audio from the path. We can do this with the librosa library:

import librosa  

def prepare_dataset(batch):
    # load the audio sample from its path with the specified sampling rate
    audio_array, sampling_rate = librosa.load(batch["audio_path"], sr=16000, mono=True)

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio_array, sampling_rate=sampling_rate).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

You'll just need to double-check that the function has the right feature names to match your dataset (I've used "audio_path" and "sentence" but you might need to change these)

Let me know if that helps with your question! Feel free to post a code snippet / Colab link to your code if you want to share what you're currently doing (this might make it easier to fully dissect what's going on)

Thanks for the thorough answer Sanchit!

@sanchit-gandhi I faced something weird.
I do have a large dataset: around 100GB of audio files that I split into snippets for training, so around 200 GBs of space.
Yesterday, the training stopped because of insufficient space.
When I looked around, I found a file under ~/.cache/huggingface/datasets/csv/default-1c16e2184a1fda8d/0.0.0/ that is 500+ GB in size.
Why does that happen? I'm just curious.
I'll train on my dataset little by little, but does deactivating caching help here? What's the downside in terms of performance?

Hey @asennoussi !

This file is likely the arrow file for your dataset, i.e. the cached file for the pre-processed version of your dataset (input features + labels). See https://huggingface.co/docs/datasets/about_cache#the-cache for details!

You can disable caching (see https://huggingface.co/docs/datasets/cache#enable-or-disable-caching). The pros here are that you save disk space, the cons are that you have to repeat any data pre-processing steps if you want to train on the same dataset in a second training run (essentially repeating the pre-processing that you performed before).

Alternatively, you can look into streaming mode to bypass any disk space constraints! See https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet for an explanation on how this works and https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event for streaming mode resources (such as fine-tuning scripts)
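As a quick sketch of both options (using the Common Voice dataset from the blog post as an example):

from datasets import disable_caching, load_dataset

# option 1: disable caching (you'll repeat the pre-processing on your next run)
disable_caching()

# option 2: stream the dataset so that the full dataset never needs to be stored on disk
common_voice_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True, use_auth_token=True
)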

Awesome! thanks a lot!
Please bear with me, now that I fine tuned my model, I have a new directory that has a bunch of checkpoints.
How do I use the newly fine-tuned model?

Hey @asennoussi !

You can load the model with pipeline and transcribe audio samples of up to arbitrary length. Just specify the path to your model directory (the output_dir you specified during training) and provide the path to an audio file:

import torch
from transformers import pipeline

MODEL_PATH = "PATH/TO/MODEL"
AUDIO_PATH = "PATH/TO/AUDIO"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# we override any special forced tokens for auto language detection - not necessary if you use transformers from main!
all_special_ids = pipe.tokenizer.all_special_ids
transcribe_token_id = all_special_ids[-5]
pipe.model.config.forced_decoder_ids = [[2, transcribe_token_id]]

# inference
out = pipe(audio)["text"]
print(out)

Hi @sanchit-gandhi ,
I hope all is well.
In the script for the eval_metric:

# evaluate with the 'normalised' WER
do_normalize_eval = True

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    if do_normalize_eval:
        pred_str = [normalizer(pred) for pred in pred_str]
        label_str = [normalizer(label) for label in label_str]
        # filtering step to only evaluate the samples that correspond to non-zero references:
        pred_str = [pred_str[i] for i in range(len(pred_str)) if len(label_str[i]) > 0]
        label_str = [label_str[i] for i in range(len(label_str)) if len(label_str[i]) > 0]
    
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer":wer}

Then when the trainer tries to save the model, I get

/usr/local/lib/python3.8/dist-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   2236             if not metric_to_check.startswith("eval_"):
   2237                 metric_to_check = f"eval_{metric_to_check}"
-> 2238             metric_value = metrics[metric_to_check]
   2239 
   2240             operator = np.greater if self.args.greater_is_better else np.less

KeyError: 'eval_wer'

Shouldn't compute_metrics return {"eval_wer": wer} instead of {"wer": wer}?

Hey @asennoussi ,

The function for compute_metrics looks good to me! It's likely that the error lies in the training args.

Could you make sure --metric_for_best_model="wer" \ in your training args? See https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#python-script

Here, we set our training args as:

python run_speech_recognition_seq2seq_streaming.py \
    --model_name_or_path="openai/whisper-small" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --dataset_config_name="es" \
    --language="spanish" \
    --train_split_name="train+validation" \
    --eval_split_name="test" \
    --model_index_name="Whisper Small Spanish" \
    --max_steps="5000" \
    --output_dir="./" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="32" \
    --logging_steps="25" \
    --learning_rate="1e-5" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --eval_steps="1000" \
    --save_strategy="steps" \
    --save_steps="1000" \
    --generation_max_length="225" \
    --length_column_name="input_length" \
    --max_duration_in_seconds="30" \
    --text_column_name="sentence" \
    --freeze_feature_encoder="False" \
    --report_to="tensorboard" \
    --metric_for_best_model="wer" \
    --greater_is_better="False" \
    --load_best_model_at_end \
    --gradient_checkpointing \
    --fp16 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --do_normalize_eval \
    --streaming \
    --use_auth_token \
    --push_to_hub

Where we have --metric_for_best_model="wer" \, which indicates that the "wer" metric is the metric to optimise our eval performance for, see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.metric_for_best_model for details
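If you're setting your training arguments in Python (e.g. in the blog post notebook) rather than via the CLI script, the equivalent would be something along these lines (a sketch; only the evaluation/metric-related arguments are shown):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    predict_with_generate=True,
    metric_for_best_model="wer",  # must match the key returned by compute_metrics
    greater_is_better=False,      # lower WER is better
    load_best_model_at_end=True,
)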

For a simple/stupid approach, for a language that is already supported, would you say the following is true:

  1. Take my audio samples for which I have GT transcriptions
  2. Run Whisper on them, get the generated text
  3. Compare the generated text with the GT transcripts and find those that mismatch
  4. Fine-tune Whisper ONLY on the audio samples and the GT transcripts for which a mismatch was found

So in short, fine-tune only on the corrected errors, not on the entire corpus? I imagine that would drastically reduce the fine-tuning time.

Is there any caveat to my approach?

(Note: I personally am interested in domain-specific fine-tuning, so there is a certain number of brand names, person names and domain-specific jargon that the model gets wrong to some extent, and I'm interested in fixing that)

Hey @twardoch - I think this is a cool idea, but if you have the full corpus available I would encourage you to fine-tune on that.

My worry is that if we only fine-tune on a subset of the corpus, we risk Whisper overfitting to these examples. The examples that Whisper initially makes errors on will be from a subset of the full distribution of data, so we risk overfitting Whisper to a small subset of the overall distribution.

  • Suppose we have five examples: A, B, C, D, E
  • Whisper initially makes errors on A, B, C, so we fine-tune it on these three examples
  • After training for several epochs, it should no longer make errors on examples A, B, C - great!
  • However, there's nothing stopping it from now making errors on examples D and E!

If we want to encourage Whisper to work on the full distribution of data, we should provide it training data drawn from the full distribution of data (i.e. all five training examples). Keeping examples where Whisper initially works ensures that Whisper continues to get these examples right after fine-tuning

Thanks! That's exactly what I imagined might happen, but wasn't sure if it would. You’re saying this is the risk, which I understand now. OK, fortunately my total corpus is not that huge overall.

What is the best/recommended approach for rapid prototyping? I'm already using the tiny models to run initial tests, but I have found that the amount of data seems to make no difference. I had naively expected that it would be much faster to fine-tune with 8hrs of data rather than 100hrs, and that this advantage would stack with the smaller base model. But it seems like the amount of fine-tuning data has no impact on expected training time. I've now made two attempts with tiny, one with 100hrs of data and one with 10hrs, and they both show the exact same expected completion time. What am I misunderstanding here? @sanchit-gandhi I wonder if you have some expert observations here.

Hey @None ! Could you share your training configuration (i.e. your training args)? My reckoning is that we're setting --max_steps=50000, which means that we'll train for 50k training steps no matter how much data we provide.

If you want to train based on the amount of data you have, you can remove --max_steps and set --num_train_epochs instead (see docs). If we do this, we'll train for a fixed number of epochs, so we'll scale our training time with the amount of data that we've got
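In the Python API, the difference looks like this (a sketch with illustrative values):

from transformers import Seq2SeqTrainingArguments

# fixed number of update steps: training time is independent of dataset size
training_args = Seq2SeqTrainingArguments(output_dir="./", max_steps=5000)

# fixed number of passes over the data: training time scales with dataset size
training_args = Seq2SeqTrainingArguments(output_dir="./", num_train_epochs=3)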

Is feature_extractor a librosa function or your own function?
Edit:
Found the function.
I am currently trying to train a model with a single custom audio file (without the HF hub). Does anyone have ideas on how to do this?
Thanks in advance

This comment has been hidden

How can we use the Hugging Face Whisper model to fine-tune for language detection?

Hey @johnwick999 ! You can check out this page for converting a custom dataset to HF datasets: https://huggingface.co/docs/datasets/audio_dataset#create-an-audio-dataset

Once you've done so, you'll be able to run the fine-tuning script exactly as is (just update the dataset id from mozilla-foundation/common_voice_11_0 to your dataset id).

Hey @Sibadatta !

Here's a code snippet for how you can use the pre-trained Whisper model for language detection:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers.models.whisper.tokenization_whisper import LANGUAGES

from datasets import load_dataset

model_id = "openai/whisper-tiny"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

bos_token_id = processor.tokenizer.all_special_ids[-106]
decoder_input_ids = torch.tensor([bos_token_id])

dataset = load_dataset("facebook/multilingual_librispeech", "dutch", split="validation", streaming=True)
sample = next(iter(dataset))["audio"]

input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

with torch.no_grad():
    logits = model.forward(input_features, decoder_input_ids=decoder_input_ids).logits

pred_ids = torch.argmax(logits, dim=-1)
lang_ids = processor.decode(pred_ids[0])

lang_ids = lang_ids.lstrip("<|").rstrip("|>")
language = LANGUAGES[lang_ids]

I've also created a space here: https://huggingface.co/spaces/sanchit-gandhi/whisper-language-id

To fine-tune for language detection, you can adapt the code snippet to compute a cross-entropy loss between the pred ids and the target ids
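As a rough sketch of what that loss computation could look like (re-using model, processor, input_features and decoder_input_ids from the snippet above; the Dutch language token here is an assumption for the Multilingual LibriSpeech sample):

import torch
import torch.nn.functional as F

# id of the language token the model should predict for this sample
target_lang_id = processor.tokenizer.convert_tokens_to_ids("<|nl|>")
target = torch.tensor([target_lang_id])

# forward pass with gradients enabled this time
logits = model(input_features, decoder_input_ids=decoder_input_ids).logits

# the logits at the first decoder position score the language token
loss = F.cross_entropy(logits[:, 0, :], target)
loss.backward()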

Thanks for the info.
I have been getting this error while loading the trained model. It says config.json was not found in the model folder. Did you encounter this issue?
[screenshot of the error omitted]
Thanks in advance

Scenario: let's say I have 4-5 languages that I want to keep in my final model, and I want the model to detect those 4-5 languages perfectly, so I fine-tune it with ASR data and language tokens for those languages. How can I perform multilingual and multitask fine-tuning while also fine-tuning its language-detection capability?

Hello! I'm new in the field and I wanted to ask: how to fine-tune Whisper for a low-resource language that is not included in the pre-trained model?

The language in question shares some similarities with Persian/Kurdish. I have several hours of speech data for this language, but don't understand what to do next.

Here is an example for finetuning.

https://colab.research.google.com/drive/1P4ClLkPmfsaKn2tBbRp0nVjGMRKR-EWz

Hi everyone!
I followed the tutorial from the link above.
I see that the fine-tuned model works well for ASR, but I want to use it for TTS (text to speech).
How can I fine-tune or adapt the tutorial for TTS? I would really appreciate any link or guide.
Many thanks!

Hey @Sibadatta ! I would modify the prepare_dataset function to set the tokeniser's language for each training example. For this, you just need to know the language for each sample of your dataset (which I've assumed is stored under the column language in your dataset):

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

+   language = batch["language"]  # assuming you have the language for each sample of your dataset
+   tokenizer.set_prefix_tokens(language=language)  # now switch the tokenizer language to the correct one

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

You can then fine-tune the model simultaneously on multiple languages and fine-tune all the required params. No other changes to your script are required!

Hey @NursNurs !
When you fine-tune it on a new language, Whisper does a pretty good job at leveraging its knowledge of the other 96 languages it’s pre-trained on. So you still probably only need 10s of hours of labelled audio data.

We tried two ways of setting it up for fine-tuning on new languages:

  1. Remove the language prediction task so that Whisper doesn’t get caught up with the fact it’s working on a new language and just focuses on transcribing text as accurately as possible (just set language=None in the tokenizer and processor)
  2. Keep the language prediction task and tell Whisper that the new language is the same as one of it’s current languages (e.g. if fine-tuning on Nepali, tell Whisper it’s actually predicting Hindi, since the two are linguistically most similar): our thinking here was that we’d be able to leverage Whisper’s knowledge of the most linguistically similar language to the new language that we were showing it (just set language=Hindi in the tokenizer and processor)

In the end, 1 & 2 gave very comparable performance, so Whisper figures out how to make use of its existing knowledge itself, so you can set language to either of the above two options

Hey @tupk ! Whisper is a model for speech-to-text, so we can't use it for text-to-speech unfortunately. I would advise that you check-out SpeechT5 for a model that can do both: https://huggingface.co/blog/speecht5

This comment has been hidden

Do we need to train the encoder as well while fine-tuning, or just the decoder part? @sanchit-gandhi

Super good question @Sibadatta - you can freeze the encoder if your audio domain matches that seen during pre-training. Then you only need to adapt the decoder to the target text format! We did this for the ESB paper and it worked very well: https://arxiv.org/abs/2210.13352 See page 22 for details.

You can freeze the encoder by passing --freeze_encoder=True, see https://github.com/huggingface/transformers/blob/01203475c9452af74ef8fe43c64203be0c959191/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py#L100

In Colab, you'll need to do:

model.freeze_encoder()

before you pass the model to the trainer.

After fine-tuning, how can we be sure that the WER is good enough and that we don't need to fine-tune further, specifically for Odia-language fine-tuning? @sanchit-gandhi

@sanchit-gandhi How can I make sure that sequential fine-tuning works (fine-tuning on language x followed by fine-tuning on language y)? I have tried fine-tuning on language 1 and then on language 2. On evaluation with test data, it seems that the WER for language 1 increased after fine-tuning its last checkpoint on language 2. Is this expected behaviour, or am I missing something here? How should one go about training multiple languages one after the other, and how does one fine-tune it for languages not supported by Whisper?

Hey @Ranjit , here you can use a held-out validation set to measure the performance of your fine-tuned model on unseen data. If your validation WER is less than a pre-defined threshold, you know that your model is 'good enough' and that you can use it with out any further fine-tuning. See https://huggingface.co/course/en/chapter5/3?fw=pt#creating-a-validation-set and https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets for details.
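If you don't already have a validation split, you can carve one out of your training data, e.g. (a minimal sketch, assuming the common_voice DatasetDict from the blog post):

# hold out 10% of the training data as a validation set
split = common_voice["train"].train_test_split(test_size=0.1, seed=42)
common_voice["train"] = split["train"]
common_voice["validation"] = split["test"]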

Hey @Jiltseb , in this case you will probably get better results fine-tuning on language 1 and language 2 at the same time. For this, you will need to switch the language code in your tokenizer depending on the language for each individual sample, so that the model learns to differentiate between language 1 and 2. We'll take the Whisper fine-tuning blog post as our starting point. Suppose your dataset has a column called "language" that says what the language is for each sample (e.g. "Hindi" or "French"), then we can update our prepare_dataset function as follows:

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # get the language of our text
    tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

It's pretty easy to add a new "language" column to your dataset. Suppose you load the Hindi version of common voice as:

from datasets import load_dataset, DatasetDict

common_voice_hi = DatasetDict()

common_voice_hi["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True)
common_voice_hi["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True)

You can then add the language column using dataset's add_column method:

for split in common_voice_hi:
    language_column = ["Hindi"] * len(common_voice_hi[split])
    common_voice_hi[split] = common_voice_hi[split].add_column("language", language_column)

Supposing we do the same for "French":

common_voice_fr = DatasetDict()

common_voice_fr["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation", use_auth_token=True)
common_voice_fr["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", use_auth_token=True)

for split in common_voice_fr:
    language_column = ["French"] * len(common_voice_fr[split])
    common_voice_fr[split] = common_voice_fr[split].add_column("language", language_column)

We can now combine our two datasets using concatenate_datasets:

from datasets import concatenate_datasets

common_voice_merged = DatasetDict()

for split in common_voice_hi:
    common_voice_merged[split] = concatenate_datasets([common_voice_hi[split], common_voice_fr[split]])

Voila! Now you can use this combined dataset for training on two languages at once.

Hi @sanchit-gandhi , Thank you for your input. I was more worried about the catastrophic forgetting and wanted to ensure the model kept the same performance for performant languages even after fine-tuning on others. PeFT with LoRA seems to be a better choice for this. How can we finetune whisper for audio classification tasks? Is there a blog/example notebook for this?

Hey @Jilt ! PeFT + LoRA is indeed a cheap way of fine-tuning the Whisper model, one that retains 99% of the original pre-trained params. The run_audio_classification.py script in transformers now supports the Whisper model. This is the script that I used to fine-tune the base model on the Common Language ID task: https://huggingface.co/sanchit-gandhi/whisper-base-ft-common-language-id/blob/main/run.sh
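For reference, a PEFT + LoRA set-up for Whisper is only a few lines (a minimal sketch; the hyperparameters here are illustrative rather than a recommendation):

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable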

sanchit-gandhi changed discussion status to closed
This comment has been hidden

Hey @asennoussi !

You can load the model with pipeline and transcribe audio samples of up to arbitrary length. Just specify the path to your model directory (the output_dir you specified during training) and provide the path to an audio file:

import torch
from transformers import pipeline

MODEL_PATH = "PATH/TO/MODEL"
AUDIO_PATH = "PATH/TO/AUDIO"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_PATH,
    chunk_length_s=30,
    device=device,
)

# we override any special forced tokens for auto language detection - not necessary if you use transformers from main!
all_special_ids = pipe.tokenizer.all_special_ids
transcribe_token_id = all_special_ids[-5]
pipe.model.config.forced_decoder_ids = [[2, transcribe_token_id]]

# inference
out = pipe(AUDIO_PATH)["text"]
print(out)

minor typo fixed

Hey, I am getting a tensor mismatch error. Is there a way to verify what's causing it, or can I skip the offending batches, as it is a large dataset?

Hi @sanchit-gandhi , in DataCollatorSpeechSeq2SeqWithPadding of https://huggingface.co/blog/fine-tune-whisper, there is a step:

#if bos token is appended in previous tokenization step, cut bos token here as it's append later anyways

Can you please point me to the code where the bos token is appended later? I tried to locate it but haven't found it yet.

Thank you :)

Hi, what is the range of token_ids that the generate() function can generate? I am trying to fine-tune Whisper to learn the speaker_id just before the start of the transcription using token_ids > 50363.

@averoo Thanks for your great script! I encountered a problem and wonder whether you have seen the same thing.

In my test, the saved checkpoint does not work well. I mean, before reloading the last checkpoint the validation loss is 1.7; if I do not restart the whole script but only reload the last checkpoint, the validation loss is still 1.7, as expected. BUT when I restart the script and load the last checkpoint, the validation loss is 6.4!

This is so weird, and I tried my best to find what is wrong. I checked all the parameter names in the checkpoint and in the model, and they are the same, except for one extra parameter, 'encoder.positional_embedding', in the checkpoint, which is also expected because it is a buffer. So this cannot be the problem.

My question is: why does the model have a much higher loss after restarting the script?

Hi @sanchit-gandhi !
Thank you very much for this valuable demonstration. However, I have been doing some tests and I don't see much difference between the results after finetuning by changing the language in the tokenizer and after finetuning without indicating the language. Both improve the performance of the base model. Does this make sense? Am I doing something wrong? I am finetuning on 6 languages and including one under-represented language (Galician).

On the other hand, I am noticing that the ability of the model to identify the language (LID) worsens noticeably after finetuning (monolingual or multilingual).
I comment on it in this GitHub post:
https://github.com/openai/whisper/discussions/1454

Is this to be expected? Is there a way to perform finetuning in both tasks?

Thanks!

This comment has been hidden

Hi @sanchit-gandhi !
I used your tutorial to fine-tune the Whisper model on a local dataset. Thank you very much, it was really helpful.
My issue is that when I map the prepare_dataset function over my data, it takes a really long time and my code crashes. I am training on a 20-hour dataset and my GPU has 16 GB of memory.

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)

Is there any way to make the prepare_dataset function work more efficiently?
I already downsampled my data to 16 kHz and all the audio files are less than 30 seconds; it is structured exactly like the HF dataset.
I wonder if maybe, for a local dataset, I should change the prepare_dataset function.
Could you please help me and explain which parts should be changed for a local dataset, and how?
sorry if my questions sound silly, I am really new in this field.

Hello mahnaz, I am working on the same model and had the same issue. Once I got that fixed, I ran into a new issue with the shape of my input file.

May I know if you have got it working yet?

Hi @pedramaa ,
Unfortunately, I have not found anything helpful yet. For me, it only works if I reduce my dataset size (only using 4 hours of data). I run the code on two GPUs (one 3090 and one 1080) in parallel. If you find anything helpful, please share it with me. Thank you so much.

I want to fine-tune Whisper for multiple languages (Chinese and Tagalog); this is my code:
The tokenizer is used in many places, but I only changed the prepare_dataset function. Will it work?

from dataclasses import dataclass
from typing import Any, List, Dict, Union
import re
import evaluate
import torch
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, \
    Seq2SeqTrainingArguments, Seq2SeqTrainer

model_name_or_path = 'openai/whisper-medium'
output_dir = "./whisper-medium-zh-tl"
data_dir = "./dataset"

processor = WhisperProcessor.from_pretrained(model_name_or_path, language=None, task="transcribe")

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    processor.tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe") 
    # encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}


if __name__ == "__main__":
    common_voice = load_dataset("audiofolder", data_dir=data_dir)
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
    common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=8)
    print(common_voice)
    train_samples = len(common_voice["train"])

    data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
    metric = evaluate.load("wer")
    model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path)
    model.config.forced_decoder_ids = None
    model.config.suppress_tokens = []

    training_args = Seq2SeqTrainingArguments(
      output_dir=output_dir,  # change to a repo name of your choice
      per_device_train_batch_size=8,
      gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
      num_train_epochs=5,
      learning_rate=1e-5,
      warmup_steps=500,
      gradient_checkpointing=True,
      fp16=True,
      evaluation_strategy="steps",
      per_device_eval_batch_size=8,
      predict_with_generate=True,
      generation_max_length=225,
      save_steps=train_samples*10,  
      eval_steps=1000,
      logging_steps=25,
      report_to=["tensorboard"],
      load_best_model_at_end=True,
      metric_for_best_model="wer",
      greater_is_better=False,
      push_to_hub=False
    )

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=common_voice["train"],
        eval_dataset=common_voice["test"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=processor.feature_extractor,
    )

    trainer.train()
    trainer.save_model(output_dir)

Hi mahnaz,

maybe this can be helpful for you: https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#streaming-mode

Hey @taohoang - the BOS token id is appended in the Whisper modelling code, at the point when we shift the labels right to get the decoder input ids: https://github.com/huggingface/transformers/blob/fd56f7f0813d412c3e0848cbd6f94a23de2c07b7/src/transformers/models/whisper/modeling_whisper.py#L65

Hey @faycel - the .generate method can output any of the tokens from the model's vocabulary. We first run a forward pass to get the logits over the entire vocabulary, and then sample from this distribution to predict our next token. So if you've expanded the vocabulary to > 50363 by expanding the dimensionality of the final embedding layer and also the tokeniser's vocab, then you can generate with no code changes required. See this thread for details: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad
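A rough sketch of what that vocabulary expansion could look like (the speaker tokens here are hypothetical examples, and processor/model are assumed to be the usual Whisper processor and model):

# add new tokens to the tokenizer and grow the model's embedding/output layer to match
new_tokens = ["<|speaker_1|>", "<|speaker_2|>"]
processor.tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(processor.tokenizer))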

Hey @andrespm - this is expected behaviour. When you fine-tune it on a new language, Whisper does a pretty good job at leveraging its knowledge of the other 96 languages it’s pre-trained on.

Pretty much all modern languages will be linguistically similar to at least one of the 96 languages Whisper already knows, so you’ll probably fall under this paradigm of cross-lingual knowledge representations

We tried two ways of setting it up for fine-tuning on new languages:

  1. Remove the language prediction task so that Whisper doesn’t get caught up with the fact it’s working on a new language and just focuses on transcribing text as accurately as possible
  2. Keep the language prediction task and tell Whisper that the new language is the same as one of it’s current languages (e.g. if fine-tuning on Nepali, tell Whisper it’s actually predicting Hindi, since the two are linguistically most similar): our thinking here was that we’d be able to leverage Whisper’s knowledge of the most linguistically similar language to the new language that we were showing it

In the end, 1 & 2 gave very comparable performance, so Whisper figures out how to make use of its existing knowledge itself

Regarding maintaining LID performance after fine-tuning, you can try two strategies to reduce catastrophic forgetting (the phenomenon where a model forgets what it learnt during a prior round of training):

  1. Include data from different languages during fine-tuning, but only count the loss from the language id token towards the overall loss (i.e. discard the transcriptions if you don't want to fine-tune on these other languages, and just train the model to predict the language token)
  2. Try fine-tuning using PEFT: we've seen that the model is far less likely to catastrophically forget after PEFT fine-tuning, since the base model weights are frozen. See Vaibhavs10/fast-whisper-finetuning for details

Overall, I think option 2 is the easier of the two here

Hey @mahnaz - if you're running into issues preparing the dataset, you can try tweaking the datasets parameters for the .map method. I would recommend using num_proc=1 to start with, since using more than this is probably crashing your system if you don't have the required CPUs
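i.e. in your map call, something like:

common_voice = common_voice.map(
    prepare_dataset,
    remove_columns=common_voice.column_names["train"],
    num_proc=1,  # start single-process; only increase if you have spare CPU cores
)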

Hey @LukeJacob2023 - indeed that should work! Looks like you've got your data in the right format for multi-language fine-tuning! How did you get on here? Did the fine-tuned model improve compared to the pre-trained one?

Thank you. I have successfully fine-tuned it by setting the language to None, because I was afraid the tokenizer might cause an error. The model works OK.

Hi @sanchit-gandhi ,
I am working on adding Wolof-language datasets to Mozilla Common Voice, where the language is not yet available.
Do you reckon it would be possible to build on top of your common-language-id model and add Wolof to it?
Or would you recommend training again from Whisper with all the Common Voice languages?

Hi @sanchit-gandhi ,

I'm reaching out to seek guidance on how to resolve this error. When attempting to use the trainer.push_to_hub(**kwargs) function to push a model to the Hugging Face model hub, I encounter an HTTPError followed by a BadRequestError. Here's a brief overview of the error messages:

[screenshot of the error messages omitted]

Hi @sanchit-gandhi , I have a question about changing the number of steps while training the Whisper model. Would doubling the number of steps give me any significant improvement? Also, is there a way to know how to select the right number of training steps?

Hi @sanchit-gandhi , how do I fine-tune for the 'translate' task rather than 'transcribe'?

Hi @sanchit-gandhi , I want to perform transfer learning only for identifying the language, and then further fine-tune it for its dialects. These dialects, or the language itself, might not be part of the existing model. How should I go about this? I'm confused.

Does anyone know why the WER does not decrease? I'm training the medium model on a low-resource language.

[training-curve screenshot omitted: medium--1e-5--32--16 - p3.png]

per_device_train_batch_size="32"
per_device_eval_batch_size="16"
learning_rate="1e-5"

Hi @sanchit-gandhi
We need your expert opinion on which model would be best suited for fine-tuning in our case: we are mostly interested in our English data (we have a dataset of only a few tens of GBs) and want the best accuracy possible on that.
Also, do you recommend using a distil-whisper model instead of the original Whisper for this?

Kindly let me know your views.
Thanks a lot

Hello @sanchit-gandhi san,

Thank you so much for providing such a detailed notebook for fine-tuning whisper.

I have a question regarding how to set up the LoraConfig so that target_modules only targets ["q_proj", "v_proj"] of the decoder stack.
It seems that both the encoder and the decoder use the same module names, so setting target_modules to ["q_proj", "v_proj"] creates LoRA layers for both the encoder and the decoder.
How can I target only the decoder's attention layers?

Original Whisper
OrderedDict([('model', WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(q_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)), ('proj_out', Linear(in_features=768, out_features=51865, bias=False))])

After get_peft_model
OrderedDict([('base_model',
LoraModel(
(model): WhisperForConditionalGeneration(
(model): WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 768)
(layers): ModuleList(
(0-11): 12 x WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 768, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 768)
(layers): ModuleList(
(0-11): 12 x WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear8bitLt(in_features=768, out_features=768, bias=False)
(v_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(q_proj): Linear8bitLt(
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=768, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=768, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(base_layer): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(out_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear8bitLt(in_features=768, out_features=3072, bias=True)
(fc2): Linear8bitLt(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(proj_out): Linear(in_features=768, out_features=51865, bias=False)
)
))])
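
Regarding the question above about restricting LoRA to the decoder: one possible approach, assuming a recent peft version in which a string target_modules is treated as a regex matched against the full module name (e.g. "model.decoder.layers.0.self_attn.q_proj"), is a pattern that only matches decoder modules. A minimal sketch, not a verified recipe:

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    # regex matched against full module names, so every encoder module is skipped;
    # it covers both decoder self-attention and cross-attention q/v projections;
    # add "self_attn" to the pattern if you only want the self-attention ones
    target_modules=r".*decoder.*(q_proj|v_proj)$",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()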

Hi @sanchit-gandhi , I am fine-tuning Whisper small on Hindi data. While fine-tuning, the validation WER is decreasing but the validation loss is increasing, so it seems to be overfitting. How can I solve this, i.e. what sort of regularization can I use? And is there any advice on the warmup-steps hyperparameter, i.e. a recommended value? Can anyone please help? Thanks in advance.
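
Not an authoritative answer, but a falling WER with a rising validation loss is not automatically a problem when metric_for_best_model="wer" and load_best_model_at_end=True, since the best checkpoint is chosen by WER. If you still want to regularise, two levers that are commonly tried are dropout on the model and weight decay in the trainer. A minimal sketch with purely illustrative values:

from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

# config override at load time (assumed to enable dropout inside the encoder/decoder layers)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    dropout=0.1,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    learning_rate=1e-5,
    weight_decay=0.01,   # mild regularisation via AdamW
    warmup_steps=500,    # the blog post uses 500 warmup steps for 4000 total steps
    max_steps=4000,
    # ... keep the remaining arguments from the blog post
)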

@sanchit-gandhi
I used your tutorial to fine-tune the Whisper model on a local dataset. Thank you very much, it was really helpful.
My issue is with this step:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
)

Actually, I am new to the field and do not understand what the output directory is.
Could I use a local directory on my system instead of "./whisper-small-hi"?
I also want to know how, and from where, I can use this fine-tuned model to test it.
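
In case it's useful: output_dir can be any local directory on your system; the trainer writes its checkpoints there, and you can reload the fine-tuned model from the same path for testing. A minimal sketch, where the paths and audio file are hypothetical, and it assumes you also ran processor.save_pretrained on the same directory so the tokenizer and feature-extractor files are present:

from transformers import pipeline

# output_dir can be a relative or absolute local path
asr = pipeline("automatic-speech-recognition", model="./whisper-small-hi")
print(asr("path/to/some_test_audio.wav"))  # hypothetical test file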

Hi @sanchit-gandhi , thank you for posting such a detailed article on fine-tuning Whisper with custom training data.
I have followed the article and was able to generate a model with my own training dataset.
I have a doubt: when I run trainer.train(), I can see that training starts and checkpoints are stored in a separate directory.

I have set pushing checkpoints to the Hugging Face Hub to false, as I want to save the model locally.

After this step I save the model to a local directory; that also works fine, but the checkpoints and the saved model end up in different directories.

My questions are:
1. When I load the trained model from a local directory, which directory path should I provide: the last checkpoint dir or the saved-model path?

2. The other issue is that the checkpoint dir does not contain files like vocab.json etc., so loading fails. As a workaround, I copied the files from the saved-model dir to the checkpoint dir.

Kindly help me with my queries.

Please find the training steps below:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-muttu-small-vv-trained",  # change to a repo name of your choice
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=500,
    gradient_checkpointing=True,
    fp16=False,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    use_cpu=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

# this trains the model and writes checkpoints
trainer.train()

# save the trained model and processor
trainer.save_model(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

Files stored under the checkpoint dir:
3815 May 12 06:21 generation_config.json
2260 May 12 06:21 config.json
966995080 May 12 06:21 model.safetensors
5240 May 12 06:21 training_args.bin
339 May 12 06:21 preprocessor_config.json
1064 May 12 06:21 scheduler.pt
13990 May 12 06:21 rng_state.pth
1925050668 May 12 06:21 optimizer.pt
4436 May 12 06:21 trainer_state.json
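
On the two questions above, a possible pattern (a sketch, based on what trainer.save_model and processor.save_pretrained write): load from the directory you saved to, since it contains both the weights and the tokenizer/feature-extractor files; the checkpoint-* folders only hold weights plus trainer state, which is why loading a processor from them fails.

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_dir = "./whisper-muttu-small-vv-trained"
model = WhisperForConditionalGeneration.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)

# If you specifically want the weights from one checkpoint, you can mix and match
# (the checkpoint path below is hypothetical):
# model = WhisperForConditionalGeneration.from_pretrained(f"{model_dir}/checkpoint-500")
# processor = WhisperProcessor.from_pretrained(model_dir)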

@sanchit-gandhi For Indic languages, I am looking to fine-tune the Whisper model on synthetically generated audio containing a mixture of numbers and domain-specific names, plus the public datasets available on the web. The reason is that the model makes a lot of mistakes on words that are homophones of each other, and especially on numbers, for example adding three zeros when the number "100" is spoken; on top of that, numbers are spoken differently in different languages. Training on a mixture of public datasets and synthetic data across different Indic languages should solve these issues, but my concern is: after fine-tuning, with or without PEFT, will the model still work with inference parameters like initial_prompt, repetition penalty, VAD filter, hotwords, etc. (which faster-whisper provides)?

Based on this guide https://huggingface.co/blog/fine-tune-whisper, I tried to fine-tune "small" and "large-v3" models.

  • The fine-tuned "small" model works normally.
  • But the fine-tuned "large-v3" model works poorly on non-English audio, such as Chinese audio files: it auto-translates Chinese to English even though I specified transcribing in Chinese, not translating.
    Have you faced this issue, and can you give me any advice? Thank you so much.
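
One thing worth checking (a sketch, assuming a recent transformers version): make the language and task explicit both in the processor used for fine-tuning and in the generation config at inference, so large-v3 cannot fall back to auto-detection or translation. The model path below is hypothetical.

from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_dir = "./whisper-large-v3-finetuned"  # hypothetical path to the fine-tuned model
processor = WhisperProcessor.from_pretrained(model_dir, language="Chinese", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_dir)

# pin the task/language so generation cannot silently switch to translation
model.generation_config.language = "chinese"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None  # clear any stale forced ids that would override the above

# they can also be passed explicitly at inference time, e.g.
# predicted_ids = model.generate(input_features, language="chinese", task="transcribe")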

Hey @Jiltseb , in this case you will probably get better results fine-tuning on language 1 and language 2 at the same time. For this, you will need to switch the language code in your tokenizer depending on the language for each individual sample, so that the model learns to differentiate between language 1 and 2. We'll take the Whisper fine-tuning blog post as our starting point. Suppose your dataset has a column called "language" that says what the language is for each sample (e.g. "Hindi" or "French"), then we can update our prepare_dataset function as follows:

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # get the language of our text
    tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

It's pretty easy to add a new "language" column to your dataset. Suppose you load the Hindi version of common voice as:

from datasets import load_dataset, DatasetDict

common_voice_hi = DatasetDict()

common_voice_hi["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation", use_auth_token=True)
common_voice_hi["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test", use_auth_token=True)

You can then add the language column using the dataset's add_column method:

for split in common_voice_hi:
    language_column = ["Hindi"] * len(common_voice_hi[split])
    common_voice_hi[split] = common_voice_hi[split].add_column("language", language_column)

Supposing we do the same for "French":

common_voice_fr = DatasetDict()

common_voice_fr["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="train+validation", use_auth_token=True)
common_voice_fr["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", use_auth_token=True)

for split in common_voice_fr:
    language_column = ["French"] * len(common_voice_fr[split])
    common_voice_fr[split] = common_voice_fr[split].add_column("language", language_column)

We can now combine our two datasets using concatenate_datasets:

from datasets import concatenate_datasets

common_voice_merged = DatasetDict()

for split in common_voice_hi:
    common_voice_merged[split] = concatenate_datasets([common_voice_hi[split], common_voice_fr[split]])

Voila! Now you can use this combined dataset for training on two languages at once.
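
From here the rest of the blog-post recipe applies unchanged; for example, the preprocessing step is simply run over the merged dataset (the same map call as in the blog post, just pointed at common_voice_merged):

common_voice_merged = common_voice_merged.map(
    prepare_dataset,
    remove_columns=common_voice_merged.column_names["train"],
    num_proc=2,
)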

@sanchit-gandhi First of all, thank you for your amazing blog on fine-tuning.
This answer is a great place to start on multi-language fine-tuning.
I was trying this out, and I wanted to ask how to train for both translation and transcription across 5 languages. Should I go sequentially, fine-tuning for transcription first and then for translation, or is there a better approach?
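
One possible approach, purely a sketch extending the per-sample language idea above and not something verified here: add a "task" column alongside "language" and set both prefix tokens per sample, so a single training run mixes transcription and translation examples across all 5 languages.

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # batch["task"] is assumed to be "transcribe" or "translate"; for translation samples,
    # batch["sentence"] should hold the English translation rather than the transcript
    tokenizer.set_prefix_tokens(language=batch["language"], task=batch["task"])
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch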
