Duplicate from sanchit-gandhi/distil-whisper-large-v3-de-kd
metadata
license: mit
datasets:
  - mozilla-foundation/common_voice_15_0
language:
  - de
library_name: transformers
base_model: openai/whisper-large-v3
model-index:
  - name: Distil-Whisper large-v3 De
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 15.0
          type: mozilla-foundation/common_voice_15_0
          args: 'Config: de'
        metrics:
          - type: wer
            value: 6.324
            name: Wer

Distil-Whisper large-v3 German

This model is a knowledge-distilled version of openai/whisper-large-v3, trained on the German subset of the Common Voice 15.0 dataset. It was trained with the Distil-Whisper training code on the knowledge-distillation objective, using the large-v3 model as the teacher.

It achieves the following WER results on the evaluation set:

  • Normalised WER: 6.324
  • Orthographic WER: 8.233
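The gap between the two figures comes from text normalisation: orthographic WER compares the raw transcripts, while normalised WER compares them after casing and punctuation differences have been removed. The sketch below illustrates the idea on a single sentence. The `normalise` function is a simplified stand-in written for this example (lowercasing and punctuation stripping only), not the normaliser used in the actual evaluation, and `word_error_rate` is a hypothetical helper, not part of the Distil-Whisper code.

```python
import re


def normalise(text: str) -> str:
    """Simplified stand-in normaliser: lowercase, strip punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # \w is Unicode-aware, so umlauts are kept
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


reference = "Guten Morgen, wie geht es Ihnen?"
hypothesis = "guten morgen wie geht es ihnen"

orthographic = word_error_rate(reference, hypothesis)
normalised = word_error_rate(normalise(reference), normalise(hypothesis))
print(orthographic, normalised)  # 50.0 0.0
```

Here three of the six reference words differ only in casing or punctuation, so the orthographic score penalises them while the normalised score does not.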

Full TensorBoard logs can be found under the Training Metrics tab, and the steps to reproduce the run are given in the Training procedure section.

Model description

We copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers from Whisper. All other decoder layers from Whisper are discarded. The model is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on the labelled Common Voice audio data. For more details, refer to the Distil-Whisper repository and paper.
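As a rough illustration of this objective, the sketch below computes the combined loss for a single token position in plain Python. It assumes the KL term is taken from the teacher distribution to the student distribution. The function names and the `kl_weight` and `ce_weight` parameters are hypothetical, and their default values are placeholders, since the weighting used for this run is not stated here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, label,
                      kl_weight=1.0, ce_weight=1.0):
    """Weighted sum of KL(teacher || student) and cross-entropy on the label.

    The weights are illustrative placeholders, not the values used in training.
    """
    student_log_probs = log_softmax(student_logits)
    teacher_log_probs = log_softmax(teacher_logits)
    # KL(teacher || student) = sum_i p_teacher(i) * (log p_teacher(i) - log p_student(i))
    kl = sum(math.exp(t) * (t - s)
             for t, s in zip(teacher_log_probs, student_log_probs))
    # Cross-entropy of the student against the ground-truth token id.
    ce = -student_log_probs[label]
    return kl_weight * kl + ce_weight * ce


# When the student matches the teacher exactly, the KL term vanishes
# and only the cross-entropy on the label remains.
print(distillation_loss([0.0, 0.0], [0.0, 0.0], label=0))  # log(2), about 0.693
```

In the real training script the same quantities would be computed over batches of token sequences on the GPU; the scalar version is only meant to make the two terms concrete.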

Training and evaluation data

The model was trained and evaluated on the German subset of the Common Voice 15.0 dataset.

Training procedure

To reproduce this training run, first clone and install Distil-Whisper according to the instructions in the Distil-Whisper repository.

Next, we can pick a name for our distilled model, e.g. distil-whisper-large-v3-de-kd. We can then run the following command to create a repository under this name:

huggingface-cli repo create distil-whisper-large-v3-de-kd

We can now see the model on the Hub, e.g. under https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd

Let's clone the repository so that we can place our training script and model weights inside:

git lfs install
git clone https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd

Note: Be sure to change the repo address to https://huggingface.co/<your-user-name>/<your-repo-name>

Next, copy the relevant training scripts from Distil-Whisper to the repository:

cd distil-whisper-large-v3-de-kd

cp ../distil-whisper/training/create_student_model.py .
cp ../distil-whisper/training/run_distillation.py .

The following command demonstrates how to initialise a student model from the Whisper large-v3 checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers 1 and 32 respectively, as these are the maximally spaced layers:

#!/usr/bin/env bash

python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 2 \
  --save_dir "./distil-large-v3-init"

The initialised model will be saved to the sub-directory distil-large-v3-init in our model repository, ready to be trained.
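To make the "maximally spaced" choice concrete, the following sketch picks evenly spaced teacher layers for a given student depth. The helper `maximally_spaced_layers` is hypothetical and written for this example; it is an assumption about how such a selection can be computed, not a copy of the logic in create_student_model.py.

```python
def maximally_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return 1-indexed teacher layers, evenly spread from the first to the last."""
    if not 2 <= num_student_layers <= num_teacher_layers:
        raise ValueError("need 2 <= num_student_layers <= num_teacher_layers")
    # Spacing between selected layers on a 0-indexed scale.
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) + 1 for i in range(num_student_layers)]


# 2 student decoder layers taken from a 32-layer teacher decoder.
print(maximally_spaced_layers(32, 2))  # [1, 32]
```

With two student layers the selection reduces to the first and last teacher layers, which matches the configuration used for this model.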

We can then train the model for a total of 50k steps on the German subset of the Common Voice 15 dataset by executing the following command. Note that we train directly on the text labels provided in the Common Voice dataset, rather than first pseudo-labelling the dataset as was done in the original Distil-Whisper paper:

#!/usr/bin/env bash

accelerate launch --mixed_precision=bf16 run_distillation.py \
  --model_name_or_path "./distil-large-v3-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "mozilla-foundation/common_voice_15_0" \
  --train_dataset_config_name "de" \
  --train_split_name "train" \
  --text_column_name "sentence" \
  --eval_dataset_name "mozilla-foundation/common_voice_15_0" \
  --eval_dataset_config_name "de" \
  --eval_split_name "validation" \
  --eval_text_column_name "sentence" \
  --eval_steps 5000 \
  --save_steps 5000 \
  --warmup_steps 500 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "linear" \
  --logging_steps 25 \
  --save_total_limit 1 \
  --max_steps 50000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --dataloader_num_workers 16 \
  --preprocessing_num_workers 16 \
  --ddp_timeout 7200 \
  --dtype "bfloat16" \
  --output_dir "./" \
  --use_pseudo_labels "false" \
  --condition_on_prev_probability "0.0" \
  --do_train \
  --do_eval \
  --gradient_checkpointing \
  --overwrite_output_dir \
  --predict_with_generate \
  --freeze_encoder \
  --streaming \
  --push_to_hub

On a single 80GB A100 GPU, training takes approximately 3.5 days (about 85 hours) and reaches a final normalised WER of 6.3%. TensorBoard logs can be found under the Training Metrics tab. Training for longer would likely improve the final WER further, since the model had not fully converged after 50k training steps.
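For a sense of scale, the arithmetic below derives the number of training samples seen and the average wall-clock time per step from the figures above. It assumes a single GPU and no gradient accumulation (neither is overridden in the command above), and the per-step time is an upper bound because the 85 hours also include the periodic evaluation runs.

```python
max_steps = 50_000
per_device_train_batch_size = 64
num_gpus = 1          # assumption: single-GPU run, as described above
training_hours = 85   # approximate wall-clock time reported above

# Total number of training samples processed over the run.
samples_seen = max_steps * per_device_train_batch_size * num_gpus

# Average wall-clock seconds per optimisation step (includes evaluation overhead).
seconds_per_step = training_hours * 3600 / max_steps

# Average training throughput in samples per second.
samples_per_second = samples_seen / (training_hours * 3600)

print(samples_seen)                  # 3200000
print(round(seconds_per_step, 2))    # 6.12
print(round(samples_per_second, 1))  # 10.5
```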

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-04
  • train_batch_size: 64
  • eval_batch_size: 64
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 50000
  • mixed_precision_training: Native AMP
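The learning-rate settings above can be read as a linear warmup followed by a linear decay. The sketch below assumes the standard linear schedule with warmup (a ramp from 0 to the peak rate over the warmup steps, then a straight-line decay to 0 at the final step). The function `linear_schedule_lr` is a hypothetical helper for illustration, not the scheduler object used by the training script.

```python
def linear_schedule_lr(step: int,
                       peak_lr: float = 1e-4,
                       warmup_steps: int = 500,
                       total_steps: int = 50_000) -> float:
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr at warmup_steps to 0 at total_steps.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


for step in (0, 250, 500, 25_250, 50_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 1e-4 is reached at step 500 and the rate has fallen to half of that by step 25,250, the midpoint of the decay phase.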

Training results

Tensorboard logs can be found under the tab Training Metrics.

Framework versions

  • Transformers 4.36.0.dev0
  • Pytorch 2.1.2+cu121
  • Datasets 2.14.7.dev0
  • Tokenizers 0.14.1