license: mit
datasets:
- mozilla-foundation/common_voice_15_0
language:
- de
library_name: transformers
base_model: openai/whisper-large-v3
model-index:
- name: Distil-Whisper large-v3 De
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 15.0
      type: mozilla-foundation/common_voice_15_0
      args: 'Config: de'
    metrics:
    - type: wer
      value: 6.324
      name: Wer
Distil-Whisper large-v3 German
This model is a knowledge-distilled version of openai/whisper-large-v3, trained on the German subset of the Common Voice 15.0 dataset. It was trained with the Distil-Whisper training code on the knowledge-distillation objective, using the large-v3 model as the teacher.
It achieves the following WER results on the evaluation set:
- Normalised WER: 6.324
- Orthographic WER: 8.233
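To illustrate why the two figures differ, here is a minimal sketch of a word error rate computation. Orthographic WER compares the raw text, so casing and punctuation count as errors, while normalised WER compares text after normalisation. The `normalise` function below is a toy assumption (lower-casing and punctuation removal only) and is not the normaliser used to produce the numbers above; the example sentences are invented.

```python
import re


def normalise(text: str) -> str:
    """Toy normaliser (an assumption): lower-case and strip punctuation."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # \w is Unicode-aware, so umlauts survive
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Guten Morgen, wie geht es Ihnen?"
hypothesis = "guten morgen wie geht es ihnen"

orthographic_wer = wer(reference, hypothesis)                     # 3 of 6 words differ
normalised_wer = wer(normalise(reference), normalise(hypothesis))  # identical after normalisation
print(f"orthographic: {orthographic_wer:.3f}, normalised: {normalised_wer:.3f}")
```

In this toy case the transcription is penalised only for casing and punctuation, which is why the orthographic figure is higher than the normalised one.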
Full Tensorboard logs can be found under the tab Training Metrics, and steps to reproduce the run are given in the Training procedure section below.
Model description
We copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers from Whisper. All other decoder layers from Whisper are discarded. The model is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on the labelled Common Voice audio data. For more details, refer to the Distil-Whisper repository and paper.
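The combined objective can be sketched for a single decoder position as follows. This is a minimal illustration in plain Python, not the Distil-Whisper implementation: the loss weights `ce_weight` and `kl_weight`, the example logits, and the direction of the KL term (teacher relative to student) are assumptions, and refinements such as temperature scaling or padding masks are omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(student_logits, teacher_logits, label,
                      ce_weight=1.0, kl_weight=1.0):
    """Weighted sum of cross-entropy on the label and KL(teacher || student).

    ce_weight and kl_weight are hypothetical values chosen for illustration.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Cross-entropy against the ground-truth token from the labelled data
    ce = -s_logp[label]
    # KL divergence from the teacher distribution to the student distribution
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return ce_weight * ce + kl_weight * kl, ce, kl


# Invented logits over a three-token vocabulary; the correct token has index 0
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [3.0, 0.0, -2.0]
total, ce, kl = distillation_loss(student_logits, teacher_logits, label=0)
print(f"CE = {ce:.4f}, KL = {kl:.4f}, total = {total:.4f}")
```

The KL term vanishes when the student reproduces the teacher's distribution exactly, while the cross-entropy term keeps the student anchored to the reference transcriptions.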
Training and evaluation data
The model was trained and evaluated on the German subset of the Common Voice 15.0 dataset.
Training procedure
To reproduce this training run, first clone and install Distil-Whisper according to the installation instructions in the Distil-Whisper repository.
Next, we can pick a name for our distilled model, e.g. distil-whisper-large-v3-de-kd. We can then run the following command to create a repository under this name:
huggingface-cli repo create distil-whisper-large-v3-de-kd
We can now see the model on the Hub, e.g. under https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
Let's clone the repository so that we can place our training script and model weights inside:
git lfs install
git clone https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd
Note: Be sure to change the repo address to https://huggingface.co/<your-user-name>/<your-repo-name>
Next, copy the relevant training scripts from Distil-Whisper to the repository:
cd distil-whisper-large-v3-de-kd
cp ../distil-whisper/training/create_student_model.py .
cp ../distil-whisper/training/run_distillation.py .
The following command demonstrates how to initialise a student model from the Whisper large-v3 checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers 1 and 32 respectively, as the maximally spaced layers:
#!/usr/bin/env bash
python create_student_model.py \
--teacher_checkpoint "openai/whisper-large-v3" \
--encoder_layers 32 \
--decoder_layers 2 \
--save_dir "./distil-large-v3-init"
The initialised model will be saved to the sub-directory distil-large-v3-init in our model repository, ready to be trained.
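The choice of teacher layers 1 and 32 as the "maximally spaced" pair can be sketched as picking evenly spaced layer indices that include both endpoints. The function below is a hypothetical helper written for illustration; it is not taken from create_student_model.py, which may select layers differently when more than two are requested.

```python
def maximally_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return 1-indexed teacher layers, evenly spaced and including both endpoints.

    Hypothetical helper for illustration only.
    """
    if not 2 <= num_student_layers <= num_teacher_layers:
        raise ValueError("need 2 <= num_student_layers <= num_teacher_layers")
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(1 + i * step) for i in range(num_student_layers)]


# The configuration described above: 32 teacher decoder layers, 2 student layers
selected = maximally_spaced_layers(32, 2)
print(selected)
```

For the 32-to-2 case this yields the first and last teacher layers, matching the initialisation described above.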
We can then train the model for a total of 50k steps on the German subset of the Common Voice 15 dataset by executing the following command. Note that we train directly on the text labels provided in the Common Voice dataset, rather than first pseudo-labelling the dataset as was done in the original Distil-Whisper paper:
#!/usr/bin/env bash
accelerate launch --mixed_precision=bf16 run_distillation.py \
--model_name_or_path "./distil-large-v3-init" \
--teacher_model_name_or_path "openai/whisper-large-v3" \
--train_dataset_name "mozilla-foundation/common_voice_15_0" \
--train_dataset_config_name "de" \
--train_split_name "train" \
--text_column_name "sentence" \
--eval_dataset_name "mozilla-foundation/common_voice_15_0" \
--eval_dataset_config_name "de" \
--eval_split_name "validation" \
--eval_text_column_name "sentence" \
--eval_steps 5000 \
--save_steps 5000 \
--warmup_steps 500 \
--learning_rate 1e-4 \
--lr_scheduler_type "linear" \
--logging_steps 25 \
--save_total_limit 1 \
--max_steps 50000 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--dataloader_num_workers 16 \
--preprocessing_num_workers 16 \
--ddp_timeout 7200 \
--dtype "bfloat16" \
--output_dir "./" \
--use_pseudo_labels "false" \
--condition_on_prev_probability "0.0" \
--do_train \
--do_eval \
--gradient_checkpointing \
--overwrite_output_dir \
--predict_with_generate \
--freeze_encoder \
--streaming \
--push_to_hub
On a single 80GB A100 GPU, training takes approximately 3.5 days (85 hours) and reaches a final normalised WER of 6.3%. Tensorboard logs can be found under the tab Training Metrics. Note that training for longer would likely improve the WER further, since the model had not fully converged after 50k training steps.
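As a rough sanity check on these figures, the per-step time and the number of training examples processed can be worked out from the reported values. The sketch assumes a single GPU and no gradient accumulation, so that each optimisation step consumes exactly one batch of 64 examples; the variable names are illustrative.

```python
# Values reported in this model card
training_hours = 85
max_steps = 50_000
per_device_train_batch_size = 64

# Assumption: one GPU, no gradient accumulation, so one batch per step
seconds_per_step = training_hours * 3600 / max_steps
examples_seen = max_steps * per_device_train_batch_size

print(f"{seconds_per_step:.2f} s per step, {examples_seen:,} examples processed")
```

This works out to roughly six seconds per optimisation step and 3.2 million training examples processed over the run.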
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 50000
- mixed_precision_training: Native AMP
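The learning-rate hyperparameters above describe a linear schedule with warmup. A minimal sketch of that schedule, assuming the rate rises linearly from zero to the peak over the warmup steps and then decays linearly to zero at the final training step (the behaviour of the standard linear schedule with warmup in Transformers), looks like this. The function name is hypothetical.

```python
def linear_schedule_lr(step: int, peak_lr: float = 1e-4,
                       warmup_steps: int = 500, total_steps: int = 50_000) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # Decay: ramp from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 25_250, 50_000):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

With these settings the peak rate of 1e-4 is reached at step 500, and half of the peak is reached again at step 25,250, midway through the decay phase.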
Training results
Tensorboard logs can be found under the tab Training Metrics.
Framework versions
- Transformers 4.36.0.dev0
- Pytorch 2.1.2+cu121
- Datasets 2.14.7.dev0
- Tokenizers 0.14.1