Whisper Large V2 zh-HK - Alvin
This model is a fine-tuned version of openai/whisper-large-v2 on the Common Voice 11.0 dataset. It was trained with PEFT LoRA on top of an INT8-quantized (bitsandbytes) base model and reaches a normalized CER of 7.77% on the Common Voice 11.0 zh-HK test set.
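LoRA keeps the base weights frozen (here, loaded in INT8) and trains only a small low-rank update for each adapted layer. The following is a minimal NumPy sketch of one adapted projection. It assumes Whisper large's hidden size of 1280 and a hypothetical rank `r=32` and `alpha=64`. This card does not state the rank, alpha, or target modules that were actually used.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen weight W plus a scaled low-rank update B @ A (LoRA)."""
    # W: (d_out, d_in) is frozen; A: (r, d_in) and B: (d_out, r) are the trainable factors
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

d_in = d_out = 1280   # Whisper large hidden size
r, alpha = 32, 64     # hypothetical LoRA hyperparameters (not given in this card)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)  # zero init: the adapted layer starts equal to the base layer

x = rng.standard_normal((4, d_in)).astype(np.float32)
y = lora_forward(x, W, A, B, alpha, r)

full_params = d_out * d_in        # parameters in the frozen projection
lora_params = r * (d_in + d_out)  # trainable parameters added by LoRA
print(y.shape, lora_params, full_params, lora_params / full_params)
```

With these assumed values, the adapter adds 81,920 trainable parameters against 1,638,400 frozen ones in that projection (5%). Because `B` starts at zero, training begins from the unmodified base model.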
To use the model, run the following code. Inference should need less than 4 GB of VRAM at a batch size of 1.
```python
from peft import PeftModel, PeftConfig
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

peft_model_id = "alvanlii/whisper-largev2-cantonese-peft-lora"
language = "chinese"  # assumption: Whisper large-v2 has no separate Cantonese language token
task = "transcribe"

# Load the base model in INT8, then attach the LoRA adapter weights on top of it
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)

pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

audio = "sample.wav"  # placeholder: an audio file path, or {"raw": np.ndarray, "sampling_rate": 16000}
text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids, "max_new_tokens": 255})["text"]
```
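The `audio` argument can be a file path or an in-memory waveform. The following is a small sketch of the in-memory route. It uses only the standard-library `wave` module and NumPy to read a 16-bit PCM WAV file into the `{"raw": ..., "sampling_rate": ...}` dict accepted by the transformers ASR pipeline. The helper name `load_wav_for_pipeline` is hypothetical, and the sketch assumes 16-bit PCM input.

```python
import wave
import numpy as np

def load_wav_for_pipeline(path):
    """Read a 16-bit PCM WAV file into the {"raw", "sampling_rate"} dict form
    accepted by the transformers ASR pipeline. Multi-channel audio is averaged to mono."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        n_channels = wf.getnchannels()
        sr = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    # Little-endian int16 samples scaled to float32 in [-1, 1)
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0
    if n_channels > 1:
        samples = samples.reshape(-1, n_channels).mean(axis=1)  # downmix interleaved channels
    return {"raw": samples, "sampling_rate": sr}
```

For example, `audio = load_wav_for_pipeline("sample.wav")` (hypothetical path) can be passed straight to `pipe(...)`. Whisper's feature extractor works on 16 kHz audio, so it is simplest to convert recordings to 16 kHz beforehand.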
Training and evaluation data
For training, three datasets were used:
- Common Voice 11.0 zh-HK (Cantonese) train set
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
- Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", arXiv:2201.02419. Link: https://arxiv.org/pdf/2201.02419.pdf
Training Hyperparameters
- learning_rate: 1e-3
- train_batch_size: 60 (on a single RTX 3090 GPU)
- eval_batch_size: 10
- gradient_accumulation_steps: 1
- total_train_batch_size: 60 (60 per device x 1 GPU x 1 accumulation step)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 12000
- augmentation: SpecAugment
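With the `linear` scheduler and 500 warmup steps, the learning rate ramps up to 1e-3 and then decays linearly to zero at step 12,000. The following is a small sketch of that shape. It assumes the standard linear-with-warmup formula used by the Hugging Face `Trainer`, since the card names the scheduler type but not its implementation. It also computes the effective batch size and the total number of samples seen from the values listed above.

```python
def linear_warmup_lr(step, base_lr=1e-3, warmup_steps=500, total_steps=12000):
    """Learning rate at `step`: linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Effective batch size and total samples seen, from the hyperparameters above
per_device_batch, n_gpus, grad_accum, steps = 60, 1, 1, 12000
total_train_batch_size = per_device_batch * n_gpus * grad_accum
samples_seen = total_train_batch_size * steps

for s in (0, 250, 500, 6250, 12000):
    print(s, linear_warmup_lr(s))
print(total_train_batch_size, samples_seen)
```

The peak rate is reached at step 500, and the rate is back to half the peak at step 6,250, the midpoint of the decay phase.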
Training Results
| Training Loss | Epoch | Step | Validation Loss | Normalized CER (fraction) |
|---|---|---|---|---|
| 0.8604 | 1.99 | 12000 | 0.2129 | 0.07766 |
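The table reports CER as a fraction (0.07766, i.e. 7.77%). CER is the character-level edit distance between the hypothesis and the reference, divided by the number of reference characters. The card does not define the normalization step. The sketch below therefore assumes a hypothetical one (NFKC, lowercasing, and dropping whitespace and punctuation) and computes a corpus-level CER.

```python
import string
import unicodedata

# Hypothetical normalization: the card does not say what "Normalized" CER removes.
PUNCT = set(string.punctuation) | set("，。！？、；：「」『』（）《》…")

def normalize(text):
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if not ch.isspace() and ch not in PUNCT)

def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions + deletions + insertions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def normalized_cer(refs, hyps):
    """Corpus-level CER: total character edits divided by total reference characters."""
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        edits += edit_distance(ref_n, hyp_n)
        total += len(ref_n)
    return edits / total

print(normalized_cer(["我哋今日去飲茶。"], ["我地今日去飲茶"]))
```

In the example, the normalized reference has 7 characters and one is substituted, giving a CER of 1/7. Character-level scoring is used because written Chinese has no spaces between words.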
Evaluation results
- Normalized CER on mozilla-foundation/common_voice_11_0 (zh-HK test set, self-reported): 7.766%