alvanlii/whisper-small-cantonese

@@ -14,29 +14,49 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: mozilla-foundation/common_voice_11_0 zh-HK
-      type: mozilla-foundation/common_voice_11_0
-      config: zh-HK
       split: test
-      args: zh-HK
     metrics:
     - name: Normalized CER
       type: cer
-      value: 10.11
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 # Whisper Small zh-HK - Alvin
-This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Common Voice 11.0 dataset. This version has a lower CER (by 1%) compared to the previous one.
 ## Training and evaluation data
-For training, three datasets were used:
-- Common Voice 11 Canto Train Set
 - CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
 - Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf
 ## Using the Model
 ```
 import librosa
@@ -77,27 +97,61 @@ pipe = pipeline(
 pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
 text = pipe(file)["text"]
 ```
 ## Training Hyperparameters
 - learning_rate: 5e-5
-- train_batch_size: 25 (on 2 GPUs)
 - eval_batch_size: 8
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 25x2x2=100
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 500
-- training_steps: 14000
-- mixed_precision_training: Native AMP
-- augmentation: SpecAugment
-## Training Results
-| Training Loss | Epoch | Step | Validation Loss | Normalized CER    |
-|:-------------:|:-----:|:----:|:---------------:|:------:|
-| 0.4610        | 0.55  | 2000 | 0.3106          | 13.08 |
-| 0.3441        | 1.11  | 4000 | 0.2875          | 11.79 |
-| 0.3466        | 1.66  | 6000 | 0.2820          | 11.44 |
-| 0.2539        | 2.22  | 8000 | 0.2777          | 10.59 |
-| 0.2312        | 2.77  | 10000 | 0.2822          | 10.60 |
-| 0.1639        | 3.32  | 12000 | 0.2859          | 10.17 |
-| 0.1569        | 3.88  | 14000 | 0.2866          | 10.11 |

       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
+      name: mozilla-foundation/common_voice_16_0 yue
+      type: mozilla-foundation/common_voice_16_0
+      config: yue
       split: test
+      args: yue
     metrics:
     - name: Normalized CER
       type: cer
+      value: 10.73
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 # Whisper Small zh-HK - Alvin
+This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Cantonese. It achieves a normalized CER of 10.73 on the Common Voice 16.0 yue test set.
 ## Training and evaluation data
+For training, the following datasets were used:
+|Name|# of Hours|
+|--|--|
+|Common Voice 16.0 zh-HK Train|138|
+|Common Voice 16.0 yue Train|85|
+|Cantonese-ASR|72|
+|CantoMap|23|
+|Pseudo-Labelled YouTube Data|438|
+|Total|756|
 - CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
 - Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf
+For evaluation, the Common Voice 16.0 yue test set is used.
+## Results
+- CER (lower is better): 0.1073
+  - down from 0.1581 in the previous version dated Jan 28, 2023
+- GPU Inference with Flash Attention (example below): 0.055s/sample
+  - Note: all GPU evaluations were run on an RTX 3090 GPU
+- GPU Inference: 0.308s/sample
+- CPU Inference: 2.57s/sample
+- GPU VRAM: ~1.5 GB
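The CER figures above are character error rates: the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The sketch below is an illustrative pure-Python version. The function names `edit_distance` and `cer` and the sample sentences are hypothetical, and the sketch does not reproduce whatever text normalization sits behind this card's "Normalized CER", which the card does not specify.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Made-up example sentences (not taken from any evaluation set)
refs = ["我哋今日去飲茶", "你食咗飯未"]
hyps = ["我地今日去飲茶", "你食咗飯未"]
print(f"CER: {cer(refs, hyps):.4f}")
```

A value such as 0.1073 from this kind of calculation corresponds to the 10.73 reported in the metadata when expressed as a percentage.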
 ## Using the Model
 ```
 import librosa
 pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
 text = pipe(file)["text"]
 ```
+## Model Speedup
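+Pass `attn_implementation="sdpa"` when loading the model to use PyTorch's scaled dot-product attention (SDPA), which can dispatch to Flash Attention kernels on supported GPUs.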
+```
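+import torch
+from transformers import AutoModelForSpeechSeq2Seq
+
+# Assumed setup: half precision on GPU, float32 otherwise
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    "alvanlii/whisper-small-cantonese",
+    torch_dtype=torch_dtype,
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",  # enables the faster attention path
+)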
+```
+Enabling this reduced the inference time per sample from 0.308s to 0.055s.
+## Speculative Decoding
+You can run a larger model as the main model and use `alvanlii/whisper-small-cantonese` as the assistant (draft) model to speed up inference with almost no loss in accuracy.
+```
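+import torch
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+
+# Assumed setup: GPU with half precision when available, otherwise CPU
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+# Main (larger) model
+model_id = "simonl0909/whisper-large-v2-cantonese"
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    model_id,
+    torch_dtype=torch_dtype,
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",
+)
+model.to(device)
+processor = AutoProcessor.from_pretrained(model_id)
+# Assistant (smaller) model that drafts tokens for the main model to verify
+assistant_model_id = "alvanlii/whisper-small-cantonese"
+assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    assistant_model_id,
+    torch_dtype=torch_dtype,
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",
+)
+assistant_model.to(device)
+...
+model.generate(**inputs, use_cache=True, assistant_model=assistant_model)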
+```
+On its own, `simonl0909/whisper-large-v2-cantonese` runs at 0.714s/sample with a CER of 7.65. \
+With speculative decoding using `alvanlii/whisper-small-cantonese` as the assistant, it runs at 0.137s/sample with a CER of 7.67, which is roughly 5x faster.
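To put the reported timings side by side, the short sketch below computes the speedup factors from the per-sample figures quoted in this card. The helper name `speedup` is hypothetical, and the numbers are copied from the card rather than re-measured.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline time to optimized time (higher means faster)."""
    return baseline_s / optimized_s


# Per-sample timings as reported in this model card (RTX 3090)
sdpa_speedup = speedup(0.308, 0.055)  # whisper-small, default attention vs. "sdpa"
spec_speedup = speedup(0.714, 0.137)  # large-v2 alone vs. with assistant model
cer_change = 7.67 - 7.65              # CER change from speculative decoding

print(f"SDPA speedup: {sdpa_speedup:.1f}x")
print(f"Speculative decoding speedup: {spec_speedup:.1f}x")
print(f"CER change: {cer_change:+.2f}")
```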
 ## Training Hyperparameters
 - learning_rate: 5e-5
+- train_batch_size: 25 (on a single RTX 3090 GPU)
 - eval_batch_size: 8
+- gradient_accumulation_steps: 4
+- total_train_batch_size: 25x4=100
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_steps: 500
+- training_steps: 15000
+- augmentation: None
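To illustrate how these settings combine, the following sketch computes the effective batch size and the learning rate at a few steps. It assumes the common shape of a linear schedule with warmup: a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0 at the final step. The function name `linear_lr` is hypothetical, and the exact values produced during the actual training run may differ.

```python
def linear_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=15000):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# train_batch_size x gradient_accumulation_steps on one GPU
effective_batch = 25 * 4

print(f"effective batch size: {effective_batch}")
for step in (0, 250, 500, 7750, 15000):
    print(f"step {step:>5}: lr = {linear_lr(step):.2e}")
```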