alvanlii committed on
Commit
c8c2a58
1 Parent(s): 3cd5b5f

Update README.md

Files changed (1)
  1. README.md +80 -26
README.md CHANGED
@@ -14,29 +14,49 @@ model-index:
  name: Automatic Speech Recognition
  type: automatic-speech-recognition
  dataset:
- name: mozilla-foundation/common_voice_11_0 zh-HK
- type: mozilla-foundation/common_voice_11_0
- config: zh-HK
  split: test
- args: zh-HK
  metrics:
  - name: Normalized CER
  type: cer
- value: 10.11
  ---
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
  # Whisper Small zh-HK - Alvin
 
- This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Common Voice 11.0 dataset. This version has a lower CER (by 1%) compared to the previous one.
 
  ## Training and evaluation data
- For training, three datasets were used:
- - Common Voice 11 Canto Train Set
  - CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
  - Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf
 
  ## Using the Model
  ```
  import librosa
@@ -77,27 +97,61 @@ pipe = pipeline(
  pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
  text = pipe(file)["text"]
  ```
  ## Training Hyperparameters
  - learning_rate: 5e-5
- - train_batch_size: 25 (on 2 GPUs)
  - eval_batch_size: 8
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 25x2x2=100
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 500
- - training_steps: 14000
- - mixed_precision_training: Native AMP
- - augmentation: SpecAugment
-
- ## Training Results
-
- | Training Loss | Epoch | Step | Validation Loss | Normalized CER |
- |:-------------:|:-----:|:----:|:---------------:|:--------------:|
- | 0.4610 | 0.55 | 2000 | 0.3106 | 13.08 |
- | 0.3441 | 1.11 | 4000 | 0.2875 | 11.79 |
- | 0.3466 | 1.66 | 6000 | 0.2820 | 11.44 |
- | 0.2539 | 2.22 | 8000 | 0.2777 | 10.59 |
- | 0.2312 | 2.77 | 10000 | 0.2822 | 10.60 |
- | 0.1639 | 3.32 | 12000 | 0.2859 | 10.17 |
- | 0.1569 | 3.88 | 14000 | 0.2866 | 10.11 |
 
  name: Automatic Speech Recognition
  type: automatic-speech-recognition
  dataset:
+ name: mozilla-foundation/common_voice_16_0 yue
+ type: mozilla-foundation/common_voice_16_0
+ config: yue
  split: test
+ args: yue
  metrics:
  - name: Normalized CER
  type: cer
+ value: 10.73
  ---
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->
 
  # Whisper Small zh-HK - Alvin
 
+ This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for Cantonese. It achieves a 10.73 CER on Common Voice 16.0.
 
  ## Training and evaluation data
+ For training, the following datasets were used:
+ |Name|# of Hours|
+ |--|--|
+ |Common Voice 16.0 zh-HK Train|138|
+ |Common Voice 16.0 yue Train|85|
+ |Cantonese-ASR|72|
+ |CantoMap|23|
+ |Pseudo-Labelled YouTube Data|438|
+ |Total|756|
+
  - CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
  - Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset", 2022. Link: https://arxiv.org/pdf/2201.02419.pdf
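
As a quick sanity check on the table above (a trivial sketch; hour counts are as quoted in the table):

```python
# Hours per source, as listed in the training-data table above
hours = {
    "Common Voice 16.0 zh-HK Train": 138,
    "Common Voice 16.0 yue Train": 85,
    "Cantonese-ASR": 72,
    "CantoMap": 23,
    "Pseudo-Labelled YouTube Data": 438,
}
print(sum(hours.values()))  # 756, matching the Total row
```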
 
+ For evaluation, the Common Voice 16.0 yue test set is used.
+
+ ## Results
+ - CER (lower is better): 0.1073
+   - down from 0.1581 in the previous version dated Jan 28, 2023
+ - GPU inference with Fast Attention (example below): 0.055 s/sample
+   - Note: all GPU evaluations are done on an RTX 3090
+ - GPU inference: 0.308 s/sample
+ - CPU inference: 2.57 s/sample
+ - GPU VRAM: ~1.5 GB
+
  ## Using the Model
  ```
  import librosa
 
  pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
  text = pipe(file)["text"]
  ```
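
To score transcriptions yourself, the normalized CER reported above can be approximated with a plain Levenshtein distance over characters. A minimal sketch in pure Python (the exact text normalization behind the reported numbers may differ):

```python
def cer(ref: str, hyp: str) -> float:
    # Character error rate: Levenshtein edit distance / reference length
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # edit distances for the current row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("今日天氣好好", "今日天氣真好"))  # one substitution over six characters
```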
+
+ ## Model Speedup
+ Just add `attn_implementation="sdpa"` for Flash Attention:
+ ```
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     "alvanlii/whisper-small-cantonese",
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     attn_implementation="sdpa",
+ )
+ ```
+ Using Flash Attention reduced the time taken per sample from 0.308 s to 0.055 s.
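
The per-sample latencies quoted here were measured by the author; a generic way to produce such numbers is to average wall-clock time over a set of inputs. A sketch with a hypothetical `transcribe` callable (not the author's benchmark script):

```python
import time

def seconds_per_sample(transcribe, samples, warmup=2):
    """Average wall-clock seconds per call, after a few warmup calls."""
    for s in samples[:warmup]:  # warmup: CUDA kernel compilation, caches, etc.
        transcribe(s)
    start = time.perf_counter()
    for s in samples:
        transcribe(s)
    return (time.perf_counter() - start) / len(samples)
```

For example, `seconds_per_sample(lambda f: pipe(f)["text"], files)` would time the pipeline shown above.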
+
+ ## Speculative Decoding
+ You can run a bigger model, using `alvanlii/whisper-small-cantonese` as the assistant model to speed up inference with basically no loss in accuracy:
+ ```
+ model_id = "simonl0909/whisper-large-v2-cantonese"
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id,
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     attn_implementation="sdpa",
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ assistant_model_id = "alvanlii/whisper-small-cantonese"
+ assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     assistant_model_id,
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     attn_implementation="sdpa",
+ )
+ assistant_model.to(device)
+ ...
+ model.generate(**inputs, use_cache=True, assistant_model=assistant_model)
+ ```
+ On its own, `simonl0909/whisper-large-v2-cantonese` runs at 0.714 s/sample for a CER of 7.65. \
+ With speculative decoding using `alvanlii/whisper-small-cantonese` as the assistant, it runs at 0.137 s/sample for a CER of 7.67, which is much faster.
+
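
For context, the quoted measurements work out to roughly a 5x speedup for a 0.02-point CER cost:

```python
# Per-sample latencies from the measurements above
baseline, assisted = 0.714, 0.137  # s/sample: large model alone vs. with assistant
print(f"{baseline / assisted:.1f}x faster")  # 5.2x faster
```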
  ## Training Hyperparameters
  - learning_rate: 5e-5
+ - train_batch_size: 25 (on one RTX 3090 GPU)
  - eval_batch_size: 8
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 25x4=100
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 500
+ - training_steps: 15000
+ - augmentation: None
+