---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
---
# Whisper-small-singlish-122k
This model is [openai/whisper-small](https://huggingface.co/openai/whisper-small) fine-tuned on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
The following results are reported on the evaluation set (43,788 samples):
- Loss: 0.171377
- WER: 9.69
## Model Details
### Model Description
- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
## Uses
The model is intended as an exploratory exercise towards a better ASR model for Singapore English (Singlish).
Audio used to test the model should ideally:
1. Involve local Singapore slang, dialect, names, terms, etc.
2. Be spoken with a Singaporean accent.
### Direct Use
To use the model in an application, load it with the `transformers` pipeline:
```python
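# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
# "sample.wav" is a placeholder path; point it at a short audio clip of your own
result = pipe("sample.wav")
print(result["text"])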
```
### Out-of-Scope Use
- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (the training audio was recorded in a controlled environment)
- Conversational speech (the model was not trained on conversations)
## Training Details
### Training Data
I used the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.
Specifically, I used **Part 2**, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.
For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.
### Training Procedure
Fine-tuning was occasionally interrupted to adjust the batch size and maximise GPU utilisation.
Based on previous training experience, I also stopped training early once `eval_loss` failed to decrease for two consecutive evaluation steps.
#### Training Hyperparameters
The following hyperparameters were used:
- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_step**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225
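As a rough sketch, the values above could be passed to `Seq2SeqTrainingArguments` from `transformers`. The card does not show the training script, so the mapping of names (for example `batch_size` to `per_device_train_batch_size` and `eval_step` to `eval_steps`) and the `build_training_args` helper are assumptions made for illustration.

```python
# Sketch only: maps the hyperparameters listed in this card onto
# `Seq2SeqTrainingArguments` field names. The name mapping is an assumption.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 128,  # listed as batch_size (single GPU)
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,                        # needs a CUDA-capable GPU
    "per_device_eval_batch_size": 32,    # listed as eval_batch_size
    "eval_steps": 500,                   # listed as eval_step
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}


def build_training_args(output_dir: str):
    """Hypothetical helper: build the arguments object for `output_dir`."""
    # Imported lazily so the dictionary above can be inspected even when
    # `transformers` is not installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

For `eval_steps` to take effect, step-based evaluation also has to be enabled on the arguments object; the name of that option differs between `transformers` releases, so it is left out here.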
#### Training Results
| Steps | Epoch | Train Loss | Eval Loss | WER (%) |
|:-----:|:--------:|:----------:|:---------:|:------------------:|
| 500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 |
| 1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 |
| 1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 |
| 2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 |
| 2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 |
| 3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 |
| 3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
| 4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
The checkpoint with the lowest evaluation loss (step 3000, eval loss 0.1714) is used as the final model.
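The early-stopping rule from the Training Procedure section and the checkpoint choice can be replayed on the evaluation losses in the table. This is a minimal sketch, not the training code: the `early_stop` function is hypothetical, and it assumes "does not decrease" means "does not improve on the best loss seen so far".

```python
from typing import List, Optional, Tuple


def early_stop(
    eval_losses: List[Tuple[int, float]], patience: int = 2
) -> Tuple[int, Optional[int]]:
    """Return (best_step, stop_step); stop_step is None if never triggered."""
    best_step, best_loss = eval_losses[0]
    bad_evals = 0  # consecutive evaluations without improvement
    for step, loss in eval_losses[1:]:
        if loss < best_loss:
            best_step, best_loss = step, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None


# (step, eval loss) pairs copied from the Training Results table
history = [
    (500, 0.3889), (1000, 0.2519), (1500, 0.2038), (2000, 0.1872),
    (2500, 0.1791), (3000, 0.1714), (3500, 0.1741), (4000, 0.1773),
]
best_step, stop_step = early_stop(history)
print(best_step, stop_step)  # 3000 4000
```

With a patience of two evaluations, the rule stops at step 4000 and selects the step-3000 checkpoint, which matches the last row of the table.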
### Testing Data, Factors & Metrics
#### Testing Data
To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which amounts to around 43k samples.
### Results
| Model | WER |
|:----------------------------:|:-----:|
| fine-tuned-122k-whisper-small| 9.69% |
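For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration for a single sentence pair; the `word_error_rate` function and the example sentences are made up for this purpose. The card does not state which library or text normalisation produced the reported figure, so this is not necessarily how 9.69% was computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("can lah no problem", "can la no problem"))  # 25.0
```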
#### Summary
The model is not perfect, but when the audio is spoken clearly it transcribes Singaporean terms and slang accurately.
### Compute Infrastructure
Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).
#### Hardware
- Single A6000 GPU
## Model Card Contact
[Low Wei Teck](mailto:[email protected])