---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
---
|
|
|
# Whisper-small-singlish-122k
|
|
|
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
|
|
|
The following results on the evaluation set (43,788 samples) are reported:
|
|
|
- Loss: 0.171377 |
|
- WER: 9.69 |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt) |
|
- **Model type:** automatic-speech-recognition |
|
- **License:** MIT |
|
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small) |
|
|
|
## Uses |
|
|
|
The model is intended as an exploration exercise toward developing a better ASR model for Singapore English (Singlish).
|
|
|
Recommended audio for testing should:

1. Involve local Singaporean slang, dialect, names, and terms.

2. Be spoken with a Singaporean accent.
|
|
|
### Direct Use |
|
|
|
To use the model in an application, you can make use of `transformers`: |
|
|
|
```python |
|
# Use a pipeline as a high-level helper |
|
from transformers import pipeline |
|
|
|
pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k") |
|
``` |
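
As a quick usage sketch, the pipeline accepts a path to an audio file and returns the transcription (the file name `sample.wav` below is a placeholder):

```python
# Transcribe a local audio file; the pipeline decodes and resamples
# the audio (via ffmpeg) to the 16 kHz input Whisper expects.
result = pipe("sample.wav")
print(result["text"])
```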
|
|
|
### Out-of-Scope Use |
|
|
|
- Long-form audio

- Broken Singlish (typically spoken by the older generation)

- Poor-quality audio (the training samples were recorded in a controlled environment)

- Conversational speech (the model is not trained on conversational data)
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
I made use of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.

Specifically, I used **Part 2** – a series of prompted read-speech recordings involving local named entities, slang, and dialect.
|
|
|
For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.
|
|
|
### Training Procedure |
|
|
|
The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation.

In addition, I ended training early when eval_loss did not decrease over two consecutive evaluation steps, in line with previous training experience.
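
This early-stop rule matches the behaviour of `transformers`' built-in `EarlyStoppingCallback`; a minimal sketch (passing it to the trainer via `callbacks=[...]` is assumed to happen elsewhere):

```python
from transformers import EarlyStoppingCallback

# Stop training if eval_loss fails to improve for two consecutive
# evaluation rounds.
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
```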
|
|
|
#### Training Hyperparameters |
|
|
|
The following hyperparameters are used (a configuration sketch in code follows the list):
|
|
|
- **batch_size**: 128 |
|
- **gradient_accumulation_steps**: 1 |
|
- **learning_rate**: 1e-5 |
|
- **warmup_steps**: 500 |
|
- **max_steps**: 5000 |
|
- **fp16**: true |
|
- **eval_batch_size**: 32 |
|
- **eval_step**: 500 |
|
- **max_grad_norm**: 1.0 |
|
- **generation_max_length**: 225 |
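
For reference, a minimal sketch of how these settings map onto `transformers`' `Seq2SeqTrainingArguments`; the `output_dir` and the best-model options are illustrative assumptions, not taken from the original run:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-singlish-122k",  # assumed path
    per_device_train_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=500,
    max_grad_norm=1.0,
    generation_max_length=225,
    predict_with_generate=True,      # compute WER on generated text
    load_best_model_at_end=True,     # keep the lowest-eval-loss checkpoint (assumption)
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```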
|
|
|
#### Training Results |
|
|
|
| Steps | Epoch | Train Loss | Eval Loss | WER (%) |
|
|:-----:|:--------:|:----------:|:---------:|:------------------:| |
|
| 500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 | |
|
| 1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 | |
|
| 1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 | |
|
| 2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 | |
|
| 2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 | |
|
| 3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 | |
|
| 3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 | |
|
| 4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 | |
|
|
|
The model with the lowest evaluation loss is used as the final checkpoint. |
|
|
|
## Evaluation

### Testing Data, Factors & Metrics
|
|
|
#### Testing Data |
|
|
|
To test the model, I made use of the last 100 transcripts in the corpus as a held-out test set, which amounts to 43,788 samples.
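
For reference, WER on the test set can be computed with the `evaluate` library; the two strings below are made-up placeholders, not corpus samples:

```python
import evaluate

# Word error rate = (substitutions + insertions + deletions) / reference words
wer_metric = evaluate.load("wer")
wer = wer_metric.compute(
    predictions=["wah the kopi here damn shiok"],
    references=["wah the kopi here damn shiok lah"],  # one deletion -> 1/7
)
print(f"WER: {100 * wer:.2f}%")  # ~14.29%
```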
|
|
|
### Results |
|
|
|
| Model | WER | |
|
|:----------------------------:|:-----:| |
|
| whisper-small-singlish-122k | 9.69% |
|
|
|
#### Summary |
|
|
|
The model is not perfect, but when audio is spoken clearly, it is able to transcribe Singaporean terms and slang accurately.
|
|
|
### Compute Infrastructure |
|
|
|
Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).
|
|
|
#### Hardware |
|
|
|
- Single A6000 GPU |
|
|
|
|
|
|
## Model Card Contact |
|
|
|
[Low Wei Teck](mailto:[email protected]) |