language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
Whisper-small-singlish-122k
This model is openai/whisper-small fine-tuned on a subset (122k samples) of the National Speech Corpus.
It achieves the following results on the evaluation set (43,788 samples):
- Loss: 0.171377
- WER: 9.69
Model Details
Model Description
- Developed by: jensenlwt
- Model type: automatic-speech-recognition
- License: MIT
- Finetuned from model: openai/whisper-small
Uses
The model is intended as an exploratory exercise towards a better ASR model for Singapore English (Singlish).
For testing, the recommended audio:
- Involves local Singapore slang, dialect, names, terms, etc.
- Is spoken with a Singaporean accent.
Direct Use
To use the model in an application, you can use the transformers library:
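# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
# Transcribe a local audio file; "sample.wav" is a placeholder path
result = pipe("sample.wav")
print(result["text"])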
Out-of-Scope Use
- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (the training samples were recorded in a controlled environment)
- Conversational speech (the model was not trained on conversations)
Training Details
Training Data
I used the National Speech Corpus for training. Specifically, I used Part 2, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.
For training, I used the first 300 transcripts in the corpus, which is around 122k samples from ~161 speakers.
Training Procedure
Fine-tuning was occasionally interrupted to adjust the batch size and maximise GPU utilisation. Based on previous training experience, I also end training early if eval_loss does not decrease for two consecutive evaluations.
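The stopping rule above can be sketched in a few lines. This is a minimal illustration rather than the actual training code: the function name `early_stop` and the reading of the rule as "two consecutive evaluations without a new best eval_loss" are assumptions. The losses are the Eval Loss values reported in the Training Results table.

```python
def early_stop(eval_losses, patience=2):
    """Return (stop_index, best_index) for a list of eval losses.

    stop_index is the evaluation at which training would stop, or None
    if the patience is never exhausted. The name is hypothetical.
    """
    best_loss = float("inf")
    best_index = None
    stale = 0  # evaluations since the last improvement
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_index, stale = loss, i, 0
        else:
            stale += 1
            if stale >= patience:
                return i, best_index
    return None, best_index


# Eval Loss column from the Training Results table, one entry per 500 steps.
eval_losses = [0.3889, 0.2519, 0.2038, 0.1872, 0.1791, 0.1714, 0.1741, 0.1773]
eval_every = 500  # eval_step from the hyperparameters

stop_index, best_index = early_stop(eval_losses)
stop_step = (stop_index + 1) * eval_every
best_step = (best_index + 1) * eval_every
print(stop_step, best_step)  # 4000 3000
```

Under this reading, training stops at step 4000 and the best checkpoint is the one from step 3000, which is consistent with the logged results.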
Training Hyperparameters
The following hyperparameters are used:
- batch_size: 128
- gradient_accumulation_steps: 1
- learning_rate: 1e-5
- warmup_steps: 500
- max_steps: 5000
- fp16: true
- eval_batch_size: 32
- eval_step: 500
- max_grad_norm: 1.0
- generation_max_length: 225
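For readers reproducing the run with the Hugging Face Trainer, the list above could be expressed as keyword arguments. The mapping to `Seq2SeqTrainingArguments` field names is an assumption, since the training script is not shown in this card, so the sketch keeps the values in a plain dictionary and only derives a few quantities from them.

```python
# Assumed mapping of the listed hyperparameters to Seq2SeqTrainingArguments
# field names; the original training script is not shown in this card.
training_kwargs = {
    "per_device_train_batch_size": 128,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 32,
    "eval_steps": 500,
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}

# Samples consumed per optimizer step on a single GPU.
effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
# Upper bound on samples seen if training ran for all max_steps.
max_samples_seen = effective_batch * training_kwargs["max_steps"]
# Fraction of the maximum schedule spent on learning-rate warmup.
warmup_fraction = training_kwargs["warmup_steps"] / training_kwargs["max_steps"]

print(effective_batch, max_samples_seen, warmup_fraction)  # 128 640000 0.1
```

With transformers installed, the dictionary could be unpacked into `Seq2SeqTrainingArguments` together with an output directory of your choice; step-based evaluation would also need to be enabled for `eval_steps` to take effect.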
Training Results
Steps | Epoch | Train Loss | Eval Loss | WER |
---|---|---|---|---|
500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 |
1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 |
1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 |
2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 |
2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 |
3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 |
3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
The checkpoint with the lowest evaluation loss (step 3000, eval loss 0.1714) is used as the final model.
Testing Data, Factors & Metrics
Testing Data
To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which is around 43k samples.
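Splitting by transcript rather than by individual sample keeps the test prompts unseen during training. Below is a minimal sketch of such a split, assuming transcripts are identified by sortable IDs. The function `split_by_transcript` and the toy IDs are hypothetical and do not reflect the corpus's actual file layout.

```python
def split_by_transcript(transcript_ids, n_train=300, n_test=100):
    """Use the first n_train transcripts for training and the last n_test for testing."""
    ordered = sorted(transcript_ids)
    if n_train + n_test > len(ordered):
        raise ValueError("train and test transcripts would overlap")
    return ordered[:n_train], ordered[-n_test:]


# Toy IDs for illustration only; the real corpus uses its own naming scheme.
toy_ids = [f"transcript_{i:04d}" for i in range(500)]
train_ids, test_ids = split_by_transcript(toy_ids)

print(len(train_ids), len(test_ids))  # 300 100
print(set(train_ids).isdisjoint(test_ids))  # True
```

The overlap check guards against requesting more transcripts than exist, which would silently leak test prompts into training.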
Results
Model | WER |
---|---|
fine-tuned-122k-whisper-small | 9.69% |
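For reference, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below implements that standard definition for a single utterance. It is not the evaluation code behind the figure above: this card does not state which library or text normalisation was used, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences for illustration, not taken from the corpus.
reference = "i want to eat chicken rice"
hypothesis = "i want eat chicken rise"
print(f"{100 * word_error_rate(reference, hypothesis):.2f}%")  # 33.33%
```

Here one deleted word ("to") and one substituted word ("rise" for "rice") give 2 errors over 6 reference words. A corpus-level score is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.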
Summary
The model is not perfect, but when the speech is clear it transcribes Singaporean terms and slang accurately.
Compute Infrastructure
Trained on a VM instance provisioned on jarvislabs.ai.
Hardware
- Single A6000 GPU