metadata

language:
  - en
license: mit
base_model: openai/whisper-small
tags:
  - generated_from_trainer
metrics:
  - wer
model-index:
  - name: whisper-small-singlish-122k
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: NSC
          type: NSC
        metrics:
          - name: WER
            type: WER
            value: 9.69

Whisper-small-singlish-122k.

This model is openai/whisper-small fine-tuned on a subset (122k samples) of the National Speech Corpus.

It achieves the following results on the evaluation set (43,788 samples):

Loss: 0.171377
WER: 9.69

Model Details

Model Description

Developed by: jensenlwt
Model type: automatic-speech-recognition
License: MIT
Finetuned from model: openai/whisper-small

Uses

The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).

The recommended audio for testing:

Involves local Singaporean slang, dialect, names, and terms.
Is spoken in a Singaporean accent.

Direct Use

To use the model in an application, load it through the transformers library:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")

Out-of-Scope Use

Long-form audio
Broken Singlish (typically spoken by the older generation)
Poor-quality audio (the training samples were recorded in a controlled environment)
Conversational speech (the model was not trained on conversations)
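
One way to keep long-form audio away from the model is to check clip length before transcribing. The sketch below is illustrative only: it assumes PCM WAV input and uses a 30-second cut-off, which is the window size Whisper models process at a time. The function names and the cut-off constant are hypothetical and not part of this model or of transformers.

```python
import wave

# Whisper models process audio in 30-second windows, so anything longer is
# treated here as long-form input (an assumption made for this sketch).
MAX_CLIP_SECONDS = 30.0


def clip_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds (stdlib only)."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def transcribe_short_clip(wav_path, asr_pipeline):
    """Transcribe a clip, refusing long-form audio.

    `asr_pipeline` is any callable that maps a file path to a dict with a
    "text" key, which is the shape the transformers ASR pipeline returns.
    """
    duration = clip_duration_seconds(wav_path)
    if duration > MAX_CLIP_SECONDS:
        raise ValueError(
            f"{wav_path} is {duration:.1f}s long; split it into clips of "
            f"{MAX_CLIP_SECONDS:.0f}s or less first"
        )
    return asr_pipeline(wav_path)["text"]
```

With the `pipe` object created under Direct Use, `transcribe_short_clip("clip.wav", pipe)` would return the transcript as a string; `clip.wav` is a placeholder file name.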

Training Details

Training Data

I used the National Speech Corpus for training. Specifically, I used Part 2, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.

For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.

Training Procedure

The model is fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation. In addition, based on previous training experience, I end training early if eval_loss does not decrease for two evaluation steps.
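
The early-stopping rule can be sketched in plain Python. This is a minimal illustration, not the training code used for this model: it reads "does not decrease in two evaluation steps" as two consecutive evaluations that fail to beat the best eval loss so far, and the function name and `patience` parameter are hypothetical.

```python
def early_stop_index(eval_losses, patience=2):
    """Return the index of the evaluation at which training would stop.

    Training stops once `patience` consecutive evaluations fail to improve on
    the best eval loss seen so far. Returns None if that never happens.
    """
    best_loss = float("inf")
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return index
    return None


# Eval losses from the Training Results table, one per 500 steps.
eval_losses = [0.3889, 0.2519, 0.2038, 0.1872, 0.1791, 0.1714, 0.1741, 0.1773]
stop_index = early_stop_index(eval_losses, patience=2)
stop_step = (stop_index + 1) * 500
print(stop_step)  # 4000
```

Under this reading the rule stops at step 4000, which is where the Training Results table ends. The transformers library provides an `EarlyStoppingCallback` with an `early_stopping_patience` argument that behaves similarly, though this card does not say whether it was used.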

Training Hyperparameters

The following hyperparameters are used:

batch_size: 128
gradient_accumulation_steps: 1
learning_rate: 1e-5
warmup_steps: 500
max_steps: 5000
fp16: true
eval_batch_size: 32
eval_step: 500
max_grad_norm: 1.0
generation_max_length: 225
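
For readers reproducing the setup with the transformers Trainer, the values above could be passed roughly as follows. This is a sketch under assumptions: the mapping from the names in this card to `Seq2SeqTrainingArguments` arguments is a best guess (for example `batch_size` to `per_device_train_batch_size`), and `build_training_args` and its `output_dir` argument are hypothetical.

```python
# Hyperparameters as listed in this card, keyed by the Seq2SeqTrainingArguments
# argument each one most plausibly corresponds to (an assumption).
training_kwargs = {
    "per_device_train_batch_size": 128,  # batch_size
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 32,  # eval_batch_size
    "eval_steps": 500,  # eval_step
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}

# On a single GPU, each optimizer step consumes this many samples.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)


def build_training_args(output_dir):
    """Build the arguments object; needs transformers and, for fp16, a GPU."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **training_kwargs)
```

Note that `eval_steps` only takes effect when step-based evaluation is enabled, and the name of the argument that enables it depends on the installed transformers version.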

Training Results

Steps	Epoch	Train Loss	Eval Loss	WER
500	0.654450	0.7418	0.3889	17.968250
1000	1.308901	0.2831	0.2519	11.880948
1500	1.963351	0.1960	0.2038	9.948440
2000	2.617801	0.1236	0.1872	9.420248
2500	3.272251	0.0970	0.1791	8.539280
3000	3.926702	0.0728	0.1714	8.207827
3500	4.581152	0.0484	0.1741	8.145801
4000	5.235602	0.0401	0.1773	8.138047

The model with the lowest evaluation loss is used as the final checkpoint.
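
The selection rule can be applied directly to the table above. The snippet is only an illustration of that rule; the variable names are made up for the example.

```python
# Rows copied from the Training Results table:
# (step, epoch, train_loss, eval_loss, wer)
training_log = [
    (500, 0.654450, 0.7418, 0.3889, 17.968250),
    (1000, 1.308901, 0.2831, 0.2519, 11.880948),
    (1500, 1.963351, 0.1960, 0.2038, 9.948440),
    (2000, 2.617801, 0.1236, 0.1872, 9.420248),
    (2500, 3.272251, 0.0970, 0.1791, 8.539280),
    (3000, 3.926702, 0.0728, 0.1714, 8.207827),
    (3500, 4.581152, 0.0484, 0.1741, 8.145801),
    (4000, 5.235602, 0.0401, 0.1773, 8.138047),
]

# Pick the row with the lowest eval loss (column index 3).
best_step, _, _, best_eval_loss, best_wer = min(
    training_log, key=lambda row: row[3]
)
print(best_step, best_eval_loss, best_wer)  # 3000 0.1714 8.207827
```

This picks step 3000. The WER keeps falling slightly at steps 3500 and 4000 while the eval loss rises, so selecting by WER instead would choose a different checkpoint.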

Testing Data, Factors & Metrics

Testing Data

To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which amounts to around 43k samples.

Results

Model	WER
fine-tuned-122k-whisper-small	9.69%
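
For reference, WER is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the evaluation script behind the figure above; the function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Illustrative sentences only: one substituted word out of six.
wer = word_error_rate(
    "can you tapao the chicken rice",
    "can you tapau the chicken rice",
)
print(f"{100 * wer:.2f}%")  # 16.67%
```

Reported WER also depends on text normalisation such as casing and punctuation handling, which this card does not describe.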

Summary

The model is not perfect, but when the audio is spoken clearly, it transcribes Singaporean terms and slang accurately.

Compute Infrastructure

Trained on a VM instance provisioned on jarvislabs.ai.

Hardware

Single A6000 GPU

Model Card Contact

Low Wei Teck