---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
---

# Whisper-small-singlish-122k

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).

The following results on the evaluation set (43,788 samples) are reported:

- Loss: 0.171377
- WER: 9.69

## Model Details

### Model Description

- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)

## Uses

The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).

The recommended audio for testing should:

1. Involve local Singaporean slang, dialect, names, terms, etc.
2. Be spoken with a Singaporean accent.

### Direct Use

To use the model in an application, you can make use of `transformers`:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
```

### Out-of-Scope Use

- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (training samples were recorded in a controlled environment)
- Conversational speech (the model is not trained on conversation)

## Training Details

### Training Data

I made use of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training. Specifically, I used **Part 2** – a series of prompted read-speech recordings that involve local named entities, slang, and dialect.

For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.

### Training Procedure

The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation. In addition, training was ended early if the evaluation loss did not decrease over two consecutive evaluation steps, based on previous training experience.

#### Training Hyperparameters

The following hyperparameters are used:

- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_step**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225

#### Training Results

| Steps | Epoch    | Train Loss | Eval Loss | WER       |
|:-----:|:--------:|:----------:|:---------:|:---------:|
| 500   | 0.654450 | 0.7418     | 0.3889    | 17.968250 |
| 1000  | 1.308901 | 0.2831     | 0.2519    | 11.880948 |
| 1500  | 1.963351 | 0.1960     | 0.2038    | 9.948440  |
| 2000  | 2.617801 | 0.1236     | 0.1872    | 9.420248  |
| 2500  | 3.272251 | 0.0970     | 0.1791    | 8.539280  |
| 3000  | 3.926702 | 0.0728     | 0.1714    | 8.207827  |
| 3500  | 4.581152 | 0.0484     | 0.1741    | 8.145801  |
| 4000  | 5.235602 | 0.0401     | 0.1773    | 8.138047  |

The checkpoint with the lowest evaluation loss is used as the final model.

### Testing Data, Factors & Metrics

#### Testing Data

To test the model, I used the last 100 transcripts (held-out test set) in the corpus, which is around 43k samples.
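As a minimal sketch of how WER can be computed on a handful of held-out samples (this is not the exact evaluation script; the audio paths and reference transcripts below are placeholders), the `evaluate` library can be used together with the pipeline shown above:

```python
# Sketch only: compute WER on a few held-out samples with the `evaluate` library.
# Audio paths and reference transcripts are placeholders, not NSC files.
from transformers import pipeline
import evaluate

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
wer_metric = evaluate.load("wer")

audio_files = ["sample_0001.wav", "sample_0002.wav"]                    # placeholder paths
references = ["reference transcript one", "reference transcript two"]   # placeholder references

predictions = [pipe(path)["text"] for path in audio_files]
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```

Note that reported WER figures typically apply text normalisation (e.g. lowercasing and punctuation removal) to both predictions and references before scoring.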
### Results

| Model                          | WER   |
|:------------------------------:|:-----:|
| fine-tuned-122k-whisper-small  | 9.69% |

#### Summary

The model is not perfect, but when the audio is spoken clearly, it is able to transcribe Singaporean terms and slang accurately.

### Compute Infrastructure

Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).

#### Hardware

- Single A6000 GPU

## Model Card Authors

[More Information Needed]

## Model Card Contact

[Low Wei Teck](mailto:jensenlwt@gmail.com)