---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
---
# Whisper-small-singlish-122k
This model is a version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) fine-tuned on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
It achieves the following results on the evaluation set (43,788 samples):
- Loss: 0.171377
- WER: 9.69
## Model Details
### Model Description
- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
## Uses
The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).
Audio used for testing should ideally:
1. Involve local Singapore slang, dialect, names, terms, etc.
2. Be spoken with a Singaporean accent.
### Direct Use
To use the model in an application, you can use `transformers`:
```python
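# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")

# Transcribe a local audio file ("sample.wav" is a placeholder path)
result = pipe("sample.wav")
print(result["text"])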
```
### Out-of-Scope Use
- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (the training samples were recorded in a controlled environment)
- Conversational speech (the model was not trained on conversations)
## Training Details
### Training Data
I used the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.
Specifically, I used **Part 2**, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.
For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.
### Training Procedure
The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation.
Based on previous training experience, I also stopped training early if `eval_loss` did not decrease for two consecutive evaluations.
#### Training Hyperparameters
The following hyperparameters are used:
- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_step**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225
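
For readers reproducing the run with the `transformers` `Seq2SeqTrainer`, the list above maps roughly onto the keyword arguments below. This is a sketch, not the original training script: the key names follow `Seq2SeqTrainingArguments` fields, treating `batch_size` as the per-device size is an assumption (plausible for a single GPU), and `effective_batch_size` is a helper added here purely for illustration.

```python
# Hyperparameters from the list above, keyed by the Seq2SeqTrainingArguments
# field names they most plausibly correspond to (an assumption; the original
# training script is not shown in this card).
training_kwargs = {
    "per_device_train_batch_size": 128,  # batch_size
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,
    "fp16": True,
    "per_device_eval_batch_size": 32,    # eval_batch_size
    "eval_steps": 500,                   # eval_step
    "max_grad_norm": 1.0,
    "generation_max_length": 225,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step (hypothetical helper)."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


# With one GPU and no gradient accumulation, each step sees 128 samples.
print(effective_batch_size(training_kwargs))
```

The dictionary could then be unpacked into `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)` with an output directory of your choosing.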
#### Training Results
| Steps | Epoch | Train Loss | Eval Loss | WER |
|:-----:|:--------:|:----------:|:---------:|:------------------:|
| 500 | 0.654450 | 0.7418 | 0.3889 | 17.968250 |
| 1000 | 1.308901 | 0.2831 | 0.2519 | 11.880948 |
| 1500 | 1.963351 | 0.1960 | 0.2038 | 9.948440 |
| 2000 | 2.617801 | 0.1236 | 0.1872 | 9.420248 |
| 2500 | 3.272251 | 0.0970 | 0.1791 | 8.539280 |
| 3000 | 3.926702 | 0.0728 | 0.1714 | 8.207827 |
| 3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
| 4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
The checkpoint with the lowest evaluation loss (step 3000, eval loss 0.1714) is used as the final model.
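
The early-stopping rule described under Training Procedure can be replayed against the evaluation losses in the table above. The sketch below is a plain-Python illustration with a patience of two evaluations. The function name is hypothetical, and the card does not say whether the original run used `transformers.EarlyStoppingCallback` or a manual check.

```python
# Evaluation losses copied from the Training Results table, keyed by step.
eval_losses = {
    500: 0.3889, 1000: 0.2519, 1500: 0.2038, 2000: 0.1872,
    2500: 0.1791, 3000: 0.1714, 3500: 0.1741, 4000: 0.1773,
}


def replay_early_stopping(losses: dict, patience: int = 2):
    """Return (best_step, stop_step) for a patience-based stopping rule.

    Training stops once `patience` consecutive evaluations fail to improve
    on the best loss seen so far. stop_step is None if that never happens.
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step in sorted(losses):
        if losses[step] < best_loss:
            best_step, best_loss = step, losses[step]
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None


best_step, stop_step = replay_early_stopping(eval_losses)
print(best_step, stop_step)  # 3000 4000
```

Under this rule the best checkpoint is at step 3000 and training halts at step 4000, which is consistent with the table ending before the configured `max_steps` of 5000.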
### Testing Data, Factors & Metrics
#### Testing Data
To test the model, I used the last 100 transcripts in the corpus as a held-out test set, which amounts to around 43k samples.
### Results
| Model | WER |
|:----------------------------:|:-----:|
| fine-tuned-122k-whisper-small| 9.69% |
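
For context on the metric, word error rate is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation code behind the figure above: the card does not state which library or text normalisation was used, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one deleted word ("the") and one substitution
# ("centre" -> "center") against a 7-word reference.
wer = word_error_rate(
    "can you go to the hawker centre",
    "can you go to hawker center",
)
print(round(100 * wer, 2))  # 28.57
```

A corpus-level score is usually computed by summing edits and reference words over all utterances rather than averaging per-utterance rates.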
#### Summary
The model is not perfect, but when the speech is clear, it transcribes Singaporean terms and slang accurately.
### Compute Infrastructure
Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).
#### Hardware
- Single A6000 GPU
## Model Card Contact
[Low Wei Teck](mailto:[email protected])