---
language:
- en
license: mit
base_model: openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: whisper-small-singlish-122k
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: NSC
      type: NSC
    metrics:
    - name: WER
      type: WER
      value: 9.69
---

# Whisper-small-singlish-122k

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small), trained on a subset (122k samples) of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).

The following results on the evaluation set (43,788 samples) are reported:

- Loss: 0.171377
- WER: 9.69

## Model Details

### Model Description

- **Developed by:** [jensenlwt](https://huggingface.co/jensenlwt)
- **Model type:** automatic-speech-recognition
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)

## Uses

The model is intended as an exploratory exercise in developing a better ASR model for Singapore English (Singlish).

The recommended audio for testing should:

1. Involve local Singaporean slang, dialect, names, terms, etc.
2. Be spoken with a Singaporean accent.

### Direct Use

To use the model in an application, you can make use of `transformers`:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
```

### Out-of-Scope Use

- Long-form audio
- Broken Singlish (typically spoken by the older generation)
- Poor-quality audio (training samples were recorded in a controlled environment)
- Conversational speech (the model is not trained on conversation)

## Training Details

### Training Data

I made use of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training. Specifically, I used **Part 2** – a series of prompted read-speech recordings that involve local named entities, slang, and dialect.

For training, I used the first 300 transcripts in the corpus, which amounts to around 122k samples from ~161 speakers.

### Training Procedure

The model was fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation. In addition, training was ended early if the evaluation loss did not decrease over two consecutive evaluation steps, based on previous training experience.

#### Training Hyperparameters

The following hyperparameters are used:

- **batch_size**: 128
- **gradient_accumulation_steps**: 1
- **learning_rate**: 1e-5
- **warmup_steps**: 500
- **max_steps**: 5000
- **fp16**: true
- **eval_batch_size**: 32
- **eval_step**: 500
- **max_grad_norm**: 1.0
- **generation_max_length**: 225

#### Training Results

| Steps | Epoch    | Train Loss | Eval Loss | WER       |
|:-----:|:--------:|:----------:|:---------:|:---------:|
| 500   | 0.654450 | 0.7418     | 0.3889    | 17.968250 |
| 1000  | 1.308901 | 0.2831     | 0.2519    | 11.880948 |
| 1500  | 1.963351 | 0.1960     | 0.2038    | 9.948440  |
| 2000  | 2.617801 | 0.1236     | 0.1872    | 9.420248  |
| 2500  | 3.272251 | 0.0970     | 0.1791    | 8.539280  |
| 3000  | 3.926702 | 0.0728     | 0.1714    | 8.207827  |
| 3500  | 4.581152 | 0.0484     | 0.1741    | 8.145801  |
| 4000  | 5.235602 | 0.0401     | 0.1773    | 8.138047  |

The checkpoint with the lowest evaluation loss is used as the final model.

### Testing Data, Factors & Metrics

#### Testing Data

To test the model, I used the last 100 transcripts (held-out test set) in the corpus, which is around 43k samples.
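As a minimal sketch of how WER can be computed on a handful of held-out samples (this is not the exact evaluation script; the audio paths and reference transcripts below are placeholders), the `evaluate` library can be used together with the pipeline shown above:

```python
# Sketch only: compute WER on a few held-out samples with the `evaluate` library.
# Audio paths and reference transcripts are placeholders, not NSC files.
from transformers import pipeline
import evaluate

pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
wer_metric = evaluate.load("wer")

audio_files = ["sample_0001.wav", "sample_0002.wav"]                    # placeholder paths
references = ["reference transcript one", "reference transcript two"]   # placeholder references

predictions = [pipe(path)["text"] for path in audio_files]
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```

Note that reported WER figures typically apply text normalisation (e.g. lowercasing and punctuation removal) to both predictions and references before scoring.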
### Results

| Model                          | WER   |
|:------------------------------:|:-----:|
| fine-tuned-122k-whisper-small  | 9.69% |

#### Summary

The model is not perfect, but when the audio is spoken clearly, it is able to transcribe Singaporean terms and slang accurately.

### Compute Infrastructure

Trained on a VM instance provisioned on [jarvislabs.ai](https://jarvislabs.ai/).

#### Hardware

- Single A6000 GPU

## Model Card Authors

[More Information Needed]

## Model Card Contact

[Low Wei Teck](mailto:jensenlwt@gmail.com)