wtlow003 committed
Commit eb7449f · Parent(s): ce81198

fix: README.md

README.md CHANGED
@@ -40,6 +40,13 @@ The recommended audio usage for testing should be:
 
 To use the model in an application, you can make use of `transformers`:
 
+```python
+# Use a pipeline as a high-level helper
+from transformers import pipeline
+
+pipe = pipeline("automatic-speech-recognition", model="jensenlwt/whisper-small-singlish-122k")
+```
+
 ### Out-of-Scope Use
 
 - Long form audio
@@ -47,16 +54,20 @@ To use the model in an application, you can make use of `transformers`:
 - Poor quality audio (audio samples are recorded in a controlled environment)
 - Conversation (as the model is not trained on conversation)
 
-
-## How to Get Started with the Model
-
-
 ## Training Details
 
 ### Training Data
 
+We made use of the [National Speech Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus) for training.
+Specifically, we made use of **Part 2**, a series of prompted read-speech recordings that involve local named entities, slang, and dialect.
+
+To train, I made use of the first 300 transcripts in the corpus, which is around 122k samples from ~161 speakers.
+
 ### Training Procedure
 
+The model is fine-tuned with occasional interruptions to adjust the batch size and maximise GPU utilisation.
+In addition, I end training early if `eval_loss` does not decrease over two consecutive evaluation steps, based on previous training experience.
+
 #### Training Hyperparameters
 
 The following hyperparameters are used:
@@ -85,10 +96,14 @@ The following hyperparameters are used:
 | 3500 | 4.581152 | 0.0484 | 0.1741 | 8.145801 |
 | 4000 | 5.235602 | 0.0401 | 0.1773 | 8.138047 |
 
+The model with the lowest evaluation loss is used as the final checkpoint.
+
 ### Testing Data, Factors & Metrics
 
 #### Testing Data
 
+To test the model, I made use of the last 100 transcripts (held-out test set) in the corpus, which is around 43k samples.
+
 ### Results
 
 | Model | WER |
@@ -102,7 +117,6 @@ The following hyperparameters are used:
 
 ### Model Architecture and Objective
 
-
 ### Compute Infrastructure
 
 [More Information Needed]
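The pipeline snippet added in the first hunk builds the model but stops short of calling it. As a minimal usage sketch (the audio filename is a placeholder; decoding a file path requires ffmpeg to be installed):

```python
# Sketch: transcribe one short clip with the pipeline from the README.
# "sample.wav" is a placeholder; any short Singlish recording works, and the
# pipeline resamples it to the 16 kHz rate Whisper expects.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="jensenlwt/whisper-small-singlish-122k",
)

result = pipe("sample.wav")
print(result["text"])  # decoded transcription
```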
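The training procedure added in the second hunk (stop after two evaluations without improvement, keep the checkpoint with the lowest evaluation loss) matches the behaviour of `transformers`' built-in `EarlyStoppingCallback`. The commit does not include the training script, so the following is only a sketch of how such a setup is commonly written; the model and dataset variables are placeholders:

```python
# Sketch of the early-stopping setup the README describes; illustrative only.
from transformers import (
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-singlish",  # placeholder path
    eval_strategy="steps",        # named `evaluation_strategy` in older releases
    save_strategy="steps",        # must match eval_strategy for best-model loading
    load_best_model_at_end=True,  # final checkpoint = lowest eval_loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,      # lower eval_loss is better
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: prepared NSC Part 2 features
    eval_dataset=eval_dataset,    # placeholder: held-out evaluation split
    # stop once eval_loss fails to improve for two consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```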
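The Results table reports WER. As a sketch of how that metric is typically computed for a model card like this one, using the `evaluate` library (the author's evaluation code is not part of this commit; the strings below are placeholders):

```python
# Sketch: word error rate with the `evaluate` library; illustrative only.
import evaluate

wer_metric = evaluate.load("wer")

# Placeholders: in practice, predictions come from running the pipeline over
# the ~43k held-out samples, and references are the ground-truth transcripts.
predictions = ["then i take mrt to jurong east lor"]
references = ["then i take mrt to jurong east lah"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}")
```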