---
language:
- ms
- en
---

# Malaysian Finetune Whisper Base

Whisper Base fine-tuned on the following Malaysian datasets:
1. IMDA STT, https://huggingface.co/datasets/mesolitica/IMDA-STT
2. Pseudolabeled Malaysian YouTube videos, https://huggingface.co/datasets/mesolitica/pseudolabel-malaysian-youtube-whisper-large-v3
3. Malay Conversational Speech Corpus, https://huggingface.co/datasets/malaysia-ai/malay-conversational-speech-corpus
4. Haqkiem TTS Dataset, which is private; access can be requested from https://www.linkedin.com/in/haqkiem-daim/
5. Pseudolabeled Nusantara audiobooks, https://huggingface.co/datasets/mesolitica/nusantara-audiobook

Fine-tuning scripts are at https://github.com/mesolitica/malaya-speech/tree/malaysian-speech/session/whisper

Weights & Biases logs are at https://wandb.ai/huseinzol05/malaysian-whisper-base?workspace=user-huseinzol05
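
A minimal sketch of how a checkpoint fine-tuned this way might be used for transcription with the Hugging Face `transformers` ASR pipeline. This card does not state the model's repository ID, so `model_id` is left as a parameter for you to fill in. The helper names `decode_options` and `transcribe` are hypothetical, and the default language code `ms` is an assumption taken from the `language` metadata above. Decoding a file path also assumes `ffmpeg` is installed.

```python
from typing import Any, Dict


def decode_options(language: str = "ms", task: str = "transcribe") -> Dict[str, str]:
    """Build generation options that pin Whisper's language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    return {"language": language, "task": task}


def transcribe(audio_path: str, model_id: str, language: str = "ms") -> str:
    """Transcribe one audio file with a Whisper checkpoint from the Hub.

    `model_id` is a placeholder: pass the repository ID of this model.
    """
    # Imported lazily so decode_options() works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper processes audio in 30-second windows
    )
    result: Dict[str, Any] = asr(audio_path, generate_kwargs=decode_options(language))
    return result["text"]
```

Passing `language="en"` instead would request English decoding, which may suit the English portions of the training data; whether forcing the language helps for code-switched speech is something to check on your own audio.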