Transcription normalization
#2 opened by qanastek
Thank you very much for your contribution to the community while sharing both models and training scripts.
You have mentioned that the training dataset consists of a private subset with 40K hours of English speech plus 25K hours from the following public datasets:
- Librispeech 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN) - 2,000 hour subset
- Mozilla Common Voice (v7.0)
- People's Speech - 12,000 hour subset
But you haven't mentioned any of the normalization steps applied to the transcriptions, even though each corpus has its own annotation protocol. Do you share these pre-processing steps anywhere? I cannot find them in the NeMo GitHub repository.
Regards.
Some of the dataset preprocessing scripts are made available here: https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing
Eventually, we will make all of the public dataset pre-processing scripts available.
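For anyone landing here in the meantime, here is a minimal sketch of the kind of transcription normalization commonly applied before merging corpora with different annotation protocols (lowercasing, punctuation stripping, whitespace collapsing). This is an illustrative assumption, not the exact pipeline used for this model or the scripts in the NeMo repository:

```python
import re

def normalize_transcript(text: str) -> str:
    """Illustrative normalization: lowercase, drop punctuation, collapse whitespace.

    This is only a sketch of common ASR text normalization; the actual steps
    per corpus may differ (e.g. number expansion, removal of non-speech tags).
    """
    text = text.lower()
    # Remove everything except letters, apostrophes, and spaces
    # (apostrophes carry lexical information, e.g. "don't").
    text = re.sub(r"[^a-z' ]", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_transcript("Hello, World!  It's noon."))  # -> "hello world it's noon"
```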
smajumdar94 changed discussion status to closed