updated
README.md CHANGED
@@ -58,13 +58,10 @@ asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_tr
```

### Transcribing using Python
-First, let's get a sample
-```
-
-```
-
-```
-asr_model.transcribe(['2086-149220-0033.wav'])
+
+Simply do:
+```
+asr_model.transcribe(['filename.wav'])
```

### Transcribing many audio files
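For context, the call added in this hunk is the tail end of the loading snippet whose first line appears in the hunk header. A minimal sketch of the whole flow, assuming `nemo_toolkit[asr]` is installed; the repository id is truncated in the header and the audio path is a placeholder:

```
import nemo.collections.asr as nemo_asr

# Load the checkpoint named in the hunk header; the repository id is truncated
# there, so substitute the full name for this placeholder string.
asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_tr...")

# transcribe() takes a list of paths to 16 kHz mono WAV files.
results = asr_model.transcribe(["filename.wav"])

# The exact return type (strings, Hypothesis objects, or a tuple of hypothesis
# lists) varies with the NeMo version, so inspect the result.
print(results)
```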
@@ -83,32 +80,32 @@ This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

-
+The Conformer-Transducer model is an autoregressive variant of the Conformer model for Automatic Speech Recognition that uses the Transducer loss and decoding.

## Training

-
+The NeMo toolkit was used for training the models. These models are fine-tuned with this example script and this base config.
+
+The tokenizers for these models were built using the text transcripts of the train set with this script.

### Datasets

-
+All the models in this collection are trained on a composite dataset comprising over two thousand hours of cleaned German speech:
+
+1. MCV7.0: 567 hours
+2. MLS: 1,524 hours
+3. VoxPopuli: 214 hours

## Performance

-
-OR
-USE THE Hugging Face Evaluate LiBRARY TO UPLOAD METRICS>
+Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.

-
+MCV7.0 test: 4.93

-
+## Limitations

-
-Since this model was trained on publically available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
+The model might perform worse for accented speech.


## References
-
-<ADD ANY REFERENCES HERE AS NEEDED>
-
-[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+[NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
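The Training section added above points to a tokenizer-building script without naming it. As a generic illustration only, not the script the card refers to, a BPE tokenizer over the train-set transcripts can be produced with SentencePiece, which such scripts typically wrap; the file names and vocabulary size below are assumptions:

```
import sentencepiece as spm

# Train a BPE tokenizer on the plain-text training transcripts; the input file,
# model prefix, and vocab_size here are placeholders, not values from the card.
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="de_asr_bpe",
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)
```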
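One of the removed template lines suggests the Hugging Face Evaluate library for metrics. A small sketch of checking the reported greedy-decoding WER that way, with made-up reference and predicted transcripts as placeholders; the `jiwer` backend must be installed:

```
import evaluate

wer_metric = evaluate.load("wer")  # uses jiwer under the hood

# Placeholder reference/hypothesis pairs; in practice the references come from
# the MCV7.0 test manifest and the predictions from asr_model.transcribe(...).
references = ["heute scheint die sonne", "das ist ein beispielsatz"]
predictions = ["heute scheint die sonne", "das ist ein beispiel satz"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```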