updated
README.md CHANGED
@@ -58,13 +58,10 @@ asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_tr
```

### Transcribing using Python
-First, let's get a sample
-```
-
-```
-
-```
-asr_model.transcribe(['2086-149220-0033.wav'])
+
+Simply do:
+```
+asr_model.transcribe(['filename.wav'])
```

### Transcribing many audio files
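For context, the call added in this hunk is the tail end of the loading snippet whose first line appears in the hunk header. A minimal sketch of the whole flow, assuming `nemo_toolkit[asr]` is installed; the repository id is truncated in the header and the audio path is a placeholder:

```
import nemo.collections.asr as nemo_asr

# Load the checkpoint named in the hunk header; the repository id is truncated
# there, so substitute the full name for this placeholder string.
asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_tr...")

# transcribe() takes a list of paths to 16 kHz mono WAV files.
results = asr_model.transcribe(["filename.wav"])

# The exact return type (strings, Hypothesis objects, or a tuple of hypothesis
# lists) varies with the NeMo version, so inspect the result.
print(results)
```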
@@ -83,32 +80,32 @@ This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

-
+The Conformer-Transducer model is an autoregressive variant of the Conformer model for Automatic Speech Recognition that uses the Transducer loss and decoding.

## Training

-
+The NeMo toolkit was used for training the models. These models are fine-tuned with this example script and this base config.
+
+The tokenizers for these models were built using the text transcripts of the train set with this script.

### Datasets

-
+All the models in this collection are trained on a composite dataset comprising over two thousand hours of cleaned German speech:
+
+1. MCV7.0: 567 hours
+2. MLS: 1,524 hours
+3. VoxPopuli: 214 hours

## Performance

-
-OR
-USE THE Hugging Face Evaluate LiBRARY TO UPLOAD METRICS>
+Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.

-
+MCV7.0 test: 4.93

-
+## Limitations

-
-Since this model was trained on publically available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
+The model might perform worse for accented speech.


## References
-
-<ADD ANY REFERENCES HERE AS NEEDED>
-
-[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+[NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
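The Training section added above points to a tokenizer-building script without naming it. As a generic illustration only, not the script the card refers to, a BPE tokenizer over the train-set transcripts can be produced with SentencePiece, which such scripts typically wrap; the file names and vocabulary size below are assumptions:

```
import sentencepiece as spm

# Train a BPE tokenizer on the plain-text training transcripts; the input file,
# model prefix, and vocab_size here are placeholders, not values from the card.
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="de_asr_bpe",
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)
```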
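One of the removed template lines suggests the Hugging Face Evaluate library for metrics. A small sketch of checking the reported greedy-decoding WER that way, with made-up reference and predicted transcripts as placeholders; the `jiwer` backend must be installed:

```
import evaluate

wer_metric = evaluate.load("wer")  # uses jiwer under the hood

# Placeholder reference/hypothesis pairs; in practice the references come from
# the MCV7.0 test manifest and the predictions from asr_model.transcribe(...).
references = ["heute scheint die sonne", "das ist ein beispielsatz"]
predictions = ["heute scheint die sonne", "das ist ein beispiel satz"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```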