Updates README errors
README.md
CHANGED
@@ -231,16 +231,16 @@ The following hyperparameters were used during training:
|
|
231 |
|
232 |
While debugging other training sessions that used more data from the Esperanto Common Voice dataset (some loss calculations were returning either `inf` or `nan`), I found that some training examples had a surprisingly high CER when evaluated with this model. Some examples:
|
233 |
|
234 |
-
| file | Actual<br>Predicted | CER | Comment |
|
235 |
|:-----|:--------------------|:----|:--------|
|
236 |
-
|common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
|
237 |
-
|common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
|
238 |
-
|common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
|
239 |
-
|2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
|
240 |
-
|7333 | poste sekvas difinoj de la termino<br>po | 0.94 | No audio |
|
241 |
-
|7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>po | 0.98 | No audio |
|
242 |
-
|7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>po | 0.97 | No audio |
|
243 |
-
|11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
|
244 |
|
245 |
Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
|
246 |
|
@@ -270,7 +270,7 @@ By running `run_speech_recognition_ctc` with `do_train=false`, setting `model_na
|
|
270 |
metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
|
271 |
```
|
272 |
|
273 |
-
Doing this shows that of the 14913 examples in `test`, the following
|
274 |
|
275 |
`common_voice_eo_25167318.mp3`
|
276 |
|
@@ -278,7 +278,7 @@ The audio on this is severely garbled. This should absolutely be filtered out of
|
|
278 |
|
279 |
No `validation` samples result in `inf` or `nan`.
|
280 |
|
281 |
-
The following
|
282 |
|
283 |
```txt
|
284 |
common_voice_eo_25467641.mp3
|
@@ -313,6 +313,8 @@ Since this model seems to work well enough, I could run inference on all samples
|
|
313 |
|
314 |
#### Test set
|
315 |
|
|
|
|
|
316 |
```txt
|
317 |
common_voice_eo_25214319.mp3
|
318 |
common_voice_eo_25006596.mp3
|
@@ -387,11 +389,12 @@ common_voice_eo_25252698.mp3
|
|
387 |
common_voice_eo_25518636.mp3
|
388 |
```
|
389 |
|
390 |
-
Note on
|
391 |
|
392 |
#### Validation set
|
393 |
|
394 |
-
|
|
|
395 |
```txt
|
396 |
common_voice_eo_25392669.mp3
|
397 |
common_voice_eo_25392674.mp3
|
@@ -414,7 +417,6 @@ common_voice_eo_27380623.mp3
|
|
414 |
|
415 |
I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
|
416 |
|
417 |
-
|
418 |
#### Training set
|
419 |
|
420 |
135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
|
|
|
231 |
|
232 |
While debugging other training sessions that used more data from the Esperanto Common Voice dataset (some loss calculations were returning either `inf` or `nan`), I found that some training examples had a surprisingly high CER when evaluated with this model. Some examples:
|
233 |
|
234 |
+
| file | Actual<br>---<br>Predicted | CER | Comment |
|
235 |
|:-----|:--------------------|:----|:--------|
|
236 |
+
|common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>---<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
|
237 |
+
|common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>---<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
|
238 |
+
|common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>---<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
|
239 |
+
|2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>---<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
|
240 |
+
|7333 | poste sekvas difinoj de la termino<br>---<br>po | 0.94 | No audio |
|
241 |
+
|7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>---<br>po | 0.98 | No audio |
|
242 |
+
|7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>---<br>po | 0.97 | No audio |
|
243 |
+
|11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>---<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
|
244 |
|
245 |
Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
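
A minimal sketch of how such files might be dropped, assuming each example stores its audio location under a `path` key. The `BAD_FILES` set, the `keep_example` helper, and the second sample file name are hypothetical and only illustrate the idea:

```python
# Hypothetical blocklist; in practice it would hold the file names listed in this README.
BAD_FILES = {
    "common_voice_eo_25365027.mp3",
    "common_voice_eo_25365472.mp3",
    "common_voice_eo_25365836.mp3",
}


def keep_example(example: dict) -> bool:
    """Return True unless the example's audio file is on the blocklist."""
    # Assumption: the audio location is stored under "path"; compare by base name only.
    file_name = example["path"].rsplit("/", 1)[-1]
    return file_name not in BAD_FILES


# Two toy examples: the first is on the blocklist, the second is a made-up placeholder.
examples = [
    {"path": "clips/common_voice_eo_25365027.mp3", "sentence": "..."},
    {"path": "clips/common_voice_eo_00000001.mp3", "sentence": "..."},
]
kept = [ex for ex in examples if keep_example(ex)]
```

With the Hugging Face `datasets` library, the same predicate could probably be passed to `Dataset.filter`, provided the dataset exposes a comparable `path` column.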
|
246 |
|
|
|
270 |
metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
|
271 |
```
|
272 |
|
273 |
+
Doing this shows that of the 14913 examples in `test`, the following example results in `inf` loss:
|
274 |
|
275 |
`common_voice_eo_25167318.mp3`
|
276 |
|
|
|
278 |
|
279 |
No `validation` samples result in `inf` or `nan`.
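
One way to locate such examples, sketched under the assumption that the loss of each example has been recorded against its file name. The `losses` mapping, the `non_finite_losses` helper, and the loss values are invented for illustration; only the first file name comes from the text above:

```python
import math


def non_finite_losses(losses: dict[str, float]) -> list[str]:
    """Return the file names whose loss is inf, -inf, or nan."""
    return [name for name, loss in losses.items() if not math.isfinite(loss)]


# Made-up per-example losses keyed by file name.
losses = {
    "common_voice_eo_25167318.mp3": float("inf"),
    "example_a.mp3": 0.42,
    "example_b.mp3": float("nan"),
}
flagged = non_finite_losses(losses)
```

`math.isfinite` rejects both `inf` and `nan`, so a single check covers the two failure modes mentioned here.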
|
280 |
|
281 |
+
The following 18 out of 143984 examples in `train` result in `inf` loss:
|
282 |
|
283 |
```txt
|
284 |
common_voice_eo_25467641.mp3
|
|
|
313 |
|
314 |
#### Test set
|
315 |
|
316 |
+
71 of 14913 examples in the test set show high CER.
|
317 |
+
|
318 |
```txt
|
319 |
common_voice_eo_25214319.mp3
|
320 |
common_voice_eo_25006596.mp3
|
|
|
389 |
common_voice_eo_25518636.mp3
|
390 |
```
|
391 |
|
392 |
+
Note on two of the examples: we know that _saluton kiel vi fartas_ ("Hello, how are you") and _atendu momenton_ ("Wait a moment") are a good start in learning Esperanto, but if that's not the text to record, you're not really helping.
|
393 |
|
394 |
#### Validation set
|
395 |
|
396 |
+
17 of 14909 examples in the validation set show high CER.
|
397 |
+
|
398 |
```txt
|
399 |
common_voice_eo_25392669.mp3
|
400 |
common_voice_eo_25392674.mp3
|
|
|
417 |
|
418 |
I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
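
For reference, here is a minimal sketch of character error rate as it is commonly defined: the Levenshtein distance between reference and prediction, divided by the reference length. This is an assumption about the metric, not necessarily the exact implementation used to produce the numbers in this README, and the helper names are hypothetical:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a predicted character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Reference and prediction taken from one of the "No audio" rows in the table.
score = cer("poste sekvas difinoj de la termino", "po")
```

Rounded to two decimals, `score` agrees with the 0.94 reported earlier for that row, which suggests the listed values follow this definition.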
|
419 |
|
|
|
420 |
#### Training set
|
421 |
|
422 |
135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
|