xekri committed on
Commit
5326dd8
1 Parent(s): bfa2a0a

Updates README errors

Files changed (1)
  1. README.md +16 -14
README.md CHANGED
@@ -231,16 +231,16 @@ The following hyperparameters were used during training:
231
 
232
  While debugging other training sessions where more data from the Esperanto Common Voice dataset was used -- some loss calculations were returning either `inf` or `nan` -- I found that some of the training set trained with this model had surprisingly high CER. Some examples:
233
 
234
- | file | Actual<br>Predicted | CER | Comment |
235
  |:-----|:--------------------|:----|:--------|
236
- |common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
237
- |common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
238
- |common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
239
- |2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
240
- |7333 | poste sekvas difinoj de la termino<br>po | 0.94 | No audio |
241
- |7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>po | 0.98 | No audio |
242
- |7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>po | 0.97 | No audio |
243
- |11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
244
 
245
  Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
246
 
@@ -270,7 +270,7 @@ By running `run_speech_recognition_ctc` with `do_train=false`, setting `model_na
270
  metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
271
  ```
272
 
273
- Doing this shows that of the 14913 examples in `test`, the following file results in `inf` loss:
274
 
275
  `common_voice_eo_25167318.mp3`
276
 
@@ -278,7 +278,7 @@ The audio on this is severely garbled. This should absolutely be filtered out of
278
 
279
  No `validation` samples result in `inf` or `nan`.
280
 
281
- The following files out of the 143984 examples in `train` result in `inf` loss:
282
 
283
  ```txt
284
  common_voice_eo_25467641.mp3
@@ -313,6 +313,8 @@ Since this model seems to work well enough, I could run inference on all samples
313
 
314
  #### Test set
315
 
 
 
316
  ```txt
317
  common_voice_eo_25214319.mp3
318
  common_voice_eo_25006596.mp3
@@ -387,11 +389,12 @@ common_voice_eo_25252698.mp3
387
  common_voice_eo_25518636.mp3
388
  ```
389
 
390
- Note on `test[100]` and `test[101]`: We know that `saluton kiel vi fartas` and `atendu momenton` is a good start, but if that's not the text to record, you're not really helping.
391
 
392
  #### Validation set
393
 
394
- 141 of
 
395
  ```txt
396
  common_voice_eo_25392669.mp3
397
  common_voice_eo_25392674.mp3
@@ -414,7 +417,6 @@ common_voice_eo_27380623.mp3
414
 
415
  I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
416
 
417
-
418
  #### Training set
419
 
420
  135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
 
231
 
232
  While debugging other training sessions where more data from the Esperanto Common Voice dataset was used -- some loss calculations were returning either `inf` or `nan` -- I found that some of the training set trained with this model had surprisingly high CER. Some examples:
233
 
234
+ | file | Actual<br>---<br>Predicted | CER | Comment |
235
  |:-----|:--------------------|:----|:--------|
236
+ |common_voice_eo_25365027.mp3 | en la hansaj agentejoj komercistoj el la regiono renkontis kolegojn el aliaj regionoj<br>---<br>a taaj keo eoj eejn kigos eegoj eioeegiooj| 0.61 | No audio |
237
+ |common_voice_eo_25365472.mp3 | ili vendas armilojn kaj teknologiojn al la fanatikuloj por gajni monon monon monon<br>---<br>ila mamato aiil ajn kno ion a a aotigojn pu aiooo aj knon | 0.55 | Barely any audio, distorted |
238
+ |common_voice_eo_25365836.mp3 | industria apliko estas la kreado de modifitaj bakterioj kiuj produktas deziratan kemian substancon<br>---<br>iiti sieetas la eeadooddddooiooaotooeioj aiicenon | 0.67 | Barely any audio, distorted |
239
+ |2600 | ili akiras plenkreskan plumaron nur en la kvina jaro<br>---<br>ili aaros peetaj patato a a sia ro | 0.52 | It's literally someone saying 'injabum'. Thanks, troll. |
240
+ |7333 | poste sekvas difinoj de la termino<br>---<br>po | 0.94 | No audio |
241
+ |7334 | li gvidis multajn kursojn laŭ la csehmetodo<br>---<br>po | 0.98 | No audio |
242
+ |7429 | tamen pro la rekonstruo de kluzoj ne eblas trapasi komplete<br>---<br>po | 0.97 | No audio |
243
+ |11662 | lingvotesto estas postulata ekzemple por akceptiĝo en anglalingvaj altlernejoj<br>---<br>linkonteto estastitot etateerteito en pootaeaje lgijoj | 0.58 | No audio |
244
 
245
  Some examples have no audio. All of these files in the dataset are completely useless, and should be removed from the training set.
246
 
 
270
  metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
271
  ```
272
 
273
+ Doing this shows that of the 14913 examples in `test`, the following example results in `inf` loss:
274
 
275
  `common_voice_eo_25167318.mp3`
276
 
 
278
 
279
  No `validation` samples result in `inf` or `nan`.
280
 
281
+ The following 18 out of 143984 examples in `train` result in `inf` loss:
282
 
283
  ```txt
284
  common_voice_eo_25467641.mp3
 
313
 
314
  #### Test set
315
 
316
+ 71 of 14913 examples in the test set show high CER.
317
+
318
  ```txt
319
  common_voice_eo_25214319.mp3
320
  common_voice_eo_25006596.mp3
 
389
  common_voice_eo_25518636.mp3
390
  ```
391
 
392
+ Note on two of the examples: We know that _saluton kiel vi fartas_ ("Hello, how are you") and _atendu momenton_ ("Wait a moment") is a good start in learning Esperanto, but if that's not the text to record, you're not really helping.
393
 
394
  #### Validation set
395
 
396
+ 17 of 14909 examples in the validation set show high CER.
397
+
398
  ```txt
399
  common_voice_eo_25392669.mp3
400
  common_voice_eo_25392674.mp3
 
417
 
418
  I didn't include some which had high CER because of hallucinations during a one-word recording with lots of silence before and after. The recording itself is fine on these.
419
 
 
420
  #### Training set
421
 
422
  135 of 143984 examples yielded high CER. I removed some from this list that had high CER but sounded fine.
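
The README does not show how CER is computed, so the following is only a minimal sketch under stated assumptions: CER is taken to be the character-level Levenshtein distance divided by the length of the reference (actual) text. With that definition, the sketch reproduces the 0.94 reported above for example `7333`. The helper names (`edit_distance`, `cer`, `flag_high_cer`) and the `0.5` threshold are hypothetical and are not taken from the repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, normalised by the reference length (assumption)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def flag_high_cer(rows, threshold=0.5):
    """Return the names of (name, actual, predicted) rows whose CER exceeds
    the threshold. The default threshold is a hypothetical choice."""
    return [name for name, ref, hyp in rows if cer(ref, hyp) > threshold]


# One row from the table above, plus a made-up clean row for contrast.
rows = [
    ("7333", "poste sekvas difinoj de la termino", "po"),
    ("clean_example", "saluton", "saluton"),
]
flagged = flag_high_cer(rows)
```

Rows flagged this way would still need a listen before removal, since the notes above mention recordings that had high CER but sounded fine.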