Poor WER results on CV_15
Hi, thanks for providing this nice repo. I have tested your model, which works very well. Great work.
I also tried to train the model on CV_15. However, the WER was about 1.0 after 20 hours of training. Here is my bash script:
```bash
python xls-r-uyghur-cv15/run_speech_recognition_ctc.py \
--dataset_name="mozilla-foundation/common_voice_15_0" \
--model_name_or_path="facebook/wav2vec2-xls-r-300m" \
--dataset_config_name="ug" \
--train_split_name="train+validation" \
--eval_split_name="test" \
--output_dir="./xls-r-uyghur-cv15" \
--overwrite_output_dir \
--num_train_epochs="100" \
--per_device_train_batch_size="16" \
--per_device_eval_batch_size="8" \
--gradient_accumulation_steps="4" \
--learning_rate="1e-4" \
--warmup_steps="2000" \
--length_column_name="input_length" \
--evaluation_strategy="steps" \
--text_column_name="sentence" \
--chars_to_ignore , ? . ! \- \; \: \\ _ \| ‒ ☺ ♂ © « ¬ » \" „ “ % ” � — ’ ، ؛ ؟ ‹ › − … – \
--eval_metrics="wer" \
--save_steps="500" \
--eval_steps="500" \
--logging_steps="100" \
--min_duration_in_seconds="0.2" \
--layerdrop="0.0" \
--activation_dropout="0.1" \
--save_total_limit="3" \
--freeze_feature_encoder \
--feat_proj_dropout="0.0" \
--mask_time_prob="0.75" \
--mask_time_length="10" \
--mask_feature_prob="0.25" \
--mask_feature_length="64" \
--gradient_checkpointing \
--use_auth_token \
--fp16 \
--group_by_length \
--do_train --do_eval
# --push_to_hub
```
How can I improve the training? Thanks.
Not sure what it could be, but here are some things to try:
- Does the existing script still work correctly with CV8? A different version of some dependency may be installed in your environment, which could prevent correct training.
- Does it train using just "validation" as the train split? Does it quickly overfit if you use "validation" as both train and test? There may be a data quality issue in the newer data, or a change in formatting contrary to the assumptions of this script (a quick check is sketched below).
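As a rough sketch of that formatting check, assuming the `datasets` library and an HF token with the Common Voice terms accepted (the split and row count below are arbitrary), you can stream a few hundred CV15 transcripts and tally their characters:

```python
from collections import Counter

from datasets import load_dataset

# Stream so we don't download the full audio archives just to read text.
ds = load_dataset(
    "mozilla-foundation/common_voice_15_0",
    "ug",
    split="validation",
    streaming=True,
    use_auth_token=True,
)

chars = Counter()
for i, row in enumerate(ds):
    chars.update(row["sentence"])
    if i >= 500:  # a few hundred rows is enough for a quick look
        break

# Characters that survive --chars_to_ignore end up in the CTC vocab;
# unexpected entries here (digits, Latin letters, odd punctuation)
# inflate the vocab and can hurt training.
for ch, n in chars.most_common():
    print(repr(ch), n)
```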
Thanks for your reply. I have not tried it with CV8 yet, but I will test the code with it.
FYI, I have trained whisper-small-v2 on CV15, and the WER is 27. However, I used the Uzbek tokeniser and Uyghur Latin Script, since Uyghur is not included in Whisper's tokeniser (Uzbek is).
After tuning some hyper-parameters, the model is trainable on the CV15 UG dataset. However, it is hard to find the best hyper-parameters.
Hello @osman, I also met this problem with CV13 and CV16 on the UG dataset. Could you give me some suggestions about hyper-parameters?
@osman Thanks for the suggestion. I tuned with several different learning rates and warmup steps, but the model does not converge. The training loss decreases normally, but the validation loss goes like a "V". The same thing happened with Whisper PEFT fine-tuning. I am using UAS as the tokenizer. My total number of tokens is 75; however, shouldn't the actual number of UAS characters be 34?
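For the 75-vs-34 question, one thing worth checking is what actually landed in the vocab. Here is a hedged sketch that loads the `vocab.json` the CTC script writes (the path assumes the `output_dir` from the script above) and flags single-character tokens outside the Uyghur Arabic letters; the letter set below is my own approximation, not something taken from this repo:

```python
import json

# Uyghur Arabic Script letters (32 letters plus the hamza carrier, as I
# understand the set); presentation forms, digits and leftover
# punctuation are deliberately excluded.
UAS_LETTERS = set("ئابپتجچخدرزژسشغفقكگڭلمنھوۇۆۈۋېىيە")

with open("./xls-r-uyghur-cv15/vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)

# Multi-character entries ([PAD], [UNK], etc.) and the "|" word delimiter
# are expected; other single characters outside UAS_LETTERS usually mean
# the --chars_to_ignore list needs extending.
extra = [
    tok for tok in vocab
    if len(tok) == 1 and tok not in UAS_LETTERS and tok != "|"
]
print(f"{len(vocab)} tokens total; {len(extra)} single-char tokens outside UAS:")
print(extra)
```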
@kli017 That is what I encountered. Tuning hyper-parameters helps, but the final results are still not good. I trained Whisper with the Uzbek tokeniser, and the results are much better: I converted UAS to Uyghur Latin Script and then used the Uzbek tokeniser. The training was smooth, and I have not played around with any hyper-parameters. I got a WER of about 25% on CV16.
Here is the model: https://huggingface.co/osman/whisper-small-ug
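For reference, a minimal sketch of that tokeniser setup, assuming `transformers` with Whisper support: Uyghur is not in Whisper's language list, so the tokeniser is configured for Uzbek and fed Latin-script text (the sample sentence is a placeholder, not real CV data).

```python
from transformers import WhisperProcessor

# Whisper has no Uyghur language token, but Uzbek is supported and its
# Latin orthography is close to Uyghur Latin Script.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small",
    language="uzbek",
    task="transcribe",
)

uls_sentence = "salam dunya"  # placeholder ULS text, not a real CV sample
labels = processor.tokenizer(uls_sentence).input_ids
print(processor.tokenizer.decode(labels))
```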
@kli017 I didn't understand the "unseen" you refer to. The examples you listed are all Uyghur Arabic characters, just in different presentation forms. They can all be converted to Uyghur Latin Script. Check out this repo for the conversion: https://github.com/neouyghur/ScriptConverter4Uyghur
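To illustrate the idea (this is not the linked repo's actual API), here is a deliberately partial toy mapping from UAS to Uyghur Latin Script; the repo handles the full alphabet, hamza placement, and presentation-form normalisation properly:

```python
# Partial UAS -> Uyghur Latin Script mapping, for illustration only.
UAS_TO_ULS = {
    "ئ": "",   # hamza carrier; dropped in this simplified sketch
    "ا": "a", "ە": "e", "ب": "b", "پ": "p", "ت": "t",
    "ج": "j", "چ": "ch", "خ": "x", "د": "d", "ر": "r",
    "ز": "z", "ژ": "zh", "س": "s", "ش": "sh", "غ": "gh",
    "ف": "f", "ق": "q", "ك": "k", "گ": "g", "ڭ": "ng",
    "ل": "l", "م": "m", "ن": "n", "ھ": "h", "و": "o",
    "ۇ": "u", "ۆ": "ö", "ۈ": "ü", "ۋ": "w", "ې": "é",
    "ى": "i", "ي": "y",
}

def uas_to_uls(text: str) -> str:
    # Unmapped characters (spaces, punctuation) pass through unchanged.
    return "".join(UAS_TO_ULS.get(ch, ch) for ch in text)

print(uas_to_uls("سالام"))  # -> salam
```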