Fine-tuning the model
I had been trying to fine-tune the model for a while with no success until today. I'll comment here on some of the issues I ran into.
The VoxPopuli dataset is not compatible with Windows: the audio sample file names are not valid on Windows, so you have to rename all the files and fix the metadata spreadsheets accordingly.
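In case it helps anyone, here is a rough sketch of that renaming step (not my exact script); the directory layout, the metadata file name, and the "audio_file" column are placeholders you would need to adapt to your own copy of the dataset.

```python
import os
import re
import pandas as pd

CLIPS_DIR = "voxpopuli/clips"          # placeholder path
METADATA = "voxpopuli/metadata.csv"    # placeholder metadata file

def windows_safe(name: str) -> str:
    # Replace the characters Windows forbids in file names with underscores.
    return re.sub(r'[<>:"/\\|?*]', "_", name)

df = pd.read_csv(METADATA)
for old_name in df["audio_file"]:
    new_name = windows_safe(old_name)
    if new_name != old_name:
        os.rename(os.path.join(CLIPS_DIR, old_name),
                  os.path.join(CLIPS_DIR, new_name))

# Keep the spreadsheet in sync with the renamed files.
df["audio_file"] = df["audio_file"].map(windows_safe)
df.to_csv(METADATA, index=False)
```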
In my case, I created a new dataset just like VoxPopuli but with my own data.
In my copy of S"peechT5 TTS Fine-tuning.ipynb" I added "!pip install sentencepiece" to the second cell.
This got the ball rolling and I was able to start training. After a couple of minutes of training I got a tensor size mismatch error.
I changed my training size and test size, and I got the same error at a different step and with different tensor sizes. This led me to think my issue was related to the dataset.
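For reference, the sizes I was tweaking are just the arguments to datasets' train_test_split; something along these lines, where the fraction is only an example value and `dataset` is the processed dataset from the notebook:

```python
# Re-split the prepared dataset; 0.1 is just an example test fraction.
dataset = dataset.train_test_split(test_size=0.1, seed=42)
```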
My first thought was that I had samples that were too long, so I changed the is_not_too_long function to filter out samples with more than 100 tokens.
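Roughly what that first change looked like in my copy (a sketch, assuming the filter is applied to the tokenized input_ids column as in the notebook):

```python
def is_not_too_long(input_ids):
    # Keep only samples whose tokenized text is at most 100 tokens.
    return len(input_ids) <= 100

dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
```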
Again, a similar error with different sizes.
I read somewhere that the issue may be caused by samples that had no labels, but I could not find any in my dataset.
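A quick way to look for such samples (a sketch, assuming "no labels" would show up as an empty tokenized text; run it on the dataset before splitting, or on each split separately):

```python
# Count samples whose tokenized text is empty; a non-zero count would
# support the "no labels" theory.
empty = dataset.filter(lambda ids: len(ids) == 0, input_columns=["input_ids"])
print(f"Samples with empty labels: {len(empty)}")
```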
Just in case, I modified the is_not_too_long function again, this time to filter out all samples with fewer than 25 or more than 100 tokens.
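The final version of the filter, for anyone who wants to try the same thing (same caveat as above about the column name):

```python
def is_not_too_long(input_ids):
    # Drop samples with fewer than 25 or more than 100 tokens.
    return 25 <= len(input_ids) <= 100

dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
```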
This final change gave me two surprises.
- I managed to fine-tune the model
- After all the filtering I went from roughly 1500 samples down to 135 for training and 15 for testing. I'm amazed at how well the results sound.