Fine-tuning
Hello, can you explain how to fine-tune your model for other voices and languages?
Hi SumZbrod, thank you for your interest!
Theoretically, for speaker-specific fine-tuning, you only need to drop the existing speaker embeddings and add your own. If your speakers don't have abundant voice-text samples, it is recommended to freeze all layers except the speaker embedding layer.
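A minimal, untested sketch of that recipe: it assumes the stock VITS `SynthesizerTrn` generator, where the speaker table lives at `net_g.emb_g`, and the sizes `N_NEW_SPEAKERS` / `GIN_CHANNELS` are placeholders you would set to match your data and checkpoint:

```python
import torch
import torch.nn as nn

N_NEW_SPEAKERS = 4   # placeholder: how many speakers you are adding
GIN_CHANNELS = 256   # placeholder: speaker-embedding width of the base model

def prepare_speaker_finetune(net_g):
    # Drop the existing speaker table and create a fresh one for your speakers.
    net_g.emb_g = nn.Embedding(N_NEW_SPEAKERS, GIN_CHANNELS)

    # Freeze everything except the new speaker embeddings
    # (recommended when your speakers have little paired data).
    for p in net_g.parameters():
        p.requires_grad = False
    for p in net_g.emb_g.parameters():
        p.requires_grad = True

    # Hand only the trainable parameters to the optimizer
    # (hyperparameters here mirror the stock VITS configs; adjust as needed).
    return torch.optim.AdamW(
        (p for p in net_g.parameters() if p.requires_grad),
        lr=2e-4, betas=(0.8, 0.99), eps=1e-9)
```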
For language fine-tuning, you need to replace the phoneme embedding with one for your target language, and continue to train all layers until the model converges.
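Correspondingly, a sketch of the phoneme-table swap, assuming the standard VITS `TextEncoder` where the table sits at `net_g.enc_p.emb` and `hidden_channels` matches your config (192 in the stock configs):

```python
import torch.nn as nn

def replace_phoneme_embedding(net_g, new_vocab_size, hidden_channels=192):
    # Swap in a fresh phoneme table sized to the new language's symbol set.
    # All other layers stay trainable so the whole model can adapt.
    net_g.enc_p.emb = nn.Embedding(new_vocab_size, hidden_channels)
    # Initialization mirrors the one VITS uses for its text embedding.
    nn.init.normal_(net_g.enc_p.emb.weight, 0.0, hidden_channels ** -0.5)
    return net_g
```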
You may also need the Discriminator module (D_*.pth) for fine-tuning, but I'm sorry that the Discriminator of the current model has been deleted (to save disk space), so I'm unable to publish it.
However, a trilingual model is under training right now. Once it is finished I'll create a repo specifically for fine-tuning, and I'll notify you when it is done.
Thanks for the detailed answer. We really like your model, and we discuss it and share voice samples here: https://2ch.hk/ai/res/88212.html#92244 . We want to create something similar for Russian. Could you share what documentation / articles / videos you followed for this, apart from PyTorch of course? And good luck with your new goals.
To train the plain vanilla version of the VITS model, simply follow the guide in the original repository https://github.com/jaywalnut310/vits; the details can be found in the published paper.
To adapt it to other languages, you may refer to https://github.com/CjangCjengh/vits, which has provided the preprocessors (cleaners) for Japanese, Chinese, and Korean.
I'm sorry, but I haven't seen anyone create a language cleaner for Russian. However, I think you can refer to previous work on Russian TTS and see how it phonemizes the language.
Once the language is converted to discrete phonemes, the rest of the training process will be exactly the same.
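For illustration only, here is a toy Russian cleaner in the shape the VITS repos above expect (a function from raw text to a phoneme string). The `russian_cleaners` name and the letter-to-IPA table are my own placeholders; Russian spelling is close to phonetic, so a grapheme table is a workable start, but a serious cleaner would also need stress placement and vowel reduction:

```python
import re

# Toy grapheme-to-phoneme table; the IPA values are rough approximations.
_RU_TO_IPA = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'je',
    'ё': 'jo', 'ж': 'ʐ', 'з': 'z', 'и': 'i', 'й': 'j', 'к': 'k',
    'л': 'l', 'м': 'm', 'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r',
    'с': 's', 'т': 't', 'у': 'u', 'ф': 'f', 'х': 'x', 'ц': 'ts',
    'ч': 'tɕ', 'ш': 'ʂ', 'щ': 'ɕː', 'ъ': '', 'ы': 'ɨ', 'ь': 'ʲ',
    'э': 'e', 'ю': 'ju', 'я': 'ja',
}

def russian_cleaners(text):
    text = text.lower()
    text = re.sub(r'[^а-яё ]', '', text)  # keep only Cyrillic letters and spaces
    return ''.join(_RU_TO_IPA.get(ch, ch) for ch in text)

# >>> russian_cleaners('Привет, мир!')
# 'privjet mir'
```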
Good luck, and I hope you succeed in training your own model!