Support for Voice Cloning
Are there any plans on adding voice-cloning functionality?
Probably no voice cloning on the horizon, unless enormous amounts of compute and data fall into my lap. I know datasets like Emilia exist, but so far I'm unwilling to introduce CC BY-NC data into Kokoro's training mix. And unless you buy high-quality data in large quantities, you typically compromise data quality when you scale up, which for TTS could translate to artifacts, noise, and less stability on the "default" speakers. There are definitely research solutions to that, like pretraining/posttraining regimes, but they're out of scope for now.
But there is a feature on the to-do list that allows "cooking" hybrid speakers by mixing known speakers together, which I will try to ship soon (TM). For example, the current default speaker 🇺🇸 🚺 American Female ⭐ is simply a 50-50 mix of Bella and Sarah (see the sketch below).
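For the curious, here's a minimal sketch of what that mixing amounts to, assuming voices are stored as the usual `.pt` voicepack tensors; the file paths and the blend weight are illustrative:

```python
import torch

# Load two existing voicepacks (style tensors shipped with the model).
# Paths are illustrative; point them at wherever your voice files live.
bella = torch.load("voices/af_bella.pt", weights_only=True)
sarah = torch.load("voices/af_sarah.pt", weights_only=True)

# A hybrid speaker is just a weighted average of the voice tensors.
# A 50-50 blend is what the current default American Female voice is.
weight = 0.5
hybrid = weight * bella + (1 - weight) * sarah

torch.save(hybrid, "voices/af_hybrid.pt")
```

You can then pass the saved hybrid voicepack to inference exactly like any stock voice, and sweep `weight` to taste.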
Here's a relevant memo I wrote earlier:
Currently, Kokoro does not have an effective voice cloning capability. In my estimation, effective voice cloning (aka zero-shot) requires seeing thousands of speakers in training, and depending on your definition of "effective" it could be an OOM higher than that.
E2/F5 is okay at voice cloning: it has 330M params and was trained on Emilia, which IIRC is ~100k hours of audio. So is XTTSv2, which has 460M params and was trained on a commensurate amount of audio, I'm sure. Meanwhile, Kokoro is 80M params and was trained on OOMs less audio. In this context, you can imagine that the number of speakers seen is correlated with the total duration of audio.
It would not surprise me if voice cloning is simply "looking up" the most similar speaker, or an interpolation of speakers, seen in training. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks. If you belong to this school of thought (and likely even if you don't), then you know there is a vast difference in voice cloning capability between a model that has trained on 100 hours of audio vs one that has seen 100k hours.
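To make that "lookup/interpolation" intuition concrete, here's a toy sketch, not how any particular model is actually implemented: given a speaker embedding extracted from a reference clip, cloning could amount to blending the embeddings of the most similar speakers seen in training. The function name, the `known_voices` dict, and `top_k` are all hypothetical.

```python
import torch
import torch.nn.functional as F

def interpolate_voice(reference: torch.Tensor,
                      known_voices: dict[str, torch.Tensor],
                      top_k: int = 3) -> torch.Tensor:
    """Toy illustration: approximate an unseen voice as a similarity-weighted
    blend of the closest training voices. Purely conceptual."""
    names = list(known_voices)
    stack = torch.stack([known_voices[n].flatten() for n in names])  # (N, D)
    # Cosine similarity between the reference and every known voice.
    sims = F.cosine_similarity(stack, reference.flatten().unsqueeze(0))
    weights, idx = sims.topk(top_k)
    weights = torch.softmax(weights, dim=0)  # normalize into blend weights
    return sum(w * known_voices[names[i]] for w, i in zip(weights, idx.tolist()))
```

The point is that a model trained on a handful of speakers has only a tiny "span" of voices to pull from, whereas one trained on 100k hours has a far richer space to interpolate within.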