Use voice=None to select random voice for hexgrad/kokoro

#14
by hexgrad - opened

Using a single voice introduces the risk that people are voting for or against a model simply because they like or dislike the particular sound of that one voice.

While randomizing from a pool of voices doesn't inherently solve this problem, it mitigates the risk of "overfitting" to a single voice and hopefully provides better signal on the underlying model itself.

It also makes models harder to identify from the sound of the voice alone (i.e. the first second of audio) rather than from the quality of the entire utterance.

See https://huggingface.co/spaces/hexgrad/kokoro/commit/33faceb2e2b30a2493d2298f806cc17aae784250
Edit: Added Adam back in, so the array of 8 voices is balanced per the Artificial Analysis (AA) methodology below. See https://huggingface.co/spaces/hexgrad/kokoro/commit/7562a6f1fc2c0eff0f4b144e8f8d258a18f24839

Related methodology from https://artificialanalysis.ai/text-to-speech/methodology

For each model we select 2 voices of each combination of Male and Female, and US and UK accents (8 combinations in total). Where a gender and accent is not available, we exclude this combination from evaluation in the Speech Arena.
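To make the idea concrete, here is a rough sketch of how `voice=None` could pick from a balanced pool of 8 voices (2 per gender and accent combination). This is not the Space's actual code: the voice names, `VOICE_POOL`, and `resolve_voice` are hypothetical placeholders used only for illustration.

```python
import random

# Hypothetical pool: two voices per (gender, accent) combination.
# The names are placeholders, not the Space's real voice identifiers.
VOICE_POOL = {
    ("female", "US"): ["us_female_1", "us_female_2"],
    ("male", "US"): ["us_male_1", "us_male_2"],
    ("female", "UK"): ["uk_female_1", "uk_female_2"],
    ("male", "UK"): ["uk_male_1", "uk_male_2"],
}

# Flatten the buckets so every voice is equally likely to be drawn.
ALL_VOICES = [name for bucket in VOICE_POOL.values() for name in bucket]


def resolve_voice(voice=None, rng=random):
    """Return the requested voice, or a uniformly random one if voice is None."""
    if voice is None:
        return rng.choice(ALL_VOICES)
    if voice not in ALL_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    return voice
```

Because each bucket holds the same number of voices, a uniform draw over the flattened list is also balanced across gender and accent.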

Closing because all models in this arena seem to be using American female voices. If Kokoro is the only model that uses male and/or British voices, it would no longer be an apples-to-apples comparison.

On my end, I'll map af_0 to a random American female voice; right now there are two normal/stable ones. Hopefully this is fine.

Edit 1: See https://huggingface.co/spaces/hexgrad/kokoro/commit/42f9149edb74cde04490b4c86fe10e468dbbc0d7
Edit 2: Since arena sample audios are cached, adjusting the API on my end has no immediate effect
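Roughly, the af_0 mapping could look like the sketch below. It is only an illustration under assumptions: `STABLE_US_FEMALE`, `VOICE_ALIASES`, `resolve_alias`, and the two voice names are hypothetical and do not come from the linked commit.

```python
import random

# Assumed: the two stable American female voices. Names are placeholders.
STABLE_US_FEMALE = ["us_female_1", "us_female_2"]

# Hypothetical alias table: an alias expands to a pool of concrete voices.
VOICE_ALIASES = {"af_0": STABLE_US_FEMALE}


def resolve_alias(voice, rng=random):
    """Expand a known alias to a random voice from its pool; pass others through."""
    pool = VOICE_ALIASES.get(voice)
    if pool:
        return rng.choice(pool)
    return voice
```

A caller that keeps requesting af_0 would then get one of the two voices per request, while any other voice name is returned unchanged.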

hexgrad changed pull request status to closed

On my end, I'll map af_0 to a random American female voice; right now there are two normal/stable ones. Hopefully this is fine.

Sure, you can do that, but it may damage the results DB, since you can't tell whether voters disliked the delivery or the voice itself.

For example, xVAsynth probably has the highest-pitched voice, and it easily becomes very unnatural when it gets emotional.

At some point I'd like to have multiple voices per model and a gender selector.
