Trying to hack together a voice-cloning demo.
I've been trying to create my own custom speaker embeddings using speechbrain/spkrec-xvect-voxceleb:
signal, fs = torchaudio.load('morgan.wav')
embeddings = classifier.encode_batch(signal)
and generating audio using:
speech = model.generate_speech(inputs["input_ids"], embeddings[0], vocoder=vocoder)
but the output comes out garbled. Is there an intermediate step I'm missing?
So I managed to get non-garbled output after resampling my wav file and converting it to mono. Now to figure out how to improve the quality of the voice reproduction.
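For reference, the preprocessing step described above can be sketched as below. This is a minimal NumPy sketch (the function name and shapes are illustrative, not from the thread): the x-vector model behind spkrec-xvect-voxceleb expects 16 kHz mono input, so a stereo file at another rate needs downmixing and resampling first. Linear interpolation stands in here for a proper anti-aliased resampler; in practice something like torchaudio.functional.resample would be used on the tensor returned by torchaudio.load.

```python
import numpy as np

def preprocess_for_xvector(signal, orig_sr, target_sr=16000):
    """Downmix to mono and resample to target_sr.

    Linear interpolation is a stand-in for a proper polyphase
    resampler; it illustrates the step, not production quality.
    """
    # Downmix: average the channels if the signal is (channels, samples).
    if signal.ndim == 2:
        signal = signal.mean(axis=0)
    # Resample by interpolating onto the target time grid.
    duration = len(signal) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.arange(len(signal)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, signal)

# Example: 1 second of stereo audio at 44.1 kHz becomes 16 000 mono samples.
stereo = np.random.randn(2, 44100).astype(np.float32)
mono16k = preprocess_for_xvector(stereo, 44100)
```

The resulting 1-D array (or a tensor made from it) is what would then be passed to classifier.encode_batch.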
Hi, thanks for your interest.
Per model.generate_speech, src_tokens is required, so we recommend calling it as follows:
speech = model.generate_speech(src_tokens=inputs["input_ids"], spkembs=embeddings[0], ...)
Feel free to ask any additional questions.
Does inputs["input_ids"] denote words? It seems to be a waveform.