
Amphion Singing Voice Conversion (SVC) Recipe

Quick Start

We provide a beginner recipe that demonstrates how to train a cutting-edge SVC model. It is also an official implementation of the paper "Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion" (2024 IEEE Spoken Language Technology Workshop). Some demos can be seen here.

Supported Model Architectures

The main idea of SVC is to first disentangle the speaker-agnostic representations from the source audio, and then inject the desired speaker information to synthesize the target. The synthesis stage usually consists of an acoustic decoder followed by a waveform synthesizer (vocoder).
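The data flow described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Amphion's implementation. Every name here (`extract_content_features`, `SpeakerLookupTable`, `acoustic_decoder`, `vocoder`, `convert`) is hypothetical, and each stage is a trivial stand-in for a trained neural network.

```python
from dataclasses import dataclass
from typing import Dict, List

Frame = List[float]
FRAME_SIZE = 4  # samples per frame; an arbitrary value for this toy example


def extract_content_features(waveform: List[float]) -> List[Frame]:
    """Hypothetical stand-in for a speaker-agnostic encoder.

    A real system would use a pretrained model; here we only split
    the waveform into fixed-size frames.
    """
    n_frames = len(waveform) // FRAME_SIZE
    return [waveform[i * FRAME_SIZE:(i + 1) * FRAME_SIZE] for i in range(n_frames)]


@dataclass
class SpeakerLookupTable:
    """Hypothetical speaker look-up table: speaker ID -> embedding vector."""
    table: Dict[str, Frame]

    def __call__(self, speaker_id: str) -> Frame:
        return self.table[speaker_id]


def acoustic_decoder(content: List[Frame], speaker_embedding: Frame) -> List[Frame]:
    """Toy decoder: inject speaker information by adding the embedding to each frame."""
    return [[c + s for c, s in zip(frame, speaker_embedding)] for frame in content]


def vocoder(acoustic_features: List[Frame]) -> List[float]:
    """Toy vocoder: flatten frame-level features back into a sample sequence."""
    return [sample for frame in acoustic_features for sample in frame]


def convert(waveform: List[float], target_speaker: str,
            speakers: SpeakerLookupTable) -> List[float]:
    """Run the three stages in order: content -> decoder -> vocoder."""
    content = extract_content_features(waveform)
    embedding = speakers(target_speaker)
    acoustic = acoustic_decoder(content, embedding)
    return vocoder(acoustic)


# Made-up embeddings and a made-up source signal, for illustration only.
speakers = SpeakerLookupTable({
    "singer_a": [0.1, 0.1, 0.1, 0.1],
    "singer_b": [-0.2, -0.2, -0.2, -0.2],
})
source = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
converted = convert(source, "singer_a", speakers)
```

The point of the sketch is the separation of concerns: the content features never see the speaker identity, so switching the target only changes which embedding is passed to the decoder.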



Amphion SVC currently supports the following features and models:

  • Speaker-agnostic Representations:
  • Speaker Embeddings:
    • Speaker Look-Up Table.
    • Reference Encoder (👨‍💻 developing): It can be used for zero-shot SVC.
  • Acoustic Decoders:
    • Diffusion-based models:
    • Transformer-based models:
      • TransformerSVC: An encoder-only, non-autoregressive Transformer architecture.
    • VAE- and Flow-based models:
      • VitsSVC: A VITS-like model whose textual input is replaced by the content features, similar to so-vits-svc.
  • Waveform Synthesizers (Vocoders):