DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

DiffSpeech (TTS)

1. Preparation

Data Preparation

a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and Unzip the ground-truth duration extracted by MFA: tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.

Vocoder Preparation

We provide the pre-trained model of HifiGAN vocoder. Please unzip this file into checkpoints before training your acoustic model.

2. Training Example

First, you need a pre-trained FastSpeech2 checkpoint. You can use the pre-trained model, or train FastSpeech2 from scratch, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset

Then, to train DiffSpeech, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset

Remember to adjust the "fs2_ckpt" parameter in usr/configs/lj_ds_beta6.yaml to fit your path.

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer

We also provide:

the pre-trained model of DiffSpeech;
the individual pre-trained model of FastSpeech 2 for the shallow diffusion mechanism in DiffSpeech;

Remember to put the pre-trained models in checkpoints directory.

Mel Visualization

Along vertical axis, DiffSpeech: [0-80]; FastSpeech2: [80-160].

DiffSpeech vs. FastSpeech 2