# FAcodec

PyTorch implementation for the training of FAcodec, which was proposed in the paper [NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models](https://arxiv.org/pdf/2403.03100).

A dedicated repository for the FAcodec model can also be found [here](https://github.com/Plachtaa/FAcodec).

This implementation makes several key improvements to the training pipeline, eliminating the requirement for any form of annotation, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a checkpoint pretrained on 50k hours of speech data covering over 1 million speakers.

## Model storage

We provide a checkpoint pretrained on 50k hours of speech data.

| Model type | Link |
|------------|------|
| FAcodec | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAcodec-blue)](https://huggingface.co/Plachta/FAcodec) |

## Demo

Try our model on [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/FAcodecV2)!

## Training

Prepare your data and put it under one folder; the internal file structure does not matter.
Then, change the `dataset` entry in `./egs/codec/FAcodec/exp_custom_data.json` to the path of your data folder (a sketch of this edit follows the command below).
Finally, run the following command:
```bash
sh ./egs/codec/FAcodec/train.sh
```
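
For reference, the snippet below shows one way to point the config at your data programmatically. Only the config path and the `dataset` entry come from the instructions above; the overall layout of `exp_custom_data.json` and the example folder path are assumptions, so check the actual file in the repo.

```python
import json

# Hypothetical helper: update the `dataset` entry of the FAcodec experiment
# config so it points at a folder of raw speech files. The config path and the
# `dataset` key are taken from the instructions above; the example folder path
# is illustrative only.
cfg_path = "./egs/codec/FAcodec/exp_custom_data.json"

with open(cfg_path, "r") as f:
    cfg = json.load(f)

cfg["dataset"] = "/data/my_speech_corpus"  # raw speech files, any internal layout

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```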
## Inference

To reconstruct a speech file, run:
```bash
python ./bins/codec/inference.py --source <source_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
```
To use zero-shot voice conversion, run:
```bash
python ./bins/codec/inference.py --source <source_wav> --reference <reference_wav> --output_dir <output_dir> --checkpoint_path <checkpoint_path>
```
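
To obtain a `<checkpoint_path>` for the commands above, one option is to download the pretrained model from the Hugging Face repository listed in the Model storage section. The sketch below uses `huggingface_hub`; the exact checkpoint filename inside the repository is not specified here, so inspect the downloaded folder and point `--checkpoint_path` at the appropriate file.

```python
from huggingface_hub import snapshot_download

# Download the pretrained FAcodec repository from Hugging Face.
# The repo id comes from the Model storage table above; the checkpoint
# filename inside the snapshot should be checked after downloading.
local_dir = snapshot_download(repo_id="Plachta/FAcodec")
print(f"Checkpoint files downloaded to: {local_dir}")
```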
## Feature extraction

When running `./bins/codec/inference.py`, check the results returned by the `FAcodecInference` class, which form a tuple `(quantized, codes)`:

- `quantized` is the quantized representation of the input speech file.
  - `quantized[0]` is the quantized representation of prosody
  - `quantized[1]` is the quantized representation of content
- `codes` is the discrete code representation of the input speech file.
  - `codes[0]` is the discrete code representation of prosody
  - `codes[1]` is the discrete code representation of content

For the cleanest content representation, free of any timbre, we suggest using `codes[1][:, 0, :]`, which is the first layer of the content codebooks.
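
A minimal sketch of how these outputs could be sliced is shown below. It assumes the `(quantized, codes)` tuple described above has already been obtained from the inference script; only the indexing documented in this section is relied on, so verify shapes against `./bins/codec/inference.py`.

```python
def split_facodec_outputs(quantized, codes):
    """Split the (quantized, codes) tuple described above into named parts.

    `quantized` and `codes` are assumed to be the outputs obtained through
    `FAcodecInference` (see ./bins/codec/inference.py); only the indexing
    documented in this README is used here.
    """
    prosody_quantized, content_quantized = quantized[0], quantized[1]
    prosody_codes, content_codes = codes[0], codes[1]

    # Cleanest timbre-free content tokens: first layer of the content
    # codebooks, i.e. codes[1][:, 0, :].
    clean_content = content_codes[:, 0, :]
    return prosody_quantized, content_quantized, prosody_codes, clean_content
```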