# FAcodec

PyTorch implementation for training FAcodec, which was proposed in the paper [NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models](https://arxiv.org/pdf/2403.03100).

A dedicated repository for the FAcodec model can also be found [here](https://github.com/Plachtaa/FAcodec).

This implementation makes some key improvements to the training pipeline, eliminating the need for any form of annotation, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, along with a checkpoint pretrained on 50k hours of speech data from over 1 million speakers.

## Model storage

We provide a checkpoint pretrained on 50k hours of speech data.

| Model type | Link |
|------------|------|
| FAcodec    | [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-FAcodec-blue)](https://huggingface.co/Plachta/FAcodec) |

## Demo

Try our model on [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Plachta/FAcodecV2)!

## Training

Put all of your speech files under one folder; the internal file structure does not matter.
Then, change the `dataset` entry in `./egs/codec/FAcodec/exp_custom_data.json` to the path of your data folder.
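
If you prefer to script the `dataset` change rather than edit the file by hand, the following is a minimal sketch. It assumes that `dataset` is a top-level key in a plain JSON file and that its value is the data folder path; check the actual layout of `exp_custom_data.json` before relying on it. The helper name `set_dataset_path`, the `other` key, and the example path are hypothetical, and the demonstration runs on a temporary file so the real config is left untouched.

```python
import json
import tempfile
from pathlib import Path


def set_dataset_path(config_path, data_dir):
    """Set the `dataset` entry of a JSON config to `data_dir` and save it."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: `dataset` is a top-level key holding the data folder path.
    config["dataset"] = str(data_dir)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Demonstration on a throwaway copy; the keys below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "exp_custom_data.json"
    demo_config.write_text(json.dumps({"dataset": "", "other": 1}), encoding="utf-8")
    updated = set_dataset_path(demo_config, "/path/to/your/speech")
    reloaded = json.loads(demo_config.read_text(encoding="utf-8"))
```

All other keys in the file are written back unchanged, so only the `dataset` entry differs after the call.
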
Finally, run the following command:

```bash
sh ./egs/codec/FAcodec/train.sh
```

## Inference

To reconstruct a speech file, run the following, supplying a value for each flag:

```bash
python ./bins/codec/inference.py --source --output_dir --checkpoint_path
```

To use zero-shot voice conversion, run the following, again supplying a value for each flag:

```bash
python ./bins/codec/inference.py --source --reference --output_dir --checkpoint_path
```

## Feature extraction

When running `./bins/codec/inference.py`, check the results returned by the `FAcodecInference` class: a tuple of `(quantized, codes)`.

- `quantized` is the quantized representation of the input speech file.
  - `quantized[0]` is the quantized representation of prosody
  - `quantized[1]` is the quantized representation of content
- `codes` is the discrete code representation of the input speech file.
  - `codes[0]` is the discrete code representation of prosody
  - `codes[1]` is the discrete code representation of content

For the cleanest content representation, free of timbre, we suggest using `codes[1][:, 0, :]`, which is the first layer of the content codebooks.
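
To make the suggested indexing concrete, here is a small sketch that applies `codes[1][:, 0, :]` to stand-in arrays. The real return values are presumably PyTorch tensors, which accept the same slicing syntax; NumPy is used here only so the example runs without a checkpoint. The shape `(batch, codebook_layers, frames)`, the layer counts, and the codebook size are assumptions for illustration, not values confirmed by the repository.

```python
import numpy as np

# Assumed sizes for illustration only; real values depend on the checkpoint.
batch, n_frames = 1, 50
n_prosody_layers, n_content_layers = 1, 2
codebook_size = 1024

rng = np.random.default_rng(0)

# Stand-in for the `codes` part of the (quantized, codes) tuple:
# codes[0] holds prosody codes, codes[1] holds content codes.
codes = (
    rng.integers(0, codebook_size, size=(batch, n_prosody_layers, n_frames)),
    rng.integers(0, codebook_size, size=(batch, n_content_layers, n_frames)),
)

prosody_codes, content_codes = codes

# Keep every batch item and every frame, but only codebook layer 0.
first_content_layer = content_codes[:, 0, :]

print(content_codes.shape)        # all content layers
print(first_content_layer.shape)  # layer axis removed
```

Indexing the middle axis with an integer drops that axis, so the result has one dimension fewer than `codes[1]`.
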