# EnCodec: High Fidelity Neural Audio Compression

AudioCraft provides the training code for EnCodec, a state-of-the-art deep learning
based audio codec supporting both mono and stereo audio, presented in the
[High Fidelity Neural Audio Compression][arxiv] paper.
Check out our [sample page][encodec_samples].

## Original EnCodec models

The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed
and used with the [EnCodec repository](https://github.com/facebookresearch/encodec).

**Note**: We do not guarantee compatibility between the AudioCraft and EnCodec codebases
and released checkpoints at this stage.

## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

## Training

The [CompressionSolver](../audiocraft/solvers/compression.py) implements the audio reconstruction
task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization
bottleneck (for EnCodec, a SEANet encoder-decoder with a Residual Vector Quantization bottleneck)
using a combination of objective and perceptual losses, the latter in the form of discriminators.
The default configuration matches a causal EnCodec training at a single bandwidth.
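
As background, Residual Vector Quantization (RVQ) quantizes the encoder latents with a
cascade of codebooks, each one quantizing the residual error left by the previous one.
The snippet below is a minimal illustrative sketch of that idea only; `rvq_encode` is a
hypothetical helper, not AudioCraft's actual quantization code:

```python
import torch

def rvq_encode(latents: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Toy RVQ encoder: `latents` is [T, D], each codebook is [N, D].
    Returns one row of code indices per codebook, shape [n_q, T]."""
    residual = latents
    all_indices = []
    for codebook in codebooks:
        # Pick the nearest codebook entry (L2 distance) for each frame.
        indices = torch.cdist(residual, codebook).argmin(dim=-1)  # [T]
        # The next codebook only sees what this one failed to capture.
        residual = residual - codebook[indices]
        all_indices.append(indices)
    return torch.stack(all_indices)
```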

### Example configurations and grids

We provide sample configurations and grids for training EnCodec models.
The compression configurations are defined in
[config/solver/compression](../config/solver/compression).
The example grids are available at
[audiocraft/grids/compression](../audiocraft/grids/compression).
```shell
# base causal EnCodec model on monophonic audio sampled at 24 kHz
dora grid compression.encodec_base_24khz
# EnCodec model used for MusicGen on monophonic audio sampled at 32 kHz
dora grid compression.encodec_musicgen_32khz
```

### Training and valid stages

The model is trained using a combination of objective and perceptual losses.
More specifically, EnCodec is trained with the MS-STFT discriminator along with
objective losses, using a loss balancer to weight the different losses in an
intuitive manner.
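
To give an intuition for what the balancer does, here is a minimal conceptual sketch,
not AudioCraft's actual implementation: each loss's gradient with respect to the model
output is renormalized before weighting, so the weights express relative importance
regardless of each loss's natural scale. `balanced_backward` is a hypothetical name
used for illustration only.

```python
import torch

def balanced_backward(losses: dict[str, torch.Tensor], output: torch.Tensor,
                      weights: dict[str, float], eps: float = 1e-8) -> None:
    """Toy loss balancer: backpropagate a weighted sum of per-loss gradients,
    each normalized to unit norm w.r.t. the model output."""
    total = sum(weights[name] for name in losses)
    combined = torch.zeros_like(output)
    for name, loss in losses.items():
        # Gradient of this loss w.r.t. the model output only.
        grad, = torch.autograd.grad(loss, [output], retain_graph=True)
        combined += (weights[name] / total) * grad / (grad.norm() + eps)
    # Propagate the combined, renormalized gradient through the rest of the model.
    output.backward(combined)
```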

### Evaluation stage

Evaluation metrics for audio reconstruction:
* SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
* ViSQOL: Virtual Speech Quality Objective Listener.

Note: the path to the ViSQOL binary (compiled with Bazel) needs to be provided in
order to run the ViSQOL metric on the reference and degraded signals.
The metric is disabled by default.
Please refer to the [metrics documentation](../METRICS.md) to learn more.
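
For reference, SI-SNR has a simple closed form. Below is an illustrative implementation
of the standard definition (the solver uses its own implementation from the metrics
module):

```python
import torch

def si_snr(estimate: torch.Tensor, reference: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-Invariant SNR in dB, for waveforms shaped [..., time]."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to factor out any gain mismatch.
    dot = (estimate * reference).sum(dim=-1, keepdim=True)
    s_target = dot * reference / (reference.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps)
```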

### Generation stage

The generation stage consists of generating reconstructed audio from samples
with the current model. The number of samples generated and the batch size used are
controlled by the `dataset.generate` configuration. The output path and audio formats
are defined in the generate stage configuration.
```shell
# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# write the generated samples to a different path (inside the dora xp folder)
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4
```

### Playing with the model

Once you have a trained model, you can retrieve either the entire solver or just
the trained model with the following functions:
```python
from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path.
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')

# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here, do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')

# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# The following line enables detailed logs when loading a Solver; comment it out to silence them.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
# You must always run the following function from the root directory.
os.chdir(Path(train.__file__).parent.parent)
# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
```
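
As a quick sanity check, a loaded model can round-trip audio through its discrete codes.
This sketch assumes the `encode`/`decode` API of `CompressionModel`, with `encode`
returning the discrete codes together with an optional rescaling factor:

```python
import torch
from audiocraft.models import CompressionModel

model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model.eval()
# One second of dummy audio, shaped [batch, channels, time] at the model's sample rate.
wav = torch.randn(1, model.channels, model.sample_rate)
with torch.no_grad():
    codes, scale = model.encode(wav)  # discrete codes, shape [B, n_q, T_frames]
    reconstruction = model.decode(codes, scale)
print(codes.shape, reconstruction.shape)
```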

### Importing / Exporting models

At the moment we do not have a definitive workflow for exporting EnCodec models, for
instance to Hugging Face (HF). We are working on supporting automatic conversion between
the AudioCraft and Hugging Face implementations.

We still have some support for fine-tuning an EnCodec model coming from HF in AudioCraft,
using for instance `continue_from=//pretrained/facebook/encodec_32khz`.

An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer etc.)
using `audiocraft.utils.export.export_encodec`. For instance, you could run
```python
from audiocraft.utils import export
from audiocraft import train

xp = train.main.get_xp_from_sig('SIG')
export.export_encodec(
    xp.folder / 'checkpoint.th',
    '/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.models import CompressionModel
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.solvers import CompressionSolver
# The two are strictly equivalent, but this function also supports loading from
# models that have not already been exported.
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')
```

We will then see how to use this model as a tokenizer for MusicGen/AudioGen in the
[MusicGen documentation](./MUSICGEN.md).

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).

## Citation

```
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```

## License

See license information in the [README](../README.md).

[arxiv]: https://arxiv.org/abs/2210.13438
[encodec_samples]: https://ai.honu.io/papers/encodec/samples.html