# Amphion Text-to-Audio (TTA) Recipe

## Quick Start

We provide a **[beginner recipe](RECIPE.md)** that demonstrates how to train a cutting-edge TTA model. The model is a latent diffusion model in the style of [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830).

## Supported Model Architectures

Amphion currently supports one latent-diffusion-based text-to-audio model:

<br>
<div align="center">
<img src="../../imgs/tta/DiffusionTTA.png" width="65%">
</div>
<br>

Similar to [AUDIT](https://arxiv.org/abs/2304.00830), we train it in two stages:

1. Train the VAE, which is called `AutoencoderKL` in Amphion.
2. Train the conditional latent diffusion model, which is called `AudioLDM` in Amphion.
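
To make the second stage more concrete, here is a minimal pure-Python sketch of one diffusion training step on a latent produced by a frozen stage-one encoder. It is not Amphion's implementation. The linear noise schedule, the noise-prediction (epsilon) loss, and every name in it (`toy_encode`, `q_sample`, `noise_prediction_loss`, and so on) are assumptions chosen for illustration, following the common DDPM-style formulation. The encoder is a trivial stand-in for a trained `AutoencoderKL`.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def toy_encode(mel_frame):
    """Hypothetical stand-in for the frozen stage-one encoder.

    Averages adjacent pairs, so the latent is half the input length.
    """
    return [(mel_frame[i] + mel_frame[i + 1]) / 2.0
            for i in range(0, len(mel_frame) - 1, 2)]


def q_sample(z0, t, alpha_bars, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * z + b * e for z, e in zip(z0, eps)], eps


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Stage 1 output: encode a (toy) mel frame into a latent with the frozen encoder.
z0 = toy_encode([0.1, 0.3, -0.2, 0.4, 0.0, 0.6, 0.5, -0.5])

# Stage 2 step: noise the latent at a chosen timestep and score a prediction.
t = 500
zt, eps = q_sample(z0, t, alpha_bars, rng)

# A real denoiser would take (zt, t, text_embedding); an all-zero
# prediction is used here only as a baseline.
loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In this sketch the diffusion model never sees the raw audio features, only the latent `z0`. That is why the VAE is trained first and then held fixed while the latent diffusion model is trained.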