license: mit
tags:
  - text-to-audio
  - controlnet

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio

🎮 EzAudio-ControlNet is available: EzAudio-ControlNet

We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform.

Installation

Clone the repository:

git clone git@github.com:haidog-yaqub/EzAudio.git

Install the dependencies:

cd EzAudio
pip install -r requirements.txt

Download checkpoints from: https://huggingface.co/OpenSound/EzAudio
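Before loading the model, it can help to confirm that the downloaded files sit where the usage example expects them. The following is a minimal sketch, not part of the EzAudio codebase: the helper name `missing_checkpoints` is hypothetical, and the file list is simply copied from the paths used in the Usage section, so adjust it if your layout differs.

```python
from pathlib import Path

# Paths copied from the Usage section of this README (an assumption about
# your local layout, not something enforced by the repository).
EXPECTED_FILES = [
    "ckpts/ezaudio-xl.yml",
    "ckpts/s3/ezaudio_s3_xl.pt",
    "ckpts/vae/1m.pt",
]


def missing_checkpoints(root=".", expected=EXPECTED_FILES):
    """Return the expected files that do not exist under `root`."""
    return [p for p in expected if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoint files found.")
```

Run it from the repository root after downloading; any path it lists still needs to be placed under `ckpts/`.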

Usage

You can use the model with the following code:

import torch  # needed for the torch.cuda.is_available() check below

from api.ezaudio import load_models, generate_audio

# model and config paths
config_name = 'ckpts/ezaudio-xl.yml'
ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
vae_path = 'ckpts/vae/1m.pt'
# save_path = 'output/'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model
(autoencoder, unet, tokenizer,
 text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
                                                      vae_path, device)

prompt = "a dog barking in the distance"
sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer, text_encoder, noise_scheduler, params, device)
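The example above returns a sample rate and a waveform but does not write anything to disk. One way to save the result, using only the Python standard library, is sketched below. The helper `save_wav` is hypothetical, and it assumes `audio` is a mono, one-dimensional sequence of floats in [-1, 1]. The actual return type of `generate_audio` is not documented here, so a tensor or a multi-channel array may need to be moved to the CPU and flattened first.

```python
import array
import sys
import wave


def save_wav(path, sr, audio):
    """Write a mono float waveform to a 16-bit PCM WAV file.

    Assumption: `audio` is a 1-D sequence of floats in [-1, 1].
    Values outside that range are clipped.
    """
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(x))) * 32767) for x in audio),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 2 bytes per sample = 16-bit
        f.setframerate(int(sr))
        f.writeframes(pcm.tobytes())
```

With the variables from the usage example, a call would look like `save_wav('output.wav', sr, audio)`.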

Todo

  • Release Gradio Demo along with checkpoints: EzAudio Space
  • Release ControlNet Demo along with checkpoints: EzAudio ControlNet Space
  • Release inference code
  • Release checkpoints for stage1 and stage2
  • Release training pipeline and dataset

Reference

If you find the code useful for your research, please consider citing:

@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}

Acknowledgement

Some code is borrowed from or inspired by: U-ViT, PixArt, Hunyuan-DiT, and Stable Audio.