{}
AudioLDM
AudioLDM is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input. It is available in the 🧨 Diffusers library from v0.14.0 onwards.
Model Details
AudioLDM was proposed in the paper AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al.
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
Model Sources
Usage
First, install the required packages:
pip install --upgrade git+https://github.com/huggingface/diffusers git+https://github.com/huggingface/transformers scipy
Text-to-Audio
For text-to-audio generation, the AudioLDMPipeline can be used to load pre-trained weights and generate text-conditional audio outputs:
from diffusers import AudioLDMPipeline
import torch
repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, height=512).audios[0]
The resulting audio output can be saved as a .wav file:
import scipy
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(audio, 16000))
Tips
- Try to provide descriptive text inputs to AudioLDM. You can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g., "water stream in a forest" instead of "stream").
- It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
- The quality of the predicted audio sample can be controlled by the
num_inference_steps
argument: higher steps give higher quality audio at the expense of slower inference. - The length of the predicted audio sample can be controlled by varying the
height
argument: larger heights give longer spectrograms and thus longer audio samples at the expense of slower inference.
Citation
BibTeX:
@article{liu2023audioldm,
title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2301.12503},
year={2023}
}