---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
from the Microsoft Applied Science Group and UC Berkeley \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**     
**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**     
**[[Project Homepage](https://consistency-tta.github.io)]**     
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**     
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**     
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**


## Description

**2024/06 Updates:**

- We have hosted an interactive live demo of ConsistencyTTA at [🤗 Huggingface](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).
- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! We look forward to meeting you on Kos Island.

This work proposes a *consistency distillation* framework to train
text-to-audio (TTA) generation models that only require a single neural network query,
reducing the computation of the core step of diffusion-based TTA models by a factor of 400.
By incorporating *classifier-free guidance* into the distillation framework,
our models retain diffusion models' impressive generation quality and diversity.
Furthermore, the non-recurrent differentiable structure of the consistency model
allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.
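To make the distillation objective concrete, here is a toy NumPy sketch (not the paper's actual training code; the stand-in teacher and student models are illustrative): a clean sample is noised to a higher time step, the frozen teacher takes one Euler step of the probability-flow ODE down to a lower time step, and the student is trained to produce the same output at both times, with an EMA copy providing the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoiser(x, t):
    # Toy stand-in for the frozen diffusion teacher's denoiser.
    return x / (1.0 + t)

def student(theta, x, t):
    # Toy one-parameter "consistency model"; the real model is a neural network.
    return theta * x / (1.0 + t)

def consistency_loss(theta, theta_ema, x0, t_hi, t_lo):
    # Noise the clean sample to time t_hi (variance-exploding form: x_t = x0 + t * eps).
    eps = rng.standard_normal(x0.shape)
    x_hi = x0 + t_hi * eps
    # One Euler step of the probability-flow ODE (dx/dt = (x - D(x, t)) / t)
    # with the teacher, moving from t_hi down to t_lo.
    d = (x_hi - teacher_denoiser(x_hi, t_hi)) / t_hi
    x_lo = x_hi + (t_lo - t_hi) * d
    # Consistency target: the EMA student's output at the earlier time.
    target = student(theta_ema, x_lo, t_lo)
    pred = student(theta, x_hi, t_hi)
    return float(np.mean((pred - target) ** 2))
```

Minimizing this loss over pairs of adjacent time steps teaches the student to map any point on the teacher's ODE trajectory to the same output, which is what enables single-query generation at inference time.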

<center>
  <img src="main_figure_.png" alt="ConsistencyTTA Results" title="Results" width="480"/>
</center>


## Model Details

We share three model checkpoints:
- [ConsistencyTTA directly distilled from a diffusion model](
  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](
  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](
  https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).

The first two models perform high-quality single-step text-to-audio generation; each generation is 10 seconds long.

After downloading and unzipping the files, place them in the `saved` directory.
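As a minimal sketch (the helper names here are illustrative, not part of the released code), each archive can be fetched from the repository's direct-download `resolve/main` URL, which is the counterpart of the `blob/main` links above, and unpacked into `saved`:

```python
import zipfile
from pathlib import Path

# The three checkpoint archives listed on this page.
REPO = "https://huggingface.co/Bai-YT/ConsistencyTTA"
CHECKPOINTS = ("ConsistencyTTA.zip", "ConsistencyTTA_CLAPFT.zip", "LightweightLDM.zip")

def download_url(filename: str) -> str:
    """Direct-download URL for one of the checkpoint zips."""
    return f"{REPO}/resolve/main/{filename}"

def extract_to_saved(zip_path: str, root: str = "saved") -> None:
    """Unzip a downloaded checkpoint archive into the `saved` directory."""
    Path(root).mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)
```

For example, after downloading `ConsistencyTTA.zip` with a browser or `wget`, calling `extract_to_saved("ConsistencyTTA.zip")` places its contents under `saved/`.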

The training and inference code is available on our [GitHub page](https://github.com/Bai-YT/ConsistencyTTA); please refer to it for usage details.