train latent dm with pre-trained vae from hf hub

Files changed:
- README.md (+10 -1)
- scripts/train_unconditional.py (+14 -3)
README.md CHANGED

@@ -119,11 +119,13 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
 --scheduler ddim
 ```
+
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).

 ## Latent Audio Diffusion
@@ -131,7 +133,14 @@ Rather than de-noising images directly, it is interesting to work in the "latent

 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.

-####
+#### Train latent diffusion model using pre-trained VAE.
+```bash
+accelerate launch ...
+  ...
+  --vae teticio/latent-audio-diffusion-256
+```
+
+#### Install dependencies to train with Stable Diffusion.
 ```
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
 ```
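The `slerp` interpolation mentioned in the DDIM paragraph above is ordinary spherical linear interpolation applied to the noise tensors that `encode` recovers. The repository's own `slerp` and the exact `encode` signature are not shown in this diff, so the following is only a self-contained sketch of the idea:

```python
import torch


def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherically interpolate between two noise tensors.

    Moving along the great circle between the (flattened) tensors keeps the
    interpolant at a norm typical of Gaussian noise, which plain linear
    interpolation does not.
    """
    theta = torch.acos(
        torch.dot(x0.flatten(), x1.flatten()) /
        (torch.norm(x0) * torch.norm(x1)))
    return (torch.sin((1 - alpha) * theta) * x0 +
            torch.sin(alpha * theta) * x1) / torch.sin(theta)
```

In practice the two endpoints would come from running `encode` on two audio samples, and each interpolated tensor would then be de-noised with the deterministic DDIM procedure (`eta=0`) to produce the intermediate audio.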
scripts/train_unconditional.py CHANGED

@@ -11,6 +11,7 @@ from accelerate.logging import get_logger
 from datasets import load_from_disk, load_dataset
 from diffusers import (DiffusionPipeline, DDPMScheduler, UNet2DModel,
                        DDIMScheduler, AutoencoderKL)
+from diffusers.modeling_utils import EntryNotFoundError
 from diffusers.hub_utils import init_git_repo, push_to_hub
 from diffusers.optimization import get_scheduler
 from diffusers.training_utils import EMAModel
@@ -85,7 +86,11 @@ def main(args):

     vqvae = None
     if args.vae is not None:
-        vqvae = AutoencoderKL.from_pretrained(args.vae)
+        try:
+            vqvae = AutoencoderKL.from_pretrained(args.vae)
+        except EnvironmentError:
+            vqvae = LatentAudioDiffusionPipeline.from_pretrained(
+                args.vae).vqvae
         # Determine latent resolution
         with torch.no_grad():
             latent_resolution = vqvae.encode(
@@ -93,10 +98,16 @@
                 resolution)).latent_dist.sample().shape[2:]

     if args.from_pretrained is not None:
-        pipeline =
+        pipeline = {
+            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
+            'AudioDiffusionPipeline': AudioDiffusionPipeline
+        }.get(
+            DiffusionPipeline.get_config_dict(
+                args.from_pretrained)['_class_name'], AudioDiffusionPipeline)
+        pipeline = pipeline.from_pretrained(args.from_pretrained)
         model = pipeline.unet
         if hasattr(pipeline, 'vqvae'):
-            vqvae =
+            vqvae = pipeline.vqvae
     else:
         model = UNet2DModel(
             sample_size=resolution if vqvae is None else latent_resolution,