update README
README.md
CHANGED
@@ -1 +1,151 @@
|
|
1 |
-
1 |
+
---
|
2 |
+
tags:
|
3 |
+
- text-to-image
|
4 |
+
- KOALA
|
5 |
+
---
|
6 |
+
|
7 |
+
<div align="center">
|
8 |
+
<img src="https://dl.dropboxusercontent.com/scl/fi/yosvi68jvyarbvymxc4hm/github_logo.png?rlkey=r9ouwcd7cqxjbvio43q9b3djd&dl=1" width="1024px" />
|
9 |
+
</div>
|
10 |
+
|
11 |
+
|
12 |
+
|
13 |
+
<div style="display:flex;justify-content: center">
|
14 |
+
<a href="https://youngwanlee.github.io/KOALA/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a>  
|
15 |
+
<a href="https://github.com/youngwanLEE/sdxl-koala"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a>  
|
16 |
+
<a href="https://arxiv.org/abs/2312.04005"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:KOALA&color=red&logo=arxiv"></a>  
|
17 |
+
</div>
|
18 |
+
|
19 |
+
|
20 |
+
|
21 |
+
# KOALA-1B Model Card
|
22 |
+
|
23 |
+
|
24 |
+
## Abstract
|
25 |
+
### TL;DR
|
26 |
+
> We propose a fast text-to-image model, called KOALA, by compressing SDXL's U-Net and distilling knowledge from SDXL into our model. KOALA-700M can generate a 1024x1024 image in less than 1.5 seconds on an NVIDIA 4090 GPU, which is more than 2x faster than SDXL.
|
27 |
+
|
28 |
+
<details><summary>FULL abstract</summary>
|
29 |
+
Stable Diffusion is the mainstay of text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature.
|
30 |
+
Recently, Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a lot of attention due to its significant performance improvements with a higher resolution of 1024x1024 and a larger model.
|
31 |
+
However, its increased computation cost and model size require higher-end hardware (e.g., bigger VRAM GPU) for end-users, incurring higher costs of operation.
|
32 |
+
To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL.
|
33 |
+
To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis.
|
34 |
+
Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part.
|
35 |
+
With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B and KOALA-700M, reducing the model size by up to 54% and 69%, respectively, relative to the original SDXL model.
|
36 |
+
In particular, the KOALA-700M is more than twice as fast as SDXL while still retaining a decent generation quality.
|
37 |
+
We hope that due to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.
|
38 |
+
</details>
|
39 |
+
|
40 |
+
<br>
|
41 |
+
|
42 |
+
|
43 |
+
These 1024x1024 samples are generated by KOALA-700M with 25 denoising steps.
|
44 |
+
|
45 |
+
<div align="center">
|
46 |
+
<img src="https://dl.dropboxusercontent.com/scl/fi/rjsqqgfney7be069y2yr7/teaser.png?rlkey=7lq0m90xpjcoqclzl4tieajpo&dl=1" width="1024px" />
|
47 |
+
</div>
|
48 |
+
|
49 |
+
|
50 |
+
## Architecture
|
51 |
+
There are two types of compressed U-Net, KOALA-1B and KOALA-700M, which are obtained by reducing the number of residual blocks and transformer blocks.
|
52 |
+
|
53 |
+
<div align="center">
|
54 |
+
<img src="https://dl.dropboxusercontent.com/scl/fi/5ydeywgiyt1d3njw63dpk/arch.png?rlkey=1p6imbjs4lkmfpcxy153i1a2t&dl=1" width="1024px" />
|
55 |
+
</div>
|
56 |
+
|
57 |
+
### U-Net comparison
|
58 |
+
|
59 |
+
| U-Net | SDM-v2.0 | SDXL-Base-1.0 | KOALA-1B | KOALA-700M |
|-------|:----------:|:-----------:|:-----------:|:-------------:|
| Param. | 865M | 2,567M | 1,161M | 782M |
| CKPT size | 3.46GB | 10.3GB | 4.4GB | 3.0GB |
| Tx blocks | [1, 1, 1, 1] | [0, 2, 10] | [0, 2, 6] | [0, 2, 5] |
| Mid block | ✓ | ✓ | ✓ | ✗ |
| Latency | 1.131s | 3.133s | 1.604s | 1.257s |
|
66 |
+
|
67 |
+
- Tx means transformer block and CKPT means the trained checkpoint file.
|
68 |
+
- We measured latency with FP16 precision and 25 denoising steps on an NVIDIA 4090 GPU (24GB).
|
69 |
+
- SDM-v2.0 uses 768x768 resolution, while the SDXL and KOALA models use 1024x1024 resolution.
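
The ratios implied by the table can be checked with a few lines of Python. This is only a sanity-check sketch: the dictionary and function names are illustrative and not part of the KOALA code base, and the numbers are copied from the table. SDM-v2.0 is left out because it was measured at a different resolution.

```python
# U-Net parameter counts (millions) and latencies (seconds), copied from the table.
UNET_STATS = {
    "SDXL-Base-1.0": {"params_m": 2567, "latency_s": 3.133},
    "KOALA-1B": {"params_m": 1161, "latency_s": 1.604},
    "KOALA-700M": {"params_m": 782, "latency_s": 1.257},
}


def compare_to_sdxl(name):
    """Return (parameter reduction in percent, latency speedup) relative to SDXL-Base-1.0."""
    base = UNET_STATS["SDXL-Base-1.0"]
    model = UNET_STATS[name]
    reduction_pct = 100.0 * (1.0 - model["params_m"] / base["params_m"])
    speedup = base["latency_s"] / model["latency_s"]
    return reduction_pct, speedup


for name in ("KOALA-1B", "KOALA-700M"):
    reduction, speedup = compare_to_sdxl(name)
    print(f"{name}: {reduction:.1f}% fewer U-Net parameters, {speedup:.2f}x faster")
# KOALA-1B: 54.8% fewer U-Net parameters, 1.95x faster
# KOALA-700M: 69.5% fewer U-Net parameters, 2.49x faster
```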
|
70 |
+
|
71 |
+
|
72 |
+
## Latency and memory usage comparison on different GPUs
|
73 |
+
|
74 |
+
We measure the inference time of SDM-v2.0 at 768x768 resolution and of the other models at 1024x1024 on a variety of consumer-grade GPUs: NVIDIA 3060Ti (8GB), 2080Ti (11GB), and 4090 (24GB). We use 25 denoising steps and FP16/FP32 precision. OOM means out-of-memory. Note that SDXL-Base cannot run on the 8GB GPU.
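
This card does not describe the timing harness behind these measurements. As a rough sketch of how such a latency number could be collected, the helper below averages wall-clock time over several runs after a warmup. The function name, the `warmup`/`runs` defaults, and the optional `sync` hook are assumptions for illustration, not the authors' protocol.

```python
import statistics
import time


def measure_latency(fn, warmup=1, runs=5, sync=None):
    """Average wall-clock seconds per call of `fn`, after `warmup` untimed calls.

    `sync` is an optional callable invoked before and after each timed call.
    On a GPU it would typically be `torch.cuda.synchronize`, so that
    asynchronous kernels are finished before the clock is read.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Example call (requires a loaded Diffusers pipeline `pipe` and a CUDA GPU):
# latency = measure_latency(
#     lambda: pipe(prompt="a photo of a koala", num_inference_steps=25),
#     sync=torch.cuda.synchronize,
# )
```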
|
75 |
+
|
76 |
+
|
77 |
+
<div align="center">
|
78 |
+
<img src="https://dl.dropboxusercontent.com/scl/fi/u1az20y0zfww1l5lhbcyd/latency_gpu.svg?rlkey=vjn3gpkmywmp7jpilar4km7sd&dl=1" width="1024px" />
|
79 |
+
</div>
|
80 |
+
|
81 |
+
|
82 |
+
|
83 |
+
|
84 |
+
|
85 |
+
## Key Features
|
86 |
+
- **Efficient U-Net Architecture**: KOALA models use a simplified U-Net architecture that reduces the model size by up to 54% (KOALA-1B) and 69% (KOALA-700M) compared to their predecessor, Stable Diffusion XL (SDXL).
|
87 |
+
- **Self-Attention-Based Knowledge Distillation**: The core technique in KOALA focuses on the distillation of self-attention features, which proves crucial for maintaining image generation quality.
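
To make the idea of feature-level distillation concrete, here is a minimal sketch of a combined loss in which the student is pulled toward the teacher's output and toward the teacher's self-attention features. The card only states that self-attention features are distilled. The mean-squared-error form, the output-matching term, and the weights `w_out` and `w_feat` are assumptions for illustration, not the published training objective.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(task_loss, teacher_out, student_out,
                      teacher_feats, student_feats, w_out=1.0, w_feat=1.0):
    """Task loss plus output-matching and feature-matching terms (hypothetical weighting).

    `teacher_feats` and `student_feats` are lists of flat vectors, one per
    self-attention layer chosen for distillation.
    """
    out_loss = mse(teacher_out, student_out)
    feat_loss = sum(mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return task_loss + w_out * out_loss + w_feat * feat_loss


# Toy example with two "layers" of two-dimensional features.
loss = distillation_loss(
    task_loss=0.5,
    teacher_out=[1.0, 1.0],
    student_out=[1.0, 3.0],
    teacher_feats=[[1.0, 2.0], [0.0, 0.0]],
    student_feats=[[1.0, 0.0], [0.0, 2.0]],
)
print(loss)  # 6.5
```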
|
88 |
+
|
89 |
+
|
90 |
+
|
91 |
+
## Model Description
|
92 |
+
|
93 |
+
- Developed by [ETRI Visual Intelligence Lab](https://huggingface.co/etri-vilab)
|
94 |
+
- Developers: [Youngwan Lee](https://youngwanlee.github.io/), [Kwanyong Park](https://pkyong95.github.io/), [Yoorhim Cho](https://ofzlo.github.io/), [Yong-Ju Lee](https://scholar.google.com/citations?user=6goOQh8AAAAJ&hl=en), [Sung Ju Hwang](http://www.sungjuhwang.com/)
|
95 |
+
- Model Description: A latent-diffusion-based text-to-image generative model. KOALA models use the same text encoders as [SDXL-Base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and replace only the denoising U-Net with a compressed U-Net.
|
96 |
+
- Training data: [LAION-aesthetics-V2 6+](https://laion.ai/blog/laion-aesthetics/)
|
97 |
+
- Resources for more information: Check out [KOALA report on arXiv](https://arxiv.org/abs/2312.04005) and [project page](https://youngwanlee.github.io/KOALA/).
|
98 |
+
|
99 |
+
|
100 |
+
|
101 |
+
|
102 |
+
## Usage with 🤗[Diffusers library](https://github.com/huggingface/diffusers)
|
103 |
+
Inference code with 25 denoising steps:
|
104 |
+
```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the KOALA checkpoint in half precision and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained("etri-vilab/koala-700m", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A portrait painting of a Golden Retriever like Leonardo da Vinci"
negative = "worst quality, low quality, illustration, low resolution"
# Run 25 denoising steps and take the first generated image.
image = pipe(prompt=prompt, negative_prompt=negative, num_inference_steps=25).images[0]
```
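
A slightly more explicit variant of the call above can pin the random seed and save the result to disk. This is a sketch rather than an official recipe: the `generate` helper, its defaults, and the output file name are illustrative, while `num_inference_steps` and `generator` are standard arguments of the Diffusers pipeline call.

```python
def generate(prompt, negative_prompt="", steps=25, seed=0,
             model_id="etri-vilab/koala-700m"):
    """Generate one image with a fixed seed (hypothetical helper, not part of KOALA)."""
    # Heavy imports stay inside the function so that defining it needs no GPU stack.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    # A seeded generator makes repeated runs reproducible on the same setup.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=steps,
        generator=generator,
    )
    return result.images[0]


# Example call (requires a CUDA GPU and downloads the checkpoint):
# image = generate(
#     "A portrait painting of a Golden Retriever",
#     negative_prompt="worst quality, low quality, illustration, low resolution",
# )
# image.save("koala_sample.png")
```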
|
115 |
+
|
116 |
+
|
117 |
+
|
118 |
+
## Uses
|
119 |
+
### Direct Use
|
120 |
+
The model is intended for research purposes only. Possible research areas and tasks include:
|
121 |
+
|
122 |
+
- Generation of artworks and use in design and other artistic processes.
|
123 |
+
- Applications in educational or creative tools.
|
124 |
+
- Research on generative models.
|
125 |
+
- Safe deployment of models which have the potential to generate harmful content.
|
126 |
+
- Probing and understanding the limitations and biases of generative models.
|
127 |
+
- Excluded uses are described below.
|
128 |
+
|
129 |
+
### Out-of-Scope Use
|
130 |
+
|
131 |
+
The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
|
132 |
+
|
133 |
+
|
134 |
+
## Limitations and Bias
|
135 |
+
- Text Rendering: The models face challenges in rendering long, legible text within images.
|
136 |
+
- Complex Prompts: KOALA sometimes struggles with complex prompts involving multiple attributes.
|
137 |
+
- Dataset Dependencies: The current limitations are partially attributed to the characteristics of the training dataset (LAION-aesthetics-V2 6+).
|
138 |
+
|
139 |
+
|
140 |
+
|
141 |
+
## Citation
|
142 |
+
```bibtex
@misc{Lee@koala,
  title={KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis},
  author={Youngwan Lee and Kwanyong Park and Yoorhim Cho and Yong-Ju Lee and Sung Ju Hwang},
  year={2023},
  eprint={2312.04005},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
|