Update README.md
UniDiffuser is a unified diffusion framework to fit all distributions relevant to a set of multi-modal data in one transformer.
UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead.
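The task-by-timestep mechanism above can be made concrete with a minimal sketch, assuming the convention from the UniDiffuser paper: the shared network takes one timestep per modality, a clean conditioning modality gets t = 0, and a marginalized-out modality gets the maximum timestep T. The constant `T`, the function name, and the task labels are illustrative assumptions, not part of any real API.

```python
T = 1000  # assumed maximum diffusion timestep

def select_timesteps(task: str, t: int) -> tuple[int, int]:
    """Return (t_image, t_text) for one denoising step at time t."""
    if task == "joint":          # image-text pair generation: both modalities noisy
        return t, t
    if task == "text_to_image":  # text is a clean condition
        return t, 0
    if task == "image_to_text":  # image is a clean condition
        return 0, t
    if task == "image":          # marginal image generation: text fully noised
        return t, T
    if task == "text":           # marginal text generation: image fully noised
        return T, t
    raise ValueError(f"unknown task: {task}")
```

Because only the timestep pair changes between tasks, no extra parameters or forward passes are needed, which is what "without additional overhead" refers to.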
Specifically, UniDiffuser employs a transformer variant called [U-ViT](https://github.com/baofff/U-ViT), which parameterizes the joint noise prediction network. The other components act as encoders and decoders for the different modalities: a pretrained image autoencoder from [Stable Diffusion](https://github.com/CompVis/stable-diffusion), a pretrained [image ViT-B/32 CLIP encoder](https://github.com/openai/CLIP), a pretrained [text ViT-L CLIP encoder](https://huggingface.co/openai/clip-vit-large-patch14), and a [GPT-2](https://github.com/openai/gpt-2) text decoder that we fine-tuned ourselves.
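To show how these components fit together, here is a hedged sketch of one text-to-image denoising step; every shape, dimension, and function name is an illustrative stand-in (the latent shape and embedding widths match the cited pretrained models, but nothing here is the actual UniDiffuser code).

```python
import numpy as np

LATENT_SHAPE = (4, 64, 64)  # Stable Diffusion autoencoder latent (assumed resolution)
CLIP_IMG_DIM = 512          # ViT-B/32 CLIP image embedding width
CLIP_TXT_DIM = 768          # ViT-L CLIP text embedding width (per token)

def encode_image(image: np.ndarray):
    """Stand-in for the SD autoencoder + CLIP image encoder."""
    return np.zeros(LATENT_SHAPE), np.zeros(CLIP_IMG_DIM)

def encode_text(text: str) -> np.ndarray:
    """Stand-in for the CLIP text encoder (77 tokens)."""
    return np.zeros((77, CLIP_TXT_DIM))

def u_vit(img_latent, clip_img, text_emb, t_img, t_text):
    """Stand-in for the joint noise-prediction U-ViT: one network returns
    noise estimates shaped like its two modality inputs."""
    return np.zeros_like(img_latent), np.zeros_like(text_emb)

# One text-to-image step: the text embedding is a clean condition (t_text = 0),
# the image latent is denoised at t_img.
latent, clip_img = encode_image(np.zeros((512, 512, 3)))
text_emb = encode_text("a photo of a corgi")
eps_img, eps_text = u_vit(latent, clip_img, text_emb, t_img=500, t_text=0)
```

The design point is that only the central U-ViT is trained jointly; the modality encoders and decoders are reused pretrained models, so the joint network operates entirely in their embedding spaces.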
We provide two versions of UniDiffuser: