Delete wuerstchen
- wuerstchen/.gitattributes +0 -35
- wuerstchen/README.md +0 -90
- wuerstchen/decoder/config.json +0 -43
- wuerstchen/decoder/diffusion_pytorch_model.bin +0 -3
- wuerstchen/decoder/diffusion_pytorch_model.safetensors +0 -3
- wuerstchen/model_index.json +0 -25
- wuerstchen/scheduler/scheduler_config.json +0 -6
- wuerstchen/text_encoder/config.json +0 -25
- wuerstchen/text_encoder/model.safetensors +0 -3
- wuerstchen/text_encoder/pytorch_model.bin +0 -3
- wuerstchen/tokenizer/merges.txt +0 -0
- wuerstchen/tokenizer/special_tokens_map.json +0 -24
- wuerstchen/tokenizer/tokenizer.json +0 -0
- wuerstchen/tokenizer/tokenizer_config.json +0 -33
- wuerstchen/tokenizer/vocab.json +0 -0
- wuerstchen/vqgan/config.json +0 -13
- wuerstchen/vqgan/diffusion_pytorch_model.bin +0 -3
- wuerstchen/vqgan/diffusion_pytorch_model.safetensors +0 -3
wuerstchen/.gitattributes
DELETED
@@ -1,35 +0,0 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
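The deleted `.gitattributes` routed files to Git LFS by filename pattern. As a rough illustration of which of the files removed in this commit those rules would have covered, here is a minimal Python sketch. It is an approximation, not git's own matcher: it only handles slash-free patterns (which git matches against the basename at any depth), the `is_lfs_tracked` helper is a hypothetical name, and only a subset of the patterns is included.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the slash-free patterns from the deleted .gitattributes.
# Patterns containing "/" (such as saved_model/**/*) follow different
# rules in git and are deliberately left out of this sketch.
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.pt", "*.pth", "*.safetensors", "*tfevents*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate git's handling of slash-free patterns: match the basename."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


for candidate in (
    "wuerstchen/decoder/diffusion_pytorch_model.bin",
    "wuerstchen/vqgan/diffusion_pytorch_model.safetensors",
    "wuerstchen/decoder/config.json",
):
    print(candidate, is_lfs_tracked(candidate))
```

For authoritative results on a real checkout, `git check-attr filter -- <path>` reports the attribute git actually applies.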
wuerstchen/README.md
DELETED
@@ -1,90 +0,0 @@
----
-license: mit
-prior:
-- warp-diffusion/wuerstchen-prior
-tags:
-- text-to-image
-- wuerstchen
----
-
-<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
-
-## Würstchen - Overview
-Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
-computational costs for both training and inference by magnitudes. Training on 1024x1024 images, is way more expensive than training at 32x32. Usually, other works make
-use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial
-compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a
-two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
-A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, allowing
-also cheaper and faster inference.
-
-## Würstchen - Decoder
-The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into the pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN Space, and Stage A (which is a VQGAN)
-decodes the latents into pixel space. Together, they achieve a spatial compression of 42.
-
-**Note:** The reconstruction is lossy and loses information of the image. The current Stage B often lacks details in the reconstructions, which are especially noticeable to
-us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!
-
-### Image Sizes
-Würstchen was trained on image resolutions between 1024x1024 & 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out.
-We also observed that the Prior (Stage C) adapts extremely fast to new resolutions. So finetuning it at 2048x2048 should be computationally cheap.
-<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg" width=1000>
-
-## How to run
-This pipeline should be run together with a prior https://huggingface.co/warp-ai/wuerstchen-prior:
-
-```py
-import torch
-from diffusers import AutoPipelineForText2Image
-
-device = "cuda"
-dtype = torch.float16
-
-pipeline = AutoPipelineForText2Image.from_pretrained(
-    "warp-diffusion/wuerstchen", torch_dtype=dtype
-).to(device)
-
-caption = "Anthropomorphic cat dressed as a fire fighter"
-
-output = pipeline(
-    prompt=caption,
-    height=1024,
-    width=1024,
-    prior_guidance_scale=4.0,
-    decoder_guidance_scale=0.0,
-).images
-```
-
-### Image Sampling Times
-The figure shows the inference times (on an A100) for different batch sizes (`num_images_per_prompt`) on Würstchen compared to [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without refiner).
-The left figure shows inference times (using torch > 2.0), whereas the right figure applies `torch.compile` to both pipelines in advance.
-![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)
-
-## Model Details
-- **Developed by:** Pablo Pernias, Dominic Rampas
-- **Model type:** Diffusion-based text-to-image generation model
-- **Language(s):** English
-- **License:** MIT
-- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
-- **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
-- **Cite as:**
-
-@misc{pernias2023wuerstchen,
-title={Wuerstchen: Efficient Pretraining of Text-to-Image Models},
-author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
-year={2023},
-eprint={2306.00637},
-archivePrefix={arXiv},
-primaryClass={cs.CV}
-}
-
-## Environmental Impact
-
-**Würstchen v2** **Estimated Emissions**
-Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.
-
-- **Hardware Type:** A100 PCIe 40GB
-- **Hours used:** 24602
-- **Cloud Provider:** AWS
-- **Compute Region:** US-east
-- **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.
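The deleted model card gives its emissions figure as "Power consumption x Time x Carbon produced based on location of power grid" but lists only the hours. The sketch below shows that formula as plain arithmetic. The power draw (250 W per GPU) and the grid intensity (0.37 kg CO2 eq. per kWh) are assumptions that do not appear in the card. They were picked because, together with the stated 24602 hours, they land close to the reported 2275.68 kg.

```python
def estimate_emissions_kg(hours: float, power_kw: float, intensity_kg_per_kwh: float) -> float:
    """Power consumption x time x carbon intensity of the local grid."""
    return power_kw * hours * intensity_kg_per_kwh


HOURS_USED = 24602         # stated in the model card
ASSUMED_POWER_KW = 0.250   # assumption: per-GPU draw, not stated in the card
ASSUMED_INTENSITY = 0.37   # assumption: kg CO2 eq. per kWh, not stated in the card

estimate = estimate_emissions_kg(HOURS_USED, ASSUMED_POWER_KW, ASSUMED_INTENSITY)
print(f"Estimated emissions: {estimate:.1f} kg CO2 eq.")
```

Any other pair of power and intensity values with the same product would give the same total, so the match does not confirm either assumption on its own.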
wuerstchen/decoder/config.json
DELETED
@@ -1,43 +0,0 @@
-{
-  "_class_name": "WuerstchenDiffNeXt",
-  "_diffusers_version": "0.21.0.dev0",
-  "blocks": [
-    4,
-    4,
-    14,
-    4
-  ],
-  "c_cond": 1024,
-  "c_hidden": [
-    320,
-    640,
-    1280,
-    1280
-  ],
-  "c_in": 4,
-  "c_out": 4,
-  "c_r": 64,
-  "clip_embd": 1024,
-  "dropout": 0.1,
-  "effnet_embd": 16,
-  "inject_effnet": [
-    false,
-    true,
-    true,
-    true
-  ],
-  "kernel_size": 3,
-  "level_config": [
-    "CT",
-    "CTA",
-    "CTA",
-    "CTA"
-  ],
-  "nhead": [
-    -1,
-    10,
-    20,
-    20
-  ],
-  "patch_size": 2
-}
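The decoder config above describes four levels through parallel lists (`blocks`, `c_hidden`, `inject_effnet`, `level_config`, `nhead`). A small sketch like the following can check that those lists line up and derive a per-level attention head width. Two things here are assumptions rather than facts from the diff: that `nhead: -1` means "no attention at this level" (which is at least consistent with the first level being `"CT"` while the others are `"CTA"`), and the helper name `summarize_levels`, which is hypothetical.

```python
import json

# Per-level fields copied from the deleted decoder/config.json.
DECODER_CONFIG = json.loads("""
{
  "blocks": [4, 4, 14, 4],
  "c_hidden": [320, 640, 1280, 1280],
  "inject_effnet": [false, true, true, true],
  "level_config": ["CT", "CTA", "CTA", "CTA"],
  "nhead": [-1, 10, 20, 20]
}
""")

PER_LEVEL_KEYS = ["blocks", "c_hidden", "inject_effnet", "level_config", "nhead"]


def summarize_levels(config: dict) -> list:
    """Zip the per-level lists into one record per level, checking lengths first."""
    lengths = {len(config[key]) for key in PER_LEVEL_KEYS}
    if len(lengths) != 1:
        raise ValueError(f"per-level lists disagree on length: {sorted(lengths)}")
    levels = []
    for i in range(lengths.pop()):
        heads = config["nhead"][i]
        levels.append({
            "blocks": config["blocks"][i],
            "width": config["c_hidden"][i],
            "layers": config["level_config"][i],
            "inject_effnet": config["inject_effnet"][i],
            # Assumption: a non-positive head count means no attention here.
            "head_dim": config["c_hidden"][i] // heads if heads > 0 else None,
        })
    return levels


for level in summarize_levels(DECODER_CONFIG):
    print(level)
```

Under that reading, every level with attention ends up with the same per-head width, since 640 / 10 and 1280 / 20 are both 64.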
wuerstchen/decoder/diffusion_pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
-size 4221843094
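The three deleted lines above are a Git LFS pointer, not the weights themselves: a spec version, a content hash, and the byte size of the real file. As a minimal sketch of reading such a pointer, the Python below splits each line into a key and a value. The function name `parse_lfs_pointer` and the returned dictionary layout are illustrative choices, not part of any library.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
size 4221843094
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size_bytes"] / 2**30)  # size in GiB
```

For this pointer the size works out to roughly 3.9 GiB, which is why the repository stores a pointer in git and keeps the payload in LFS.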
wuerstchen/decoder/diffusion_pytorch_model.safetensors
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:1510c2cc1a891df02d61d79866c40c506e9099519829e0282c2a79d7e9c7e66f
-size 4221568336
wuerstchen/model_index.json
DELETED
@@ -1,25 +0,0 @@
-{
-  "_class_name": "WuerstchenDecoderPipeline",
-  "_diffusers_version": "0.21.0.dev0",
-  "decoder": [
-    "wuerstchen",
-    "WuerstchenDiffNeXt"
-  ],
-  "latent_dim_scale": 10.67,
-  "scheduler": [
-    "diffusers",
-    "DDPMWuerstchenScheduler"
-  ],
-  "text_encoder": [
-    "transformers",
-    "CLIPTextModel"
-  ],
-  "tokenizer": [
-    "transformers",
-    "CLIPTokenizerFast"
-  ],
-  "vqgan": [
-    "wuerstchen",
-    "PaellaVQModel"
-  ]
-}
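The `latent_dim_scale` of 10.67 in this pipeline index can be related to the roughly 42x spatial compression described in the model card. The sketch below walks a target image size through the three stages. It rests on assumptions that this diff does not state: that the Stage A VQGAN changes resolution by a factor of 4 overall, that the prior latent is sized by dividing by 42.67 and rounding up, and that the decoder latent is the prior latent times `latent_dim_scale`, truncated. The function name `latent_sizes` is hypothetical.

```python
import math

LATENT_DIM_SCALE = 10.67     # from model_index.json in this diff
ASSUMED_VQGAN_FACTOR = 4     # assumption: overall Stage A up/down-sampling factor
ASSUMED_PRIOR_FACTOR = 42.67 # assumption: pixels per prior-latent cell, per side


def latent_sizes(pixels: int) -> dict:
    """Follow one image side length through Stage C, Stage B and Stage A."""
    prior_latent = math.ceil(pixels / ASSUMED_PRIOR_FACTOR)   # Stage C latent
    decoder_latent = int(prior_latent * LATENT_DIM_SCALE)     # Stage B output
    decoded_pixels = decoder_latent * ASSUMED_VQGAN_FACTOR    # Stage A output
    return {
        "prior_latent": prior_latent,
        "decoder_latent": decoder_latent,
        "decoded_pixels": decoded_pixels,
    }


for side in (1024, 1536):
    print(side, latent_sizes(side))
```

Under these assumptions the two scale factors multiply to about 42.7 (10.67 x 4), and both training resolutions named in the model card, 1024 and 1536, map back to themselves after the round trip.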
wuerstchen/scheduler/scheduler_config.json
DELETED
@@ -1,6 +0,0 @@
-{
-  "_class_name": "DDPMWuerstchenScheduler",
-  "_diffusers_version": "0.21.0.dev0",
-  "s": 0.008,
-  "scaler": 1.0
-}
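The scheduler config carries only two numbers, `s` and `scaler`. The value `s = 0.008` is the offset commonly used in the cosine noise schedule, so the sketch below assumes that this is the role it plays here. That is an assumption about `DDPMWuerstchenScheduler`, not something the diff confirms. The `scaler` parameter is not modelled at all, on the further assumption that a value of 1.0 leaves the timestep unchanged.

```python
import math

S = 0.008  # "s" from scheduler_config.json in this diff


def alpha_cumprod(t: float, s: float = S) -> float:
    """Cosine schedule: cumulative signal fraction at continuous time t in [0, 1].

    Assumption: this is the standard cosine schedule, used for illustration;
    the scheduler's actual implementation may clamp or warp t.
    """
    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0.0)


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  alpha_cumprod={alpha_cumprod(t):.4f}")
```

The small offset `s` keeps the schedule from changing too abruptly near `t = 0`, while the normalisation by `f(0)` makes the curve start at exactly 1.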
wuerstchen/text_encoder/config.json
DELETED
@@ -1,25 +0,0 @@
-{
-  "_name_or_path": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
-  "architectures": [
-    "CLIPTextModel"
-  ],
-  "attention_dropout": 0.0,
-  "bos_token_id": 0,
-  "dropout": 0.0,
-  "eos_token_id": 2,
-  "hidden_act": "gelu",
-  "hidden_size": 1024,
-  "initializer_factor": 1.0,
-  "initializer_range": 0.02,
-  "intermediate_size": 4096,
-  "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 77,
-  "model_type": "clip_text_model",
-  "num_attention_heads": 16,
-  "num_hidden_layers": 24,
-  "pad_token_id": 1,
-  "projection_dim": 1024,
-  "torch_dtype": "float32",
-  "transformers_version": "4.33.0.dev0",
-  "vocab_size": 49408
-}
wuerstchen/text_encoder/model.safetensors
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:bd94a7ea6922e8028227567fe14e04d2989eec31c482e0813e9006afea6637f1
-size 1411983168
wuerstchen/text_encoder/pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:0483b11b48b0f5a5079f778c0df4057d7b797cf58ef176087ec03a236d3e16e0
-size 1412064410
wuerstchen/tokenizer/merges.txt
DELETED
The diff for this file is too large to render.
wuerstchen/tokenizer/special_tokens_map.json
DELETED
@@ -1,24 +0,0 @@
-{
-  "bos_token": {
-    "content": "<|startoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "<|endoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": "<|endoftext|>",
-  "unk_token": {
-    "content": "<|endoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  }
-}
wuerstchen/tokenizer/tokenizer.json
DELETED
The diff for this file is too large to render.
wuerstchen/tokenizer/tokenizer_config.json
DELETED
@@ -1,33 +0,0 @@
-{
-  "add_prefix_space": false,
-  "bos_token": {
-    "__type": "AddedToken",
-    "content": "<|startoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "clean_up_tokenization_spaces": true,
-  "do_lower_case": true,
-  "eos_token": {
-    "__type": "AddedToken",
-    "content": "<|endoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "errors": "replace",
-  "model_max_length": 77,
-  "pad_token": "<|endoftext|>",
-  "tokenizer_class": "CLIPTokenizer",
-  "unk_token": {
-    "__type": "AddedToken",
-    "content": "<|endoftext|>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  }
-}
wuerstchen/tokenizer/vocab.json
DELETED
The diff for this file is too large to render.
wuerstchen/vqgan/config.json
DELETED
@@ -1,13 +0,0 @@
-{
-  "_class_name": "PaellaVQModel",
-  "_diffusers_version": "0.21.0.dev0",
-  "bottleneck_blocks": 12,
-  "embed_dim": 384,
-  "in_channels": 3,
-  "latent_channels": 4,
-  "levels": 2,
-  "num_vq_embeddings": 8192,
-  "out_channels": 3,
-  "scale_factor": 0.3764,
-  "up_down_scale_factor": 2
-}
wuerstchen/vqgan/diffusion_pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:f3ab7752b474058d177e8565860367a438b8016ba788954394fbb7f1da16d6e1
-size 73674142
wuerstchen/vqgan/diffusion_pytorch_model.safetensors
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:052db8852c0d8b117e6d2a59ae3e0c7d7aaae3d00f247e392ef8e9837e11d6c4
-size 73639568