svjack committed on
Commit
7730485
1 Parent(s): 65a22d6

Delete wuerstchen

wuerstchen/.gitattributes DELETED
@@ -1,35 +0,0 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
wuerstchen/README.md DELETED
@@ -1,90 +0,0 @@
- ---
- license: mit
- prior:
- - warp-diffusion/wuerstchen-prior
- tags:
- - text-to-image
- - wuerstchen
- ---
-
- <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
-
- ## Würstchen - Overview
- Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
- computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32 images. Usually, other works use
- a relatively modest compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial
- compression. This was previously unseen, as common methods fail to faithfully reconstruct detailed images beyond 16x spatial compression. Würstchen employs a
- two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
- A third model, Stage C, is learned in this highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, which also enables
- cheaper and faster inference.
-
- ## Würstchen - Decoder
- The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN space, and Stage A (which is a VQGAN)
- decodes the latents into pixel space. Together, they achieve a spatial compression of 42x.
-
- **Note:** The reconstruction is lossy and loses information from the image. The current Stage B often lacks details in the reconstructions, which are especially noticeable to
- us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!
-
- ### Image Sizes
- Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out.
- We also observed that the Prior (Stage C) adapts extremely fast to new resolutions, so finetuning it at 2048x2048 should be computationally cheap.
- <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg" width=1000>
-
- ## How to run
- This pipeline should be run together with the prior at https://huggingface.co/warp-ai/wuerstchen-prior:
-
- ```py
- import torch
- from diffusers import AutoPipelineForText2Image
-
- device = "cuda"
- dtype = torch.float16
-
- pipeline = AutoPipelineForText2Image.from_pretrained(
-     "warp-diffusion/wuerstchen", torch_dtype=dtype
- ).to(device)
-
- caption = "Anthropomorphic cat dressed as a fire fighter"
-
- output = pipeline(
-     prompt=caption,
-     height=1024,
-     width=1024,
-     prior_guidance_scale=4.0,
-     decoder_guidance_scale=0.0,
- ).images
- ```
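
> *Editor's note:* the deleted card only showed the combined pipeline above. If you want to mirror the two-stage design described in the Decoder section and run the prior (Stage C) and the decoder (Stages A and B) explicitly, a minimal sketch along the lines of the `diffusers` Würstchen pipelines is given below. The `warp-ai/wuerstchen-prior` and `warp-ai/wuerstchen` checkpoint names and the exact call arguments are assumptions based on the `diffusers` documentation, not part of this repository's card.

```py
import torch
from diffusers import WuerstchenPriorPipeline, WuerstchenDecoderPipeline

device = "cuda"
dtype = torch.float16
caption = "Anthropomorphic cat dressed as a fire fighter"

# Stage C: generate image embeddings in the highly compressed latent space.
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", torch_dtype=dtype  # assumed prior checkpoint
).to(device)
prior_output = prior_pipeline(
    prompt=caption,
    height=1024,
    width=1024,
    guidance_scale=4.0,
)

# Stages A + B: decode those embeddings back into pixel space.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=dtype  # assumed decoder checkpoint
).to(device)
images = decoder_pipeline(
    image_embeddings=prior_output.image_embeddings,
    prompt=caption,
    guidance_scale=0.0,
).images
```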
-
- ### Image Sampling Times
- The figure shows the inference times (on an A100) for different batch sizes (`num_images_per_prompt`) on Würstchen compared to [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without the refiner).
- The left figure shows inference times using PyTorch 2.0, whereas the right figure applies `torch.compile` to both pipelines in advance.
- ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)
-
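
> *Editor's note:* as a hedged sketch of how `torch.compile` could be applied to the combined pipeline for a comparison like the right-hand figure — the component attribute names `prior_prior` and `decoder` are assumptions about the pipeline's modules and are not stated in this card:

```py
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "warp-diffusion/wuerstchen", torch_dtype=torch.float16
).to("cuda")

# Assumption: the combined pipeline exposes its two denoisers as `prior_prior`
# (Stage C) and `decoder` (Stage B); check the attribute names in your diffusers version.
pipeline.prior_prior = torch.compile(pipeline.prior_prior, mode="reduce-overhead", fullgraph=True)
pipeline.decoder = torch.compile(pipeline.decoder, mode="reduce-overhead", fullgraph=True)
```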
- ## Model Details
- - **Developed by:** Pablo Pernias, Dominic Rampas
- - **Model type:** Diffusion-based text-to-image generation model
- - **Language(s):** English
- - **License:** MIT
- - **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
- - **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
- - **Cite as:**
-
-       @misc{pernias2023wuerstchen,
-         title={Wuerstchen: Efficient Pretraining of Text-to-Image Models},
-         author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
-         year={2023},
-         eprint={2306.00637},
-         archivePrefix={arXiv},
-         primaryClass={cs.CV}
-       }
-
- ## Environmental Impact
-
- **Würstchen v2** **Estimated Emissions**
- Based on the information below, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
-
- - **Hardware Type:** A100 PCIe 40GB
- - **Hours used:** 24602
- - **Cloud Provider:** AWS
- - **Compute Region:** US-east
- - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.
 
wuerstchen/decoder/config.json DELETED
@@ -1,43 +0,0 @@
- {
-   "_class_name": "WuerstchenDiffNeXt",
-   "_diffusers_version": "0.21.0.dev0",
-   "blocks": [
-     4,
-     4,
-     14,
-     4
-   ],
-   "c_cond": 1024,
-   "c_hidden": [
-     320,
-     640,
-     1280,
-     1280
-   ],
-   "c_in": 4,
-   "c_out": 4,
-   "c_r": 64,
-   "clip_embd": 1024,
-   "dropout": 0.1,
-   "effnet_embd": 16,
-   "inject_effnet": [
-     false,
-     true,
-     true,
-     true
-   ],
-   "kernel_size": 3,
-   "level_config": [
-     "CT",
-     "CTA",
-     "CTA",
-     "CTA"
-   ],
-   "nhead": [
-     -1,
-     10,
-     20,
-     20
-   ],
-   "patch_size": 2
- }
 
wuerstchen/decoder/diffusion_pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
- size 4221843094
 
wuerstchen/decoder/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1510c2cc1a891df02d61d79866c40c506e9099519829e0282c2a79d7e9c7e66f
- size 4221568336
 
wuerstchen/model_index.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_class_name": "WuerstchenDecoderPipeline",
-   "_diffusers_version": "0.21.0.dev0",
-   "decoder": [
-     "wuerstchen",
-     "WuerstchenDiffNeXt"
-   ],
-   "latent_dim_scale": 10.67,
-   "scheduler": [
-     "diffusers",
-     "DDPMWuerstchenScheduler"
-   ],
-   "text_encoder": [
-     "transformers",
-     "CLIPTextModel"
-   ],
-   "tokenizer": [
-     "transformers",
-     "CLIPTokenizerFast"
-   ],
-   "vqgan": [
-     "wuerstchen",
-     "PaellaVQModel"
-   ]
- }
 
wuerstchen/scheduler/scheduler_config.json DELETED
@@ -1,6 +0,0 @@
- {
-   "_class_name": "DDPMWuerstchenScheduler",
-   "_diffusers_version": "0.21.0.dev0",
-   "s": 0.008,
-   "scaler": 1.0
- }
 
wuerstchen/text_encoder/config.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_name_or_path": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
-   "architectures": [
-     "CLIPTextModel"
-   ],
-   "attention_dropout": 0.0,
-   "bos_token_id": 0,
-   "dropout": 0.0,
-   "eos_token_id": 2,
-   "hidden_act": "gelu",
-   "hidden_size": 1024,
-   "initializer_factor": 1.0,
-   "initializer_range": 0.02,
-   "intermediate_size": 4096,
-   "layer_norm_eps": 1e-05,
-   "max_position_embeddings": 77,
-   "model_type": "clip_text_model",
-   "num_attention_heads": 16,
-   "num_hidden_layers": 24,
-   "pad_token_id": 1,
-   "projection_dim": 1024,
-   "torch_dtype": "float32",
-   "transformers_version": "4.33.0.dev0",
-   "vocab_size": 49408
- }
 
wuerstchen/text_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:bd94a7ea6922e8028227567fe14e04d2989eec31c482e0813e9006afea6637f1
- size 1411983168
 
wuerstchen/text_encoder/pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:0483b11b48b0f5a5079f778c0df4057d7b797cf58ef176087ec03a236d3e16e0
- size 1412064410
 
wuerstchen/tokenizer/merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
wuerstchen/tokenizer/special_tokens_map.json DELETED
@@ -1,24 +0,0 @@
- {
-   "bos_token": {
-     "content": "<|startoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": "<|endoftext|>",
-   "unk_token": {
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   }
- }
 
wuerstchen/tokenizer/tokenizer.json DELETED
The diff for this file is too large to render. See raw diff
 
wuerstchen/tokenizer/tokenizer_config.json DELETED
@@ -1,33 +0,0 @@
- {
-   "add_prefix_space": false,
-   "bos_token": {
-     "__type": "AddedToken",
-     "content": "<|startoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "clean_up_tokenization_spaces": true,
-   "do_lower_case": true,
-   "eos_token": {
-     "__type": "AddedToken",
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   },
-   "errors": "replace",
-   "model_max_length": 77,
-   "pad_token": "<|endoftext|>",
-   "tokenizer_class": "CLIPTokenizer",
-   "unk_token": {
-     "__type": "AddedToken",
-     "content": "<|endoftext|>",
-     "lstrip": false,
-     "normalized": true,
-     "rstrip": false,
-     "single_word": false
-   }
- }
 
wuerstchen/tokenizer/vocab.json DELETED
The diff for this file is too large to render. See raw diff
 
wuerstchen/vqgan/config.json DELETED
@@ -1,13 +0,0 @@
- {
-   "_class_name": "PaellaVQModel",
-   "_diffusers_version": "0.21.0.dev0",
-   "bottleneck_blocks": 12,
-   "embed_dim": 384,
-   "in_channels": 3,
-   "latent_channels": 4,
-   "levels": 2,
-   "num_vq_embeddings": 8192,
-   "out_channels": 3,
-   "scale_factor": 0.3764,
-   "up_down_scale_factor": 2
- }
 
wuerstchen/vqgan/diffusion_pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:f3ab7752b474058d177e8565860367a438b8016ba788954394fbb7f1da16d6e1
- size 73674142
 
wuerstchen/vqgan/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:052db8852c0d8b117e6d2a59ae3e0c7d7aaae3d00f247e392ef8e9837e11d6c4
- size 73639568