multimodalart HF staff

dn6 HF staff committed on Mar 14

Commit 7ca32c2 • 1 Parent(s): 621fc2d

Update diffusers weights (#2)

- update diffusers weights (b3654c7d10ba81353f14532903c13a2f6589ff90)
- update model card (59b5b00272cf90867415ae5f238f4afb733f5d5f)
- update model card (03f16203a3632f0ea32f170823429a1835ff57ae)
- add bf16 weights (7944c89871ddb9d7364c98d4456f19e563768f38)
- update (22a78febb95e62e2299f39aadd1599d494323c9c)
- update (d8893f366792c3c1c511edbd4c269d6b1b336938)
- add lite version (3a1aa3824c01a4b53645dd74bfd4e2c5201e068f)
- update README (143869a0d0dc97bfaadddbfce89aa7b8f3cb0d62)
- update (36d61b4bfeb5c96a2742bef9897ea3ba56c3c5e4)
- update (7d224bdf5482443a9706855c72cbe57d36afd7bd)
- update (9ebea02e97b970b846ba2667c22a65824c2dce15)

Co-authored-by: Dhruv Nair <[email protected]>

Files changed (16)

README.md +163 -46
image_encoder/config.json +2 -2
image_encoder/model.bf16.safetensors +3 -0
image_encoder/model.safetensors +2 -2
model_index.json +3 -4
prior/config.json +39 -36
prior/diffusion_pytorch_model.bf16.safetensors +3 -0
prior/diffusion_pytorch_model.safetensors +2 -2
prior_lite/config.json +64 -0
prior_lite/diffusion_pytorch_model.bf16.safetensors +3 -0
prior_lite/diffusion_pytorch_model.safetensors +3 -0
scheduler/scheduler_config.json +1 -1
text_encoder/config.json +2 -2
text_encoder/model.bf16.safetensors +3 -0
text_encoder/model.safetensors +2 -2
tokenizer/tokenizer.json +2 -16

README.md CHANGED

@@ -10,13 +10,13 @@ license_link: LICENSE
 <!-- Provide a quick summary of what the model is/does. -->
 <img src="figures/collage_1.jpg" width="800">
-This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture and its main
-difference to other models like Stable Diffusion is that it is working at a much smaller latent space. Why is this
-important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
-How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
-encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
-1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
-highly compressed latent space. Previous versions of this architecture, achieved a 16x cost reduction over Stable
 Diffusion 1.5. <br> <br>
 Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions
 like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.
@@ -41,65 +41,182 @@ For research purposes, we recommend our `StableCascade` Github repository (https
 ### Model Overview
 Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
 hence the name "Stable Cascade".
-Stage A & B are used to compress images, similar to what the job of the VAE is in Stable Diffusion.
-However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
-spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
-a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
-image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
 for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
 <img src="figures/model-overview.jpg" width="600">
-For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
-a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
-put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
-great results, however the 1.5 billion excels at reconstructing small and fine details. Therefore, you will achieve the
-best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
 its small size.
 ## Evaluation
 <img height="300" src="figures/comparison.png"/>
-According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
-comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
-aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
 steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
 ## Code Example
 ```shell
-#install `diffusers` from this branch while the PR is WIP
-pip install git+https://github.com/kashif/diffusers.git@wuerstchen-v3
 ```
 ```python
 import torch
 from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
-device = "cuda"
-dtype = torch.bfloat16
-num_images_per_prompt = 2
-prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", torch_dtype=dtype).to(device)
-decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade",  torch_dtype=dtype).to(device)
-prompt = "Anthropomorphic cat dressed as a pilot"
 negative_prompt = ""
-with torch.cuda.amp.autocast(dtype=dtype):
-    prior_output = prior(
-        prompt=prompt,
-        height=1024,
-        width=1024,
-        negative_prompt=negative_prompt,
-        guidance_scale=4.0,
-        num_images_per_prompt=num_images_per_prompt,
-    )
-    decoder_output = decoder(
-        image_embeddings=prior_output.image_embeddings,
-        prompt=prompt,
-        negative_prompt=negative_prompt,
-        guidance_scale=0.0,
-        output_type="pil",
-    ).images
 ```
 ## Uses
@@ -118,7 +235,7 @@ Excluded uses are described below.
 ### Out-of-Scope Use
-The model was not trained to be factual or true representations of people or events,
 and therefore using the model to generate such content is out-of-scope for the abilities of this model.
 The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
@@ -135,4 +252,4 @@ The model is intended for research purposes only.
 ## How to Get Started with the Model
-Check out https://github.com/Stability-AI/StableCascade

 <!-- Provide a quick summary of what the model is/does. -->
 <img src="figures/collage_1.jpg" width="800">
+This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture and its main
+difference to other models like Stable Diffusion is that it is working at a much smaller latent space. Why is this
+important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes.
+How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being
+encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a
+1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the
+highly compressed latent space. Previous versions of this architecture, achieved a 16x cost reduction over Stable
 Diffusion 1.5. <br> <br>
 Therefore, this kind of model is well suited for usages where efficiency is important. Furthermore, all known extensions
 like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.
 ### Model Overview
 Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images,
 hence the name "Stable Cascade".
+Stage A & B are used to compress images, similar to what the job of the VAE is in Stable Diffusion.
+However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a
+spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves
+a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the
+image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible
 for generating the small 24 x 24 latents given a text prompt. The following picture shows this visually.
 <img src="figures/model-overview.jpg" width="600">
+For this release, we are providing two checkpoints for Stage C, two for Stage B and one for Stage A. Stage C comes with
+a 1 billion and 3.6 billion parameter version, but we highly recommend using the 3.6 billion version, as most work was
+put into its finetuning. The two versions for Stage B amount to 700 million and 1.5 billion parameters. Both achieve
+great results, however the 1.5 billion excels at reconstructing small and fine details. Therefore, you will achieve the
+best results if you use the larger variant of each. Lastly, Stage A contains 20 million parameters and is fixed due to
 its small size.
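As a rough check of the compression figures quoted in this section, the latent side length can be derived from the image size. This is a sketch only: the value 42.67 is the `resolution_multiple` recorded in this repository's `model_index.json`, while rounding up with `math.ceil` and the helper names are assumptions made for illustration, not a documented API.

```python
import math

# Value taken from model_index.json in this repository.
RESOLUTION_MULTIPLE = 42.67
# Spatial compression factor the README attributes to Stable Diffusion.
SD_COMPRESSION_FACTOR = 8


def cascade_latent_side(pixels: int, resolution_multiple: float = RESOLUTION_MULTIPLE) -> int:
    """Latent side length for Stage C, assuming the result is rounded up."""
    return math.ceil(pixels / resolution_multiple)


def sd_latent_side(pixels: int, factor: int = SD_COMPRESSION_FACTOR) -> int:
    """Latent side length for a VAE with an integer compression factor."""
    return pixels // factor


cascade_side = cascade_latent_side(1024)
sd_side = sd_latent_side(1024)
# Ratio of latent positions per image between the two approaches.
position_ratio = (sd_side * sd_side) / (cascade_side * cascade_side)
```

Under these assumptions a 1024 x 1024 image maps to a 24 x 24 latent for Stable Cascade and a 128 x 128 latent for Stable Diffusion, matching the numbers in the text.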
 ## Evaluation
 <img height="300" src="figures/comparison.png"/>
+According to our evaluation, Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all
+comparisons. The above picture shows the results from a human evaluation using a mix of parti-prompts (link) and
+aesthetic prompts. Specifically, Stable Cascade (30 inference steps) was compared against Playground v2 (50 inference
 steps), SDXL (50 inference steps), SDXL Turbo (1 inference step) and Würstchen v2 (30 inference steps).
 ## Code Example
+**Note:** In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline` you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the StableCascadeDecoderPipeline internally.
+If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the torch.float16 data type. You can download the full precision or bf16 variant weights for the pipeline and cast the weights to torch.float16.
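The version constraint in the note above can be checked before a data type is chosen. The following is a minimal sketch that works on a version string such as the one reported by `torch.__version__`; the helper names are hypothetical and are not part of `diffusers` or PyTorch.

```python
import re


def supports_bf16_decoder(torch_version: str) -> bool:
    """Return True if the version string is 2.2 or newer.

    Only the leading "major.minor" part is inspected, so local build
    suffixes such as "+cu121" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 2)


def pick_decoder_dtype_name(torch_version: str) -> str:
    """Name of the dtype to request for the decoder pipeline."""
    return "bfloat16" if supports_bf16_decoder(torch_version) else "float16"
```

A caller could then pass `getattr(torch, pick_decoder_dtype_name(torch.__version__))` as `torch_dtype`, falling back to `torch.float16` on older installations as the note describes.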
 ```shell
+pip install diffusers
 ```
 ```python
 import torch
 from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+negative_prompt = ""
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16)
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=1,
+    num_inference_steps=20
+)
+decoder.enable_model_cpu_offload()
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings.to(torch.float16),
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+    num_inference_steps=10
+).images[0]
+decoder_output.save("cascade.png")
+```
+### Using the Lite Version of the Stage B and Stage C models
+```python
+import torch
+from diffusers import (
+    StableCascadeDecoderPipeline,
+    StableCascadePriorPipeline,
+    StableCascadeUNet,
+)
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+negative_prompt = ""
+prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
+decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=1,
+    num_inference_steps=20
+)
+decoder.enable_model_cpu_offload()
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings,
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+    num_inference_steps=10
+).images[0]
+decoder_output.save("cascade.png")
+```
+### Loading original checkpoints with `from_single_file`
+Loading the original format checkpoints is supported via `from_single_file` method in the StableCascadeUNet.
+```python
+import torch
+from diffusers import (
+    StableCascadeDecoderPipeline,
+    StableCascadePriorPipeline,
+    StableCascadeUNet,
+)
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
 negative_prompt = ""
+prior_unet = StableCascadeUNet.from_single_file(
+    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
+    torch_dtype=torch.bfloat16
+)
+decoder_unet = StableCascadeUNet.from_single_file(
+    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
+    torch_dtype=torch.bfloat16
+)
+prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16)
+decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16)
+prior.enable_model_cpu_offload()
+prior_output = prior(
+    prompt=prompt,
+    height=1024,
+    width=1024,
+    negative_prompt=negative_prompt,
+    guidance_scale=4.0,
+    num_images_per_prompt=1,
+    num_inference_steps=20
+)
+decoder.enable_model_cpu_offload()
+decoder_output = decoder(
+    image_embeddings=prior_output.image_embeddings,
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    guidance_scale=0.0,
+    output_type="pil",
+    num_inference_steps=10
+).images[0]
+decoder_output.save("cascade-single-file.png")
+```
+### Using the `StableCascadeCombinedPipeline`
+```python
+import torch
+from diffusers import StableCascadeCombinedPipeline
+pipe = StableCascadeCombinedPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16)
+prompt = "an image of a shiba inu, donning a spacesuit and helmet"
+output = pipe(
+    prompt=prompt,
+    negative_prompt="",
+    num_inference_steps=10,
+    prior_num_inference_steps=20,
+    prior_guidance_scale=3.0,
+    width=1024,
+    height=1024,
+)
+output.images[0].save("cascade-combined.png")
 ```
 ## Uses
 ### Out-of-Scope Use
+The model was not trained to be factual or true representations of people or events,
 and therefore using the model to generate such content is out-of-scope for the abilities of this model.
 The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy).
 ## How to Get Started with the Model
+Check out https://github.com/Stability-AI/StableCascade

image_encoder/config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "StableCascade-prior/image_encoder",
   "architectures": [
     "CLIPVisionModelWithProjection"
   ],
@@ -19,5 +19,5 @@
   "patch_size": 14,
   "projection_dim": 768,
   "torch_dtype": "bfloat16",
-  "transformers_version": "4.38.0.dev0"
 }

 {
+  "_name_or_path": "openai/clip-vit-large-patch14",
   "architectures": [
     "CLIPVisionModelWithProjection"
   ],
   "patch_size": 14,
   "projection_dim": 768,
   "torch_dtype": "bfloat16",
+  "transformers_version": "4.38.2"
 }

image_encoder/model.bf16.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e4b33d864f89a793357a768cb07d0dc18d6a14e6664f4110a0d535ca9ba78da8
+size 607980488

image_encoder/model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e4b33d864f89a793357a768cb07d0dc18d6a14e6664f4110a0d535ca9ba78da8
-size 607980488

 version https://git-lfs.github.com/spec/v1
+oid sha256:77b33d2a3a643650857672e880ccf73adbaf114fbbadec36d142ee9d48af7e20
+size 1215912728
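
The two image encoder pointers in this commit can be compared directly. The new `model.bf16.safetensors` carries the same `oid` that `model.safetensors` had before the change, and the replacement `model.safetensors` is roughly twice as large. A small sketch of that comparison follows; `parse_lfs_pointer` is a hypothetical helper written for this note, and reading the larger file as the full precision weights is an assumption based on the README wording.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its 'version', 'oid' and 'size' fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


# Pointer added as image_encoder/model.bf16.safetensors.
bf16_pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:e4b33d864f89a793357a768cb07d0dc18d6a14e6664f4110a0d535ca9ba78da8
size 607980488
""")

# Pointer now stored as image_encoder/model.safetensors.
full_pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:77b33d2a3a643650857672e880ccf73adbaf114fbbadec36d142ee9d48af7e20
size 1215912728
""")

# Close to 2, as expected if the larger file stores 4 bytes per weight.
size_ratio = full_pointer["size"] / bf16_pointer["size"]
```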

model_index.json CHANGED

@@ -1,7 +1,6 @@
 {
   "_class_name": "StableCascadePriorPipeline",
-  "_diffusers_version": "0.26.0.dev0",
-  "_name_or_path": "StableCascade-prior/",
   "feature_extractor": [
     "transformers",
     "CLIPImageProcessor"
@@ -11,8 +10,8 @@
     "CLIPVisionModelWithProjection"
   ],
   "prior": [
-    "stable_cascade",
-    "StableCascadeUnet"
   ],
   "resolution_multiple": 42.67,
   "scheduler": [

 {
   "_class_name": "StableCascadePriorPipeline",
+  "_diffusers_version": "0.27.0.dev0",
   "feature_extractor": [
     "transformers",
     "CLIPImageProcessor"
     "CLIPVisionModelWithProjection"
   ],
   "prior": [
+    "diffusers",
+    "StableCascadeUNet"
   ],
   "resolution_multiple": 42.67,
   "scheduler": [

prior/config.json CHANGED

@@ -1,61 +1,64 @@
 {
-  "_class_name": "StableCascadeUnet",
-  "_diffusers_version": "0.26.0.dev0",
-  "_name_or_path": "StableCascade-prior/prior",
-  "block_repeat": [
-    [
-      1,
-      1
-    ],
-    [
-      1,
-      1
-    ]
   ],
-  "blocks": [
     [
-      8,
-      24
     ],
     [
-      24,
-      8
     ]
   ],
-  "c_clip_img": 768,
-  "c_clip_seq": 4,
-  "c_clip_text": 1280,
-  "c_clip_text_pooled": 1280,
-  "c_cond": 2048,
-  "c_effnet": null,
-  "c_hidden": [
-    2048,
-    2048
   ],
-  "c_in": 16,
-  "c_out": 16,
-  "c_pixels": null,
-  "c_r": 64,
   "dropout": [
     0.1,
     0.1
   ],
   "kernel_size": 3,
-  "level_config": [
-    "CTA",
-    "CTA"
-  ],
-  "nhead": [
     32,
     32
   ],
   "patch_size": 1,
   "self_attn": true,
   "switch_level": [
     false
   ],
-  "t_conds": [
     "sca",
     "crp"
   ]
 }

 {
+  "_class_name": "StableCascadeUNet",
+  "_diffusers_version": "0.27.0.dev0",
+  "block_out_channels": [
+    2048,
+    2048
   ],
+  "block_types_per_layer": [
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
     ],
     [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
     ]
   ],
+  "clip_image_in_channels": 768,
+  "clip_seq": 4,
+  "clip_text_in_channels": 1280,
+  "clip_text_pooled_in_channels": 1280,
+  "conditioning_dim": 2048,
+  "down_blocks_repeat_mappers": [
+    1,
+    1
+  ],
+  "down_num_layers_per_block": [
+    8,
+    24
   ],
   "dropout": [
     0.1,
     0.1
   ],
+  "effnet_in_channels": null,
+  "in_channels": 16,
   "kernel_size": 3,
+  "num_attention_heads": [
     32,
     32
   ],
+  "out_channels": 16,
   "patch_size": 1,
+  "pixel_mapper_in_channels": null,
   "self_attn": true,
   "switch_level": [
     false
   ],
+  "timestep_conditioning_type": [
     "sca",
     "crp"
+  ],
+  "timestep_ratio_embedding_dim": 64,
+  "up_blocks_repeat_mappers": [
+    1,
+    1
+  ],
+  "up_num_layers_per_block": [
+    24,
+    8
   ]
 }
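
The key renames in this file can be summarized as a mapping. The sketch below is inferred purely from comparing the removed and added sides of this diff. It is not the conversion script used for the commit, and the reading of the `"CTA"` letters as residual, timestep and attention blocks is an assumption drawn from the order of `block_types_per_layer`.

```python
# Old key -> new key, read off the removed and added lines above.
RENAMED_KEYS = {
    "c_hidden": "block_out_channels",
    "c_clip_img": "clip_image_in_channels",
    "c_clip_seq": "clip_seq",
    "c_clip_text": "clip_text_in_channels",
    "c_clip_text_pooled": "clip_text_pooled_in_channels",
    "c_cond": "conditioning_dim",
    "c_effnet": "effnet_in_channels",
    "c_in": "in_channels",
    "c_out": "out_channels",
    "c_pixels": "pixel_mapper_in_channels",
    "c_r": "timestep_ratio_embedding_dim",
    "nhead": "num_attention_heads",
    "t_conds": "timestep_conditioning_type",
}

# Assumed meaning of each letter in the old "level_config" strings.
LETTER_TO_BLOCK = {
    "C": "SDCascadeResBlock",
    "T": "SDCascadeTimestepBlock",
    "A": "SDCascadeAttnBlock",
}

# Keys that are split, expanded or dropped rather than renamed.
SPECIAL_KEYS = ("blocks", "block_repeat", "level_config", "_name_or_path")


def convert_prior_config(old: dict) -> dict:
    """Translate an old-style config dict into the new key layout."""
    new = {RENAMED_KEYS.get(k, k): v for k, v in old.items() if k not in SPECIAL_KEYS}
    new["_class_name"] = "StableCascadeUNet"
    # Each paired list splits into a "down" half and an "up" half.
    new["down_num_layers_per_block"], new["up_num_layers_per_block"] = old["blocks"]
    new["down_blocks_repeat_mappers"], new["up_blocks_repeat_mappers"] = old["block_repeat"]
    new["block_types_per_layer"] = [
        [LETTER_TO_BLOCK[letter] for letter in level] for level in old["level_config"]
    ]
    return new


old_config = {
    "_class_name": "StableCascadeUnet",
    "_name_or_path": "StableCascade-prior/prior",
    "blocks": [[8, 24], [24, 8]],
    "block_repeat": [[1, 1], [1, 1]],
    "level_config": ["CTA", "CTA"],
    "c_hidden": [2048, 2048],
    "c_r": 64,
    "nhead": [32, 32],
    "kernel_size": 3,
}
new_config = convert_prior_config(old_config)
```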

prior/diffusion_pytorch_model.bf16.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44a4cd9540f327f2fb4ac09179e4e87912a01cdb1b3b86c79f0f853976fb4c98
+size 7178377816

prior/diffusion_pytorch_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:44a4cd9540f327f2fb4ac09179e4e87912a01cdb1b3b86c79f0f853976fb4c98
-size 7178377816

 version https://git-lfs.github.com/spec/v1
+oid sha256:0a2c7aa62c503780b85f74fd513b1b99c12ea4f83422bdbad5ac264aa68efb4b
+size 14356584672

prior_lite/config.json ADDED

	@@ -0,0 +1,64 @@

+{
+  "_class_name": "StableCascadeUNet",
+  "_diffusers_version": "0.27.0.dev0",
+  "block_out_channels": [
+    1536,
+    1536
+  ],
+  "block_types_per_layer": [
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
+    ],
+    [
+      "SDCascadeResBlock",
+      "SDCascadeTimestepBlock",
+      "SDCascadeAttnBlock"
+    ]
+  ],
+  "clip_image_in_channels": 768,
+  "clip_seq": 4,
+  "clip_text_in_channels": 1280,
+  "clip_text_pooled_in_channels": 1280,
+  "conditioning_dim": 1536,
+  "down_blocks_repeat_mappers": [
+    1,
+    1
+  ],
+  "down_num_layers_per_block": [
+    4,
+    12
+  ],
+  "dropout": [
+    0.1,
+    0.1
+  ],
+  "effnet_in_channels": null,
+  "in_channels": 16,
+  "kernel_size": 3,
+  "num_attention_heads": [
+    24,
+    24
+  ],
+  "out_channels": 16,
+  "patch_size": 1,
+  "pixel_mapper_in_channels": null,
+  "self_attn": true,
+  "switch_level": [
+    false
+  ],
+  "timestep_conditioning_type": [
+    "sca",
+    "crp"
+  ],
+  "timestep_ratio_embedding_dim": 64,
+  "up_blocks_repeat_mappers": [
+    1,
+    1
+  ],
+  "up_num_layers_per_block": [
+    12,
+    4
+  ]
+}

prior_lite/diffusion_pytorch_model.bf16.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b1f1e7f429fe290bead3b044734a4aa21ad7e6ae4ed709fc85f65d8d7460190e
+size 2061655280

prior_lite/diffusion_pytorch_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:61a01756dcaecda654074624fd5f993dbe22c7f5cb0d08887416e6c594179a6a
+size 4123225040
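
The LFS sizes of the `prior` and `prior_lite` weights give a quick estimate of their parameter counts, which can be compared with the 3.6 billion and 1 billion figures the README gives for Stage C. This is a back-of-the-envelope sketch: it assumes the `bf16` files store 2 bytes per parameter and the unsuffixed files 4 bytes, and it ignores the small safetensors header.

```python
# Assumed storage cost per parameter for each variant.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}

# Sizes in bytes, copied from the LFS pointers in this commit.
LFS_SIZES = {
    ("prior", "bf16"): 7178377816,
    ("prior", "fp32"): 14356584672,
    ("prior_lite", "bf16"): 2061655280,
    ("prior_lite", "fp32"): 4123225040,
}


def approx_params_billions(size_bytes: int, variant: str) -> float:
    """Approximate parameter count in billions for a weight file."""
    return size_bytes / BYTES_PER_PARAM[variant] / 1e9


estimates = {
    key: round(approx_params_billions(size, key[1]), 2)
    for key, size in LFS_SIZES.items()
}
```

Both variants of each model give the same estimate, about 3.59 billion parameters for `prior` and about 1.03 billion for `prior_lite`, which is consistent with the README figures.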

scheduler/scheduler_config.json CHANGED

@@ -1,6 +1,6 @@
 {
   "_class_name": "DDPMWuerstchenScheduler",
-  "_diffusers_version": "0.26.0.dev0",
   "s": 0.008,
   "scaler": 1.0
 }

 {
   "_class_name": "DDPMWuerstchenScheduler",
+  "_diffusers_version": "0.27.0.dev0",
   "s": 0.008,
   "scaler": 1.0
 }

text_encoder/config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "StableCascade-prior/text_encoder",
   "architectures": [
     "CLIPTextModelWithProjection"
   ],
@@ -20,6 +20,6 @@
   "pad_token_id": 1,
   "projection_dim": 1280,
   "torch_dtype": "bfloat16",
-  "transformers_version": "4.38.0.dev0",
   "vocab_size": 49408
 }

 {
+  "_name_or_path": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
   "architectures": [
     "CLIPTextModelWithProjection"
   ],
   "pad_token_id": 1,
   "projection_dim": 1280,
   "torch_dtype": "bfloat16",
+  "transformers_version": "4.38.2",
   "vocab_size": 49408
 }

text_encoder/model.bf16.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:260e0127aca3c89db813637ae659ebb822cb07af71fedc16cbd980e9518dfdcd
+size 1389382688

text_encoder/model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:260e0127aca3c89db813637ae659ebb822cb07af71fedc16cbd980e9518dfdcd
-size 1389382688

 version https://git-lfs.github.com/spec/v1
+oid sha256:fa5b2e6f4c2efc2d82e4b8312faec1a5540eabfc6415126c9a05c8436a530ef4
+size 2778702264

tokenizer/tokenizer.json CHANGED

@@ -1,21 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 77,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": {
-      "Fixed": 77
-    },
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 49407,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
   "added_tokens": [
     {
       "id": 49406,

 {
   "version": "1.0",
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 49406,