---
datasets:
- SPRIGHT-T2I/spright_coco
base_model: BeichenZhang/LongCLIP-L
---
## A fine-tune of Long-CLIP - original model: [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L)
- ❤️ this CLIP? [Help feed it](https://ko-fi.com/zer0int) if you can. Besides data, CLIP eats time & expensive German electricity. TY! 🤗
- Want to feed it yourself? All code for fine-tuning and much more is on [my GitHub](https://github.com/zer0int).
----
# Note on using Long-CLIP as the Text Encoder with Flux.1, SDXL, Stable Diffusion:
- Get the ComfyUI Long-CLIP nodes here: [https://github.com/SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP)
- If you don't use Comfy, it's at least a starting point for reverse engineering & applying it to your code! 🤗
----
# 🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👇
```
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```
# ❌ This fails with a size-mismatch error: the Transformers library defines 77 text tokens for CLIP, while this model has 248 positional embeddings.
# 👇
# Option 1 (simple & worse):
Truncate to 77 tokens
`CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)`
```
# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 👎
```
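For reference, a minimal sketch of how such image-vs-prompt cosine similarities can be computed with Transformers; the image path is a placeholder and this is not the exact evaluation script used for the numbers above:

```
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True).to("cuda").eval()
processor = CLIPProcessor.from_pretrained(model_id)

prompts = ["photo of a cat", "picture of a dog", "cat", "dog"]
image = Image.open("cat.jpg")  # placeholder: any cat photo

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    out = model(**inputs)

# Normalize the embeddings; the dot product is then the cosine similarity
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img @ txt.T)  # shape [1, 4]: one similarity per prompt
```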
# 👇
# Option 2, proper integration: 👍 RECOMMENDED 👍
- ### Solution for implementing the full 248 tokens / thanks [@kk3dmax](https://huggingface.co/zer0int/LongCLIP-GmP-ViT-L-14/discussions/3) 🤗
- A full example script using this solution for Flux.1 inference is on [my GitHub](https://github.com/zer0int/CLIP-txt2img-diffusers-scripts)
```
import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
dtype = torch.bfloat16

# Override the 77-token default with Long-CLIP's 248 positions
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248

clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# 'pipe' is an existing diffusers pipeline (Flux.1, SDXL, ...) -- see the sketch below
pipe.tokenizer = clip_processor.tokenizer  # Replace with the Long-CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the Long-CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16
```
```
# Resulting cosine similarities with padding to 248 tokens:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
```
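For context, a rough sketch of where `pipe` and `dtype` might come from when targeting Flux.1 via diffusers; the base-model path and sampler settings are assumptions, and the full working script is in the GitHub repo linked above:

```
import torch
from diffusers import FluxPipeline

dtype = torch.bfloat16
# Assumed base model; any diffusers pipeline exposing .tokenizer / .text_encoder works similarly
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=dtype).to("cuda")

# ...apply the tokenizer / text-encoder replacement from the snippet above, then:
prompt = "A very long, detailed prompt of up to 248 tokens describing the scene ..."
image = pipe(prompt=prompt, num_inference_steps=30, guidance_scale=3.5).images[0]
image.save("out.png")
```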
----
## Update 12/AUG/2024:
New *BEST* model, trained with a custom loss that uses label smoothing.
The gain is small for a large, diverse, good-quality dataset, but big relative gains are possible for overfit-prone fine-tunes (small batch size, 1 GPU, a narrow dataset of e.g. 'sneakers', etc.)!
Fine-tune your model with the provided GmP-Smooth code: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/l3FYkaicihqXv5D9wLDAF.png)
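The exact custom loss lives in the repo linked above; as a rough illustration of where label smoothing enters a CLIP-style contrastive loss, here is a minimal, hypothetical sketch (the smoothing value and function name are assumptions):

```
import torch
import torch.nn.functional as F

def clip_loss_with_label_smoothing(image_features, text_features, logit_scale, smoothing=0.1):
    # Symmetric CLIP-style contrastive loss; label smoothing softens the one-hot targets.
    # smoothing=0.1 is an illustrative value, not the setting used for this fine-tune.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    targets = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits_per_image.t(), targets, label_smoothing=smoothing)
    return (loss_i + loss_t) / 2
```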
----
The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.
Made possible with Geometric Parametrization (GmP):
```
"Normal" CLIP MLP (multi-layer perceptron):
(mlp): Sequential(
|-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
| (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
|-(c_fc): GeometricLinear()
| (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
```
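This is not the repository's actual implementation, but a minimal sketch of how a `GeometricLinear` layer following that decomposition could be built (the class name matches the diagram above; everything else is an assumption):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    # Sketch: re-parametrize a pre-trained nn.Linear as radius 'r' + direction 'theta'.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                   # [out_features, in_features]
        norm = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norm.clone())      # radial component: per-row magnitude
        self.theta = nn.Parameter(w / norm)      # angular component: unit direction
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x):
        # Effective weight = magnitude * (re-)normalized direction
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```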
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/OqhNxW-D9c58mkZyUQlL_.png)
❗
The model / state_dict I am sharing was converted back to .weight after fine-tuning - thus, it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder using the [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗
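A rough sketch of what that back-conversion looks like, assuming the GmP parameter names from the diagram above (file names are placeholders; the actual conversion code is in the linked repo):

```
import torch
import torch.nn.functional as F

sd = torch.load("longclip_gmp_finetuned.pt", map_location="cpu")  # hypothetical file name
for key in [k for k in sd.keys() if k.endswith(".r")]:
    base = key[:-2]                                        # e.g. '...mlp.c_fc'
    r, theta = sd.pop(base + ".r"), sd.pop(base + ".theta")
    sd[base + ".weight"] = r * F.normalize(theta, dim=1)   # back to a plain Linear weight
torch.save(sd, "longclip_finetuned_weights.pt")
```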
** For details on training, the eval behind those numbers, or to fine-tune the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
```
@article{zhang2024longclip,
title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
journal={arXiv preprint arXiv:2403.15378},
year={2024}
}
```
Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)