## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) -- Long-CLIP ViT-L/14 expanded to 248 tokens.

The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.

Made possible with Geometric Parametrization (GmP):

```
"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
|-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
| (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
|-(c_fc): GeometricLinear()
| (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
```
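
The decomposition above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repo's actual `GeometricLinear` code: the helper names `gmp_decompose` / `gmp_recompose` are hypothetical, and taking `r` as the per-row (per-output-neuron) norm of a `(out_features, in_features)` weight matrix is an assumption about the granularity used.

```python
import numpy as np

def gmp_decompose(weight):
    # Hypothetical helper: split a pre-trained weight matrix into
    # GmP components (assumed per-row decomposition).
    r = np.linalg.norm(weight, axis=1, keepdims=True)  # radial component 'r'
    theta = weight / r                                 # angular component 'theta'
    return r, theta

def gmp_recompose(r, theta):
    # Reconstruct a standard .weight matrix from r and theta.
    return r * theta

# Toy stand-in for a c_fc weight (real shape would be 4096 x 1024)
w = np.random.randn(8, 4)
r, theta = gmp_decompose(w)

assert np.allclose(gmp_recompose(r, theta), w)          # lossless round trip
assert np.allclose(np.linalg.norm(theta, axis=1), 1.0)  # theta rows are unit vectors
```

The round trip being lossless is the point: training updates `r` and `theta` separately, yet a plain `.weight` tensor can always be recovered afterwards.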

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder using [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗
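
Folding the GmP components back into standard keys amounts to multiplying each `.r` by its `.theta` and renaming the result `.weight`. A minimal sketch of that idea, assuming the key naming from the diagram above (`convert_gmp_state_dict` is a hypothetical helper; the actual conversion script lives in the linked repo):

```python
import numpy as np

def convert_gmp_state_dict(gmp_sd):
    # Fold .r / .theta pairs back into ordinary .weight tensors;
    # everything else (e.g. biases) passes through unchanged.
    out = {}
    for key, value in gmp_sd.items():
        if key.endswith(".theta"):
            base = key[:-len(".theta")]
            out[base + ".weight"] = gmp_sd[base + ".r"] * value
        elif not key.endswith(".r"):
            out[key] = value
    return out

# Toy example with key names from the diagram above
sd = {
    "visual.transformer.resblocks.0.mlp.c_fc.r": np.array([[2.0], [3.0]]),
    "visual.transformer.resblocks.0.mlp.c_fc.theta": np.eye(2),
    "visual.transformer.resblocks.0.mlp.c_fc.bias": np.zeros(2),
}
converted = convert_gmp_state_dict(sd)

assert np.allclose(
    converted["visual.transformer.resblocks.0.mlp.c_fc.weight"],
    np.array([[2.0, 0.0], [0.0, 3.0]]),
)
```

After such a conversion the state_dict has only standard `.weight` / `.bias` keys, which is why it loads like any other CLIP state_dict.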

** For details on the training and evaluation behind those numbers, or to fine-tune the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)

```
@article{zhang2024longclip,
  title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
  author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
  journal={arXiv preprint arXiv:2403.15378},
  year={2024}
}
```

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)