## Model Description

A series of CLIP [ConvNeXt-Large](https://arxiv.org/abs/2201.03545) (w/ extra text depth, vision MLP head) models trained on LAION-2B (english), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).

Goals:
* Explore an alternative to ViT and ResNet (w/ AttentionPooling) CLIP models that scales well with model size and image resolution
The models are trained at 256x256 (working on 384 variants) image resolution.

At 256x256, the ConvNeXt-Large-D used roughly half the training FLOPs of the previous L/14 model trained on LAION-2B while reaching higher accuracy. The L/14 model has ~1.65x the GMACs, 1.45x the activations, and 1.22x the parameters. The ConvNeXt was trained with 26B samples seen, the L/14 with 34B.
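The compute claim above can be sanity-checked with a back-of-the-envelope calculation (an illustrative sketch: it assumes training FLOPs scale with per-sample GMACs times samples seen, using only the ratios quoted in this card):

```python
# Back-of-the-envelope check of the "roughly half the training FLOPs" claim.
# Assumption: training compute ~ per-sample GMACs x samples seen.
convnext_samples_b = 26    # ConvNeXt-Large-D samples seen (billions)
l14_samples_b = 34         # ViT-L/14 samples seen (billions)
l14_gmac_ratio = 1.65      # L/14 uses ~1.65x the GMACs of the ConvNeXt

# Training FLOPs of ConvNeXt-Large-D relative to L/14
# (taking the ConvNeXt's per-sample GMACs as 1 unit):
rel_flops = (convnext_samples_b * 1.0) / (l14_samples_b * l14_gmac_ratio)
print(round(rel_flops, 2))  # -> 0.46, i.e. roughly half
```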
| Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
| ----- | ------- | ---------- | ------ | ---------------------------- |
| [convnext_large_d.laion2b_s26b_b102k-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D (0.1) | 75.9 |
| [convnext_large_d_320.laion2b_s29b_b131k-ft](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D (0.0) | 76.6 |
| [convnext_large_d_320.laion2b_s29b_b131k-ft-soup](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D (0.0) | 76.9 |

RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only, D = Dropout (prob) -- image tower head only
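The RRC and RE values in the AugReg column are the `scale` and `re_prob` keys passed via OpenCLIP's `--aug-cfg` flag in the training command shown in this card; SD and D are image-tower model settings rather than data-augmentation keys. A hypothetical sketch of rendering the 256x256 model's settings as that flag string:

```python
# Hypothetical helper: render the 256x256 model's AugReg settings as the
# --aug-cfg key=value pairs used in the OpenCLIP training command.
augreg = {
    "use_timm": True,        # use timm's augmentation pipeline
    "scale": (0.33, 1.0),    # RRC: random resize crop percentage range
    "re_prob": 0.35,         # RE: random erasing probability
}
# SD (stochastic depth) and D (dropout) are set on the image tower itself,
# so they do not appear among the --aug-cfg pairs here.
parts = []
for key, val in augreg.items():
    # Tuples need quoting so the shell passes them through as one token.
    parts.append(f"{key}='{val}'" if isinstance(val, tuple) else f"{key}={val}")
cmd = "--aug-cfg " + " ".join(parts)
print(cmd)  # -> --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35
```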
For 256x256 models, a slurm script w/ srun below was used on 16 8-GPU (A100 80GB) nodes:

```
--batch-size=800 \
--epochs=128 \
--dataset-resampled \
--aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
--clip-grad-norm 5.0 \
--lr 1.667e-3 \
--workers=6 \
```
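The per-GPU batch size and node count above imply the global batch size encoded in the model names (a quick check; it assumes `--batch-size=800` is per GPU across the stated 16 nodes x 8 GPUs):

```python
# Global batch size implied by the training flags.
# Assumption: --batch-size=800 is per GPU, with 16 nodes x 8 A100s each.
per_gpu_batch = 800
world_size = 16 * 8            # 128 GPUs total
global_batch = per_gpu_batch * world_size
print(global_batch)  # -> 102400, the "b102k" in the 256x256 model's name
```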