Commit 82ecff2 (parent: ecfbd22)
rwightman committed: Update README.md

Files changed (1): README.md (+4 -2)
README.md CHANGED
@@ -16,7 +16,7 @@ license: mit
 
 ## Model Description
 
-A series of CLIP [ConvNeXt-Large](https://arxiv.org/abs/2201.03545) (w/ extra text depth, vision MLP head) models trained on subsets [LAION-5B](https://arxiv.org/abs/2210.08402) using [OpenCLIP](https://github.com/mlfoundations/open_clip).
+A series of CLIP [ConvNeXt-Large](https://arxiv.org/abs/2201.03545) (w/ extra text depth, vision MLP head) models trained on LAION-2B (english), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).
 
 Goals:
 * Explore an alternative to ViT and ResNet (w/ AttentionPooling) CLIP models that scales well with model size and image resolution
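For context on the README being edited: checkpoints like these load directly through OpenCLIP's Hugging Face hub integration. A minimal usage sketch, assuming the repo id from the results table below and the standard `open_clip` API (illustrative, not part of this commit):

```python
import torch
from PIL import Image
import open_clip

# Pull weights, preprocessing transform, and tokenizer from the hub repo.
repo = 'hf-hub:laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg'
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # any test image
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score prompts by cosine similarity (zero-shot style).
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```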
@@ -34,10 +34,11 @@ The models are trained at 256x256 (working on 384 variants) image resolution.
 
 At 256x256, the ConvNext-Large-D used roughly 1/2 the training FLOPs to achieve accuracy greater than previous L/14 model trained on LAION-2B. L/14 model is ~1.65x more GMAC, 1.45x more activations, and 1.22x more parameters. The ConvNeXt was trained with 26B samples-seen and L/14 with 34B.
 
-
 | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
 | ----- | ------- | ---------- | ------------ | --------- |
 | [convnext_large_d.laion2b_s26b_b102k-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 75.9 |
+| [convnext_large_d_320.laion2b_s29b_b131k-ft](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.6 |
+| [convnext_large_d_320.laion2b_s29b_b131k-ft-soup](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.9 |
 
 RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only, D = Dropout (prob) -- image tower head only
 
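A note on the AugReg legend: RRC and RE are data-pipeline settings, while SD and D are model-side regularizers. A rough sketch of the 256x256 recipe's train-time transform via `timm.data.create_transform`, assuming it mirrors the table's values (illustrative only, not the exact training pipeline):

```python
from timm.data import create_transform

# Approximates RRC (0.33, 1.0) and RE (0.35) from the table. SD (0.1) and
# D (0.1) are stochastic depth / dropout probabilities set when building
# the image tower, not in the data transform.
train_transform = create_transform(
    input_size=256,
    is_training=True,
    scale=(0.33, 1.0),   # RRC: random-resized-crop area fraction range
    re_prob=0.35,        # RE: random erasing probability
    interpolation='bicubic',
)
print(train_transform)
```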
 
@@ -101,6 +102,7 @@ For 256x256 models, a slurm script w/ srun below was used on 16 8-GPU (A100 80GB
     --batch-size=800 \
     --epochs=128 \
     --dataset-resampled \
+    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
     --clip-grad-norm 5.0 \
     --lr 1.667e-3 \
     --workers=6 \
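The added `--aug-cfg` flag passes free-form `key=value` pairs to the augmentation config. A hedged sketch of how such pairs are commonly turned into typed Python values (modeled on an argparse custom action with `ast.literal_eval`; OpenCLIP's actual parser may differ):

```python
import argparse
import ast

class ParseKwargs(argparse.Action):
    """Collect 'key=value' CLI tokens into a dict of typed values."""
    def __call__(self, parser, namespace, values, option_string=None):
        kwargs = {}
        for pair in values:
            key, value = pair.split('=', 1)
            try:
                # 'True' -> bool, '(0.33, 1.0)' -> tuple, '0.35' -> float
                kwargs[key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                kwargs[key] = value  # fall back to the raw string
        setattr(namespace, self.dest, kwargs)

parser = argparse.ArgumentParser()
parser.add_argument('--aug-cfg', nargs='*', default={}, action=ParseKwargs)
# Shell quoting in the srun command collapses scale='(0.33, 1.0)' to one token.
args = parser.parse_args(['--aug-cfg', 'use_timm=True', 'scale=(0.33, 1.0)', 're_prob=0.35'])
print(args.aug_cfg)  # {'use_timm': True, 'scale': (0.33, 1.0), 're_prob': 0.35}
```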
 