Commit 82ecff2 (parent: ecfbd22)
rwightman committed: Update README.md

Files changed (1): README.md (+4 -2)
README.md CHANGED
@@ -16,7 +16,7 @@ license: mit
 
 ## Model Description
 
-A series of CLIP [ConvNeXt-Large](https://arxiv.org/abs/2201.03545) (w/ extra text depth, vision MLP head) models trained on subsets [LAION-5B](https://arxiv.org/abs/2210.08402) using [OpenCLIP](https://github.com/mlfoundations/open_clip).
+A series of CLIP [ConvNeXt-Large](https://arxiv.org/abs/2201.03545) (w/ extra text depth, vision MLP head) models trained on LAION-2B (english), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).
 
 Goals:
 * Explore an alternative to ViT and ResNet (w/ AttentionPooling) CLIP models that scales well with model size and image resolution
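For context on the README being edited: checkpoints like these load directly through OpenCLIP's Hugging Face hub integration. A minimal usage sketch, assuming the repo id from the results table below and the standard `open_clip` API (illustrative, not part of this commit):

```python
import torch
from PIL import Image
import open_clip

# Pull weights, preprocessing transform, and tokenizer from the hub repo.
repo = 'hf-hub:laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg'
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # any test image
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score prompts by cosine similarity (zero-shot style).
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```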
@@ -34,10 +34,11 @@ The models are trained at 256x256 (working on 384 variants) image resolution.
 
 At 256x256, the ConvNext-Large-D used roughly 1/2 the training FLOPs to achieve accuracy greater than previous L/14 model trained on LAION-2B. L/14 model is ~1.65x more GMAC, 1.45x more activations, and 1.22x more parameters. The ConvNeXt was trained with 26B samples-seen and L/14 with 34B.
 
-
 | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
 | ----- | ------- | ---------- | ------------ | --------- |
 | [convnext_large_d.laion2b_s26b_b102k-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 75.9 |
+| [convnext_large_d_320.laion2b_s29b_b131k-ft](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.6 |
+| [convnext_large_d_320.laion2b_s29b_b131k-ft-soup](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup) | LAION-2B | 320x320 | RRC (0.5, 1.0), RE (0.4), SD (0.1), D(0.0) | 76.9 |
 
 RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only, D = Dropout (prob) -- image tower head only
 
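A note on the AugReg legend: RRC and RE are data-pipeline settings, while SD and D are model-side regularizers. A rough sketch of the 256x256 recipe's train-time transform via `timm.data.create_transform`, assuming it mirrors the table's values (illustrative only, not the exact training pipeline):

```python
from timm.data import create_transform

# Approximates RRC (0.33, 1.0) and RE (0.35) from the table. SD (0.1) and
# D (0.1) are stochastic depth / dropout probabilities set when building
# the image tower, not in the data transform.
train_transform = create_transform(
    input_size=256,
    is_training=True,
    scale=(0.33, 1.0),   # RRC: random-resized-crop area fraction range
    re_prob=0.35,        # RE: random erasing probability
    interpolation='bicubic',
)
print(train_transform)
```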
 
@@ -101,6 +102,7 @@ For 256x256 models, a slurm script w/ srun below was used on 16 8-GPU (A100 80GB
     --batch-size=800 \
     --epochs=128 \
     --dataset-resampled \
+    --aug-cfg use_timm=True scale='(0.33, 1.0)' re_prob=0.35 \
     --clip-grad-norm 5.0 \
     --lr 1.667e-3 \
     --workers=6 \
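The added `--aug-cfg` flag passes free-form `key=value` pairs to the augmentation config. A hedged sketch of how such pairs are commonly turned into typed Python values (modeled on an argparse custom action with `ast.literal_eval`; OpenCLIP's actual parser may differ):

```python
import argparse
import ast

class ParseKwargs(argparse.Action):
    """Collect 'key=value' CLI tokens into a dict of typed values."""
    def __call__(self, parser, namespace, values, option_string=None):
        kwargs = {}
        for pair in values:
            key, value = pair.split('=', 1)
            try:
                # 'True' -> bool, '(0.33, 1.0)' -> tuple, '0.35' -> float
                kwargs[key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                kwargs[key] = value  # fall back to the raw string
        setattr(namespace, self.dest, kwargs)

parser = argparse.ArgumentParser()
parser.add_argument('--aug-cfg', nargs='*', default={}, action=ParseKwargs)
# Shell quoting in the srun command collapses scale='(0.33, 1.0)' to one token.
args = parser.parse_args(['--aug-cfg', 'use_timm=True', 'scale=(0.33, 1.0)', 're_prob=0.35'])
print(args.aug_cfg)  # {'use_timm': True, 'scale': (0.33, 1.0), 're_prob': 0.35}
```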
 