license: apache-2.0
Diffusion Feedback Helps CLIP See Better
Wenxuan Wang1,2,3*, Quan Sun3*, Fan Zhang3, Yepeng Tang4, Jing Liu1,2, Xinlong Wang3
In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., 3-7% ↑), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.
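The idea of using generative feedback to update an image encoder can be illustrated with a toy example. The sketch below is not the DIVA implementation: the linear "encoder" `W` and the linear "denoiser" (`A`, `B`) are hypothetical stand-ins for CLIP and a text-to-image diffusion model, and keeping the denoiser frozen while only the encoder is updated is an assumption made for illustration. It only shows how a noise-prediction loss computed from images alone (no text) can produce gradients for the encoder.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def diffusion_loss_and_grad(W, A, B, x0, eps, alpha_bar):
    """Noise-prediction MSE and its gradient w.r.t. the encoder weights W."""
    # Forward diffusion: blend the clean image with Gaussian noise.
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    # Toy encoder: its features act as the condition for the denoiser.
    cond = matvec(W, x0)
    # Toy frozen denoiser (assumption): eps_hat = A @ x_t + B @ cond.
    eps_hat = [a + b for a, b in zip(matvec(A, x_t), matvec(B, cond))]
    resid = [p - e for p, e in zip(eps_hat, eps)]
    n = len(resid)
    loss = sum(r * r for r in resid) / n
    # Chain rule: dL/dcond = (2/n) * B^T resid, dL/dW = outer(dL/dcond, x0).
    d_cond = [2.0 / n * sum(B[i][j] * resid[i] for i in range(n))
              for j in range(len(cond))]
    grad_W = [[g * x for x in x0] for g in d_cond]
    return loss, grad_W


def train_step(W, A, B, x0, eps, alpha_bar, lr=0.1):
    """One gradient step on the encoder only; A and B are left untouched."""
    loss, grad_W = diffusion_loss_and_grad(W, A, B, x0, eps, alpha_bar)
    new_W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    return loss, new_W


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
d, k = 4, 2                                   # toy image and feature sizes
x0 = [rng.uniform(-1, 1) for _ in range(d)]   # an "image" with no caption
eps = [rng.gauss(0, 1) for _ in range(d)]     # sampled noise
W = rand_matrix(k, d, rng)                    # trainable encoder
A = rand_matrix(d, d, rng)                    # frozen denoiser weights
B = rand_matrix(d, k, rng)

loss_before, W_new = train_step(W, A, B, x0, eps, alpha_bar=0.5)
loss_after, _ = diffusion_loss_and_grad(W_new, A, B, x0, eps, alpha_bar=0.5)
print(f"loss before: {loss_before:.4f}, after one encoder step: {loss_after:.4f}")
```

In this toy setting the supervision signal comes entirely from the image and the sampled noise, which mirrors the image-only, self-supervised flavor of the approach described above.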
Model Zoo
Method | Image Size | Params (M) | MMVP-VLM Average Score (Gain) |
---|---|---|---|
OpenAI ViT-L-14 | 224² | 427.6 | 25.9 (+6.6) |
OpenAI ViT-L-14 | 336² | 427.9 | 25.2 (+5.2) |
MetaCLIP ViT-L-14 | 224² | 427.6 | 27.4 (+3.7) |
MetaCLIP ViT-H-14 | 224² | 986.1 | 31.9 (+6.7) |
SigLIP ViT-SO-14 | 224² | 877.4 | 40.7 (+2.9) |
SigLIP ViT-SO-14 | 384² | 878.0 | 38.5 (+1.5) |
DFN ViT-H-14 | 224² | 986.1 | 43.7 (+4.4) |
DFN ViT-H-14 | 378² | 986.7 | 37.8 (+3.0) |
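As a reading aid for the table, the short sketch below recovers the pre-DIVA score implied by each row. It assumes the parenthesized value is the absolute gain over the original checkpoint, so that baseline = score - gain; the dictionary simply copies the numbers from the table, and `implied_baseline` is a hypothetical helper name.

```python
# (method, image size) -> (reported average score, parenthesized value from the table)
rows = {
    ("OpenAI ViT-L-14", 224): (25.9, 6.6),
    ("OpenAI ViT-L-14", 336): (25.2, 5.2),
    ("MetaCLIP ViT-L-14", 224): (27.4, 3.7),
    ("MetaCLIP ViT-H-14", 224): (31.9, 6.7),
    ("SigLIP ViT-SO-14", 224): (40.7, 2.9),
    ("SigLIP ViT-SO-14", 384): (38.5, 1.5),
    ("DFN ViT-H-14", 224): (43.7, 4.4),
    ("DFN ViT-H-14", 378): (37.8, 3.0),
}


def implied_baseline(score, gain):
    """Score before post-training, assuming the parenthesized value is an absolute gain."""
    return round(score - gain, 1)


for (method, size), (score, gain) in rows.items():
    print(f"{method} @ {size}: {implied_baseline(score, gain)} -> {score}")
```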
📝 Citation
If you find DIVA helpful for your research, please consider citing our paper 📝 and giving us a GitHub star ⭐:
@article{wang2024diffusion,
title={Diffusion Feedback Helps CLIP See Better},
author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
journal={arXiv preprint arXiv:2407.20171},
year={2024}
}