- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19
Collections including paper arxiv:2407.08583
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
  Paper • 2407.21770 • Published • 22
- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 38
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
  Paper • 2407.08583 • Published • 10
- Vision language models are blind
  Paper • 2407.06581 • Published • 82

- DataComp: In search of the next generation of multimodal datasets
  Paper • 2304.14108 • Published • 2
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
  Paper • 2407.08583 • Published • 10
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
  Paper • 2411.04709 • Published • 20
- YFCC100M: The New Data in Multimedia Research
  Paper • 1503.01817 • Published • 1

- TinyLLaVA: A Framework of Small-scale Large Multimodal Models
  Paper • 2402.14289 • Published • 19
- ImageBind: One Embedding Space To Bind Them All
  Paper • 2305.05665 • Published • 3
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
  Paper • 2206.02770 • Published • 3