- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 36
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19

Collections including paper arxiv:2403.18814
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 178
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 14
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 45
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 40

- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
  Paper • 2404.19752 • Published • 22
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  Paper • 2404.16821 • Published • 53
- MoAI: Mixture of All Intelligence for Large Language and Vision Models
  Paper • 2403.07508 • Published • 75
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 124

- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 14
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 6
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
  Paper • 2311.07574 • Published • 14
- MyVLM: Personalizing VLMs for User-Specific Queries
  Paper • 2403.14599 • Published • 15

- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 36
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
  Paper • 2403.05525 • Published • 39
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 6
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 24

- Jamba: A Hybrid Transformer-Mamba Language Model
  Paper • 2403.19887 • Published • 103
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 38
- ViTAR: Vision Transformer with Any Resolution
  Paper • 2403.18361 • Published • 51
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 44

- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
  Paper • 2309.10020 • Published • 40
- Language as the Medium: Multimodal Video Classification through text only
  Paper • 2309.10783 • Published • 1
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 44
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 52