-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 38 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19
Collections
Discover the best community collections!
Collections including paper arxiv:2406.08478
-
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 50 -
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 13 -
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9 -
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 39
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
-
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper • 2404.19752 • Published • 22 -
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 53 -
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75 -
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
-
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37 -
Improving Text Embeddings with Large Language Models
Paper • 2401.00368 • Published • 79 -
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 100 -
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48