- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19
Collections including paper arxiv:2404.09204
- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 14
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 7
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
  Paper • 2311.07574 • Published • 14
- MyVLM: Personalizing VLMs for User-Specific Queries
  Paper • 2403.14599 • Published • 15

- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
  Paper • 2403.05525 • Published • 39
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 7
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 25

- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
  Paper • 2404.09990 • Published • 12
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
  Paper • 2404.09956 • Published • 11
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
  Paper • 2404.09204 • Published • 10
- Taming Latent Diffusion Model for Neural Radiance Field Inpainting
  Paper • 2404.09995 • Published • 6

- DiJiang: Efficient Large Language Models through Compact Kernelization
  Paper • 2403.19928 • Published • 10
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
  Paper • 2309.12284 • Published • 18
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
  Paper • 2404.09204 • Published • 10
- SAGS: Structure-Aware 3D Gaussian Splatting
  Paper • 2404.19149 • Published • 13

- TextCraftor: Your Text Encoder Can be Image Quality Controller
  Paper • 2403.18978 • Published • 13
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
  Paper • 2404.02733 • Published • 20
- OmniFusion Technical Report
  Paper • 2404.06212 • Published • 74
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation
  Paper • 2404.07448 • Published • 11

- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
  Paper • 2403.12943 • Published • 14
- Masked Audio Generation using a Single Non-Autoregressive Transformer
  Paper • 2401.04577 • Published • 42
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
  Paper • 2404.02747 • Published • 11
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
  Paper • 2404.02733 • Published • 20