VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information Paper • 2412.00947 • Published 3 days ago • 6
VLSBench: Unveiling Visual Leakage in Multimodal Safety Paper • 2411.19939 • Published 5 days ago • 6
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters Paper • 2412.00174 • Published 5 days ago • 13
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models Paper • 2412.01822 • Published 1 day ago • 10
Open-Sora Plan: Open-Source Large Video Generation Model Paper • 2412.00131 • Published 6 days ago • 22
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models Paper • 2412.01824 • Published 1 day ago • 50
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video Paper • 2411.18671 • Published 7 days ago • 14
On Domain-Specific Post-Training for Multimodal Large Language Models Paper • 2411.19930 • Published 5 days ago • 23
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper • 2411.18203 • Published 7 days ago • 26
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models Paper • 2411.18613 • Published 7 days ago • 42
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Paper • 2411.17787 • Published 8 days ago • 11
ROICtrl: Boosting Instance Control for Visual Generation Paper • 2411.17949 • Published 7 days ago • 77
SketchAgent: Language-Driven Sequential Sketch Generation Paper • 2411.17673 • Published 8 days ago • 14
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published 12 days ago • 18
Star Attention: Efficient LLM Inference over Long Sequences Paper • 2411.17116 • Published 8 days ago • 42
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published 8 days ago • 18