ReferEverything: Towards Segmenting Everything We Can Speak of in Videos Paper • 2410.23287 • Published 25 days ago • 17
Can MLLMs Understand the Deep Implication Behind Chinese Images? Paper • 2410.13854 • Published Oct 17 • 8
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control Paper • 2410.13830 • Published Oct 17 • 23
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens Paper • 2410.13863 • Published Oct 17 • 35
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree Paper • 2410.16268 • Published Oct 21 • 65
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting Paper • 2410.17856 • Published Oct 23 • 49
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published Oct 23 • 34
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions Paper • 2406.07502 • Published Jun 11 • 1