- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320)
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study (arXiv:2407.02477)
- TokenPacker: Efficient Visual Projector for Multimodal LLM (arXiv:2407.02392)
- Instruction Pre-Training: Language Models are Supervised Multitask Learners (arXiv:2406.14491)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (arXiv:2406.11069)
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models (arXiv:2406.11839)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (arXiv:2406.11833)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark (arXiv:2406.05967)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418)
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation (arXiv:2406.09961)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding (arXiv:2406.09411)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (arXiv:2406.08487)
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series (arXiv:2405.19327, published May 29)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738, published May 24)
- PaliGemma Release (collection): pretrained and mix checkpoints for PaliGemma, 16 items
- InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation (arXiv:2404.19427, published Apr 30)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821, published Apr 25)
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model (arXiv:2404.01331, published Mar 29)
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530, published Mar 8)