Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published Sep 12 • 43
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 97
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents Paper • 2407.17490 • Published Jul 3 • 30
Understanding Alignment in Multimodal LLMs: A Comprehensive Study Paper • 2407.02477 • Published Jul 2 • 21
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published Jun 17 • 57
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model Paper • 2312.11370 • Published Dec 18, 2023 • 20
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12 • 21
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches Paper • 2403.02709 • Published Mar 5 • 7
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models Paper • 2402.10524 • Published Feb 16 • 22
Lumos: Empowering Multimodal LLMs with Scene Text Recognition Paper • 2402.08017 • Published Feb 12 • 25
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions Paper • 2308.04152 • Published Aug 8, 2023 • 2
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7 • 38