VisionLM (zerozeyi's Collections)
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (arXiv:2402.04252)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv:2402.03749)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (arXiv:2402.05008)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (arXiv:2402.05930)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (arXiv:2402.05935)
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling (arXiv:2402.06118)
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (arXiv:2402.07456)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs (arXiv:2402.07872)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models (arXiv:2402.07865)
- World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter (arXiv:2402.10896)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (arXiv:2402.12226)
- CoLLaVO: Crayon Large Language and Vision mOdel (arXiv:2402.11248)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning (arXiv:2402.11690)
- VideoPrism: A Foundational Visual Encoder for Video Understanding (arXiv:2402.13217)
- Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
- A Touch, Vision, and Language Dataset for Multimodal Alignment (arXiv:2402.13232)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (arXiv:2402.13220)
- BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models (arXiv:2402.13577)
- PALO: A Polyglot Large Multimodal Model for 5B People (arXiv:2402.14818)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv:2402.19479)
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies (arXiv:2403.01422)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (arXiv:2403.01487)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (arXiv:2403.02677)
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use (arXiv:2403.02626)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets (arXiv:2403.03194)
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (arXiv:2403.03003)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611)
- MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
- Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (arXiv:2403.07750)
- DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (arXiv:2403.06764)
- VideoMamba: State Space Model for Efficient Video Understanding (arXiv:2403.06977)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models (arXiv:2403.05438)
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (arXiv:2403.12895)
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs (arXiv:2403.12596)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
- Can large language models explore in-context? (arXiv:2403.15371)
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series (arXiv:2403.15360)
- VidLA: Video-Language Alignment at Scale (arXiv:2403.14870)
- ViTAR: Vision Transformer with Any Resolution (arXiv:2403.18361)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
- sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
- TextCraftor: Your Text Encoder Can be Image Quality Controller (arXiv:2403.18978)
- Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models (arXiv:2404.01197)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (arXiv:2404.01258)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
- LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (arXiv:2404.03653)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726)
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (arXiv:2404.05674)
- Koala: Key frame-conditioned long video-LLM (arXiv:2404.04346)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- Adapting LLaMA Decoder to Vision Transformer (arXiv:2404.06773)
- BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation (arXiv:2404.07448)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing (arXiv:2404.09990)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv:2404.09204)
- On Speculative Decoding for Multimodal Large Language Models (arXiv:2404.08856)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390)
- MultiBooth: Towards Generating All Your Concepts in an Image from Text (arXiv:2404.14239)
- A Multimodal Automated Interpretability Agent (arXiv:2404.14394)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning (arXiv:2404.12803)
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (arXiv:2404.15653)
- Editable Image Elements for Controllable Synthesis (arXiv:2404.16029)
- MoDE: CLIP Data Experts via Clustering (arXiv:2404.16030)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (arXiv:2404.16790)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (arXiv:2404.16375)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
- HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections (arXiv:2404.16845)
- BlenderAlchemy: Editing 3D Graphics with Vision-Language Models (arXiv:2404.17672)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations (arXiv:2404.17521)
- Automatic Creative Selection with Cross-Modal Matching (arXiv:2405.00029)
- What matters when building vision-language models? (arXiv:2405.02246)
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (arXiv:2405.07990)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding (arXiv:2405.08344)
- Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448)
- SpeechVerse: A Large-scale Generalizable Audio Language Model (arXiv:2405.08295)
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models (arXiv:2405.08317)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model (arXiv:2405.09215)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Toon3D: Seeing Cartoons from a New Perspective (arXiv:2405.10320)
- Octo: An Open-Source Generalist Robot Policy (arXiv:2405.12213)
- Imp: Highly Capable Large Multimodal Models for Mobile Devices (arXiv:2405.12107)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers (arXiv:2405.13195)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (arXiv:2405.15216)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (arXiv:2405.17428)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
- Dense Connector for MLLMs (arXiv:2405.13800)
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation (arXiv:2405.14598)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (arXiv:2405.20204)
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities (arXiv:2405.18669)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (arXiv:2406.00888)
- Parrot: Multilingual Visual Instruction Tuning (arXiv:2406.02539)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM (arXiv:2406.02884)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments (arXiv:2406.04151)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (arXiv:2406.01014)
- Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
- An Image is Worth 32 Tokens for Reconstruction and Generation (arXiv:2406.07550)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising (arXiv:2406.06911)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (arXiv:2406.07476)
- What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (arXiv:2406.08407)
- Needle In A Multimodal Haystack (arXiv:2406.07230)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models (arXiv:2406.11839)
- VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
- TroL: Traversal of Layers for Large Language and Vision Models (arXiv:2406.12246)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (arXiv:2406.12275)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning (arXiv:2406.12742)
- Adversarial Attacks on Multimodal Agents (arXiv:2406.12814)
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models (arXiv:2406.11230)
- Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models (arXiv:2406.12649)
- Understanding Hallucinations in Diffusion Models through Mode Interpolation (arXiv:2406.09358)
- CMC-Bench: Towards a New Paradigm of Visual Signal Compression (arXiv:2406.09356)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (arXiv:2406.09406)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding (arXiv:2406.09411)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus (arXiv:2406.08707)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (arXiv:2406.09162)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (arXiv:2406.08451)
- (title not captured) (arXiv:2406.04127)
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing (arXiv:2406.06523)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (arXiv:2406.08487)
- VCR: Visual Caption Restoration (arXiv:2406.06462)
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv:2406.09246)
- DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion (arXiv:2406.04338)
- Hibou: A Family of Foundational Vision Transformers for Pathology (arXiv:2406.05074)
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects (arXiv:2406.10210)
- XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning (arXiv:2406.08973)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (arXiv:2406.11833)
- Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models (arXiv:2406.11831)
- From Pixels to Prose: A Large Dataset of Dense Image Captions (arXiv:2406.10328)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs (arXiv:2406.14544)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (arXiv:2406.11069)
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271)
- (title not captured) (arXiv:2406.11775)
- Unifying Multimodal Retrieval via Document Screenshot Embedding (arXiv:2406.11251)
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models (arXiv:2406.14035)
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (arXiv:2406.14596)
- Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report (arXiv:2406.11403)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
- Long Context Transfer from Language to Vision (arXiv:2406.16852)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (arXiv:2406.16860)
- MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (arXiv:2406.17770)
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models (arXiv:2406.15704)
- Octo-planner: On-device Language Model for Planner-Action Agents (arXiv:2406.18082)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521)
- Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning (arXiv:2406.15334)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models (arXiv:2406.17294)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding (arXiv:2406.19389)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data (arXiv:2406.18790)
- Simulating Classroom Education with LLM-Empowered Agents (arXiv:2406.19226)
- AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models (arXiv:2406.10900)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (arXiv:2406.20095)
- EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model (arXiv:2406.20076)
- Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity (arXiv:2406.17720)
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? (arXiv:2407.01284)
- ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning (arXiv:2406.19741)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (arXiv:2407.00468)
- ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents (arXiv:2407.00114)
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study (arXiv:2407.02477)
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320)
- TokenPacker: Efficient Visual Projector for Multimodal LLM (arXiv:2407.02392)
- Unveiling Encoder-Free Vision-Language Models (arXiv:2406.11832)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams (arXiv:2406.08085)
- Granular Privacy Control for Geolocation with Vision Language Models (arXiv:2407.04952)
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (arXiv:2407.06135)
- Multi-Object Hallucination in Vision-Language Models (arXiv:2407.06192)
- Vision language models are blind (arXiv:2407.06581)
- VIMI: Grounding Video Generation through Multi-modal Instruction (arXiv:2407.06304)
- Video-to-Audio Generation with Hidden Alignment (arXiv:2407.07464)
- Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge (arXiv:2407.03958)
- Understanding Visual Feature Reliance through the Lens of Complexity (arXiv:2407.06076)
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions (arXiv:2407.06723)
- PaliGemma: A versatile 3B VLM for transfer (arXiv:2407.07726)
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models (arXiv:2407.07895)
- Do Vision and Language Models Share Concepts? A Vector Space Alignment Study (arXiv:2302.06555)
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception (arXiv:2407.08303)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective (arXiv:2407.08583)
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv:2407.07053)
- E5-V: Universal Embeddings with Multimodal Large Language Models (arXiv:2407.12580)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos (arXiv:2407.12679)
- AUITestAgent: Automatic Requirements Oriented GUI Function Testing (arXiv:2407.09018)
- ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter (arXiv:2407.11298)
- NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models (arXiv:2407.12366)
- Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study (arXiv:2406.07057)
- EVLM: An Efficient Vision-Language Model for Visual Understanding (arXiv:2407.14177)
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding (arXiv:2407.12594)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
- VideoGameBunny: Towards vision assistants for video games (arXiv:2407.15295)
- CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model (arXiv:2407.15233)
- OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person (arXiv:2407.16224)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence (arXiv:2407.16655)
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (arXiv:2407.16198)
- VILA^2: VILA Augmented VILA (arXiv:2407.17453)
- Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning (arXiv:2407.15815)
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents (arXiv:2407.17490)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache (arXiv:2407.18121)
- VSSD: Vision Mamba with Non-Casual State Space Duality (arXiv:2407.18559)
- Wolf: Captioning Everything with a World Summarization Framework (arXiv:2407.18908)
- Diffusion Feedback Helps CLIP See Better (arXiv:2407.20171)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks (arXiv:2407.19795)
- Mixture of Nested Experts: Adaptive Processing of Visual Tokens (arXiv:2407.19985)
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (arXiv:2407.21770)
- Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent (arXiv:2407.21646)
- ShieldGemma: Generative AI Content Moderation Based on Gemma (arXiv:2407.21772)
- Open-Vocabulary Audio-Visual Semantic Segmentation (arXiv:2407.21721)
- SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)
- OmniParser for Pure Vision Based GUI Agent (arXiv:2408.00203)
- Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey (arXiv:2407.21794)
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (arXiv:2408.02657)
- Language Model Can Listen While Speaking (arXiv:2408.02622)
- ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning (arXiv:2408.02210)
- Operationalizing Contextual Integrity in Privacy-Conscious Assistants (arXiv:2408.02373)
- LLaVA-OneVision: Easy Visual Task Transfer (arXiv:2408.03326)
- Diffusion Models as Data Mining Tools (arXiv:2408.02752)
- AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation (arXiv:2408.01708)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks (arXiv:2408.03615)
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling (arXiv:2408.03695)
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond (arXiv:2408.03900)
- Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches (arXiv:2408.04567)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models (arXiv:2408.04594)
- Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics (arXiv:2408.04631)
- VITA: Towards Open-Source Interactive Omni Multimodal LLM (arXiv:2408.05211)
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (arXiv:2408.04840)
- UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling (arXiv:2408.04810)
- ControlNeXt: Powerful and Efficient Control for Image and Video Generation (arXiv:2408.06070)
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (arXiv:2408.06327)
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization (arXiv:2408.05939)
- (title not captured) (arXiv:2408.07009)
- Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models (arXiv:2408.06663)
- (title not captured) (arXiv:2408.05366)
- Towards flexible perception with visual memory (arXiv:2408.08172)
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
- JPEG-LM: LLMs as Image Generators with Canonical Codec Representations (arXiv:2408.08459)
- D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning (arXiv:2408.08441)
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (arXiv:2408.10188)
- MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning (arXiv:2408.11001)
- Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data (arXiv:2408.10119)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039)
- NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency (arXiv:2408.11054)
- Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model (arXiv:2408.10764)
- Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos (arXiv:2408.10998)
- MambaEVT: Event Stream based Visual Object Tracking using State Space Model (arXiv:2408.10487)
- FocusLLM: Scaling LLM's Context by Parallel Decoding (arXiv:2408.11745)
- TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models (arXiv:2408.11318)
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models (arXiv:2408.11817)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting (arXiv:2408.11706)
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
•
2408.11475
•
Published
•
16
Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification • Paper • 2408.11237 • Published • 4
Iterative Object Count Optimization for Text-to-image Diffusion Models • Paper • 2408.11721 • Published • 4
Sapiens: Foundation for Human Vision Models • Paper • 2408.12569 • Published • 86
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation • Paper • 2408.12528 • Published • 50
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications • Paper • 2408.11878 • Published • 49
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations • Paper • 2408.12590 • Published • 33
Scalable Autoregressive Image Generation with Mamba • Paper • 2408.12245 • Published • 23
Real-Time Video Generation with Pyramid Attention Broadcast • Paper • 2408.12588 • Published • 13
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models • Paper • 2408.12114 • Published • 11
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation • Paper • 2408.09787 • Published • 6
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published • 110
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? • Paper • 2408.13257 • Published • 25
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities • Paper • 2408.13239 • Published • 10
Foundation Models for Music: A Survey • Paper • 2408.14340 • Published • 38
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! • Paper • 2408.13402 • Published • 17
TVG: A Training-free Transition Video Generation Method with Diffusion Models • Paper • 2408.13413 • Published • 13
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline • Paper • 2408.15079 • Published • 51
Law of Vision Representation in MLLMs • Paper • 2408.16357 • Published • 92
CogVLM2: Visual Language Models for Image and Video Understanding • Paper • 2408.16500 • Published • 56
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling • Paper • 2408.16532 • Published • 45
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming • Paper • 2408.16725 • Published • 50
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters • Paper • 2408.17253 • Published • 35
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering • Paper • 2408.09174 • Published • 51
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges • Paper • 2409.01071 • Published • 26
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos • Paper • 2409.02095 • Published • 33
LinFusion: 1 GPU, 1 Minute, 16K Image • Paper • 2409.02097 • Published • 31
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture • Paper • 2409.02889 • Published • 54
Attention Heads of Large Language Models: A Survey • Paper • 2409.03752 • Published • 85
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation • Paper • 2409.04410 • Published • 23
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct • Paper • 2409.05840 • Published • 45
Towards a Unified View of Preference Learning for Large Language Models: A Survey • Paper • 2409.02795 • Published • 71
POINTS: Improving Your Vision-language Model with Affordable Strategies • Paper • 2409.04828 • Published • 22
Benchmarking Chinese Knowledge Rectification in Large Language Models • Paper • 2409.05806 • Published • 14
LLaMA-Omni: Seamless Speech Interaction with Large Language Models • Paper • 2409.06666 • Published • 54
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis • Paper • 2409.06135 • Published • 14
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation • Paper • 2409.06820 • Published • 58
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis • Paper • 2409.07129 • Published • 6
PiTe: Pixel-Temporal Alignment for Large Video-Language Model • Paper • 2409.07239 • Published • 11
Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models • Paper • 2409.06277 • Published • 14
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types • Paper • 2409.09269 • Published • 7
One missing piece in Vision and Language: A Survey on Comics Understanding • Paper • 2409.09502 • Published • 23
NVLM: Open Frontier-Class Multimodal LLMs • Paper • 2409.11402 • Published • 55
OmniGen: Unified Image Generation • Paper • 2409.11340 • Published • 78
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think • Paper • 2409.11355 • Published • 26
OSV: One Step is Enough for High-Quality Image to Video Generation • Paper • 2409.11367 • Published • 12
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding • Paper • 2409.03420 • Published • 23
InstantDrag: Improving Interactivity in Drag-based Image Editing • Paper • 2409.08857 • Published • 30
AudioBERT: Audio Knowledge Augmented Language Model • Paper • 2409.08199 • Published • 4
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study • Paper • 2409.08554 • Published • 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published • 68
Qwen2.5-Coder Technical Report • Paper • 2409.12186 • Published • 117
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey • Paper • 2409.11564 • Published • 18
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models • Paper • 2409.12139 • Published • 11
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution • Paper • 2409.12961 • Published • 23
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation • Paper • 2409.12576 • Published • 14
Imagine yourself: Tuning-Free Personalized Image Generation • Paper • 2409.13346 • Published • 64
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models • Paper • 2409.13592 • Published • 45
Portrait Video Editing Empowered by Multimodal Generative Priors • Paper • 2409.13591 • Published • 15
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions • Paper • 2409.15278 • Published • 21
Phantom of Latent for Large Language and Vision Models • Paper • 2409.14713 • Published • 26
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections • Paper • 2409.14677 • Published • 14
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling • Paper • 2409.16160 • Published • 28
MonoFormer: One Transformer for Both Diffusion and Autoregression • Paper • 2409.16280 • Published • 16
Seeing Faces in Things: A Model and Dataset for Pareidolia • Paper • 2409.16143 • Published • 14
Attention Prompting on Image for Large Vision-Language Models • Paper • 2409.17143 • Published • 4
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • Paper • 2409.17146 • Published • 85
MIO: A Foundation Model on Multimodal Tokens • Paper • 2409.17692 • Published • 40