VisionLM (zerozeyi's Collections)
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (arXiv:2402.04252)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv:2402.03749)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (arXiv:2402.04615)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (arXiv:2402.05008)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (arXiv:2402.05930)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (arXiv:2402.05935)
- ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling (arXiv:2402.06118)
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (arXiv:2402.07456)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs (arXiv:2402.07872)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models (arXiv:2402.07865)
- World Model on Million-Length Video And Language With RingAttention (arXiv:2402.08268)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter (arXiv:2402.10896)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling (arXiv:2402.12226)
- CoLLaVO: Crayon Large Language and Vision mOdel (arXiv:2402.11248)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning (arXiv:2402.11690)
- VideoPrism: A Foundational Visual Encoder for Video Understanding (arXiv:2402.13217)
- Video ReCap: Recursive Captioning of Hour-Long Videos (arXiv:2402.13250)
- A Touch, Vision, and Language Dataset for Multimodal Alignment (arXiv:2402.13232)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (arXiv:2402.13220)
- BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models (arXiv:2402.13577)
- PALO: A Polyglot Large Multimodal Model for 5B People (arXiv:2402.14818)
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models (arXiv:2402.14289)
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models (arXiv:2402.17177)
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv:2402.19479)
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies (arXiv:2403.01422)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (arXiv:2403.01487)
- Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (arXiv:2403.02677)
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use (arXiv:2403.02626)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets (arXiv:2403.03194)
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (arXiv:2403.03003)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611)
- MoAI: Mixture of All Intelligence for Large Language and Vision Models (arXiv:2403.07508)
- Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (arXiv:2403.07750)
- DragAnything: Motion Control for Anything using Entity Representation (arXiv:2403.07420)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (arXiv:2403.06764)
- VideoMamba: State Space Model for Efficient Video Understanding (arXiv:2403.06977)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (arXiv:2403.05530)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models (arXiv:2403.05438)
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer (arXiv:2403.10301)
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent (arXiv:2403.10517)
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (arXiv:2403.11481)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (arXiv:2403.12895)
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs (arXiv:2403.12596)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv:2403.14624)
- Can large language models explore in-context? (arXiv:2403.15371)
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series (arXiv:2403.15360)
- VidLA: Video-Language Alignment at Scale (arXiv:2403.14870)
- ViTAR: Vision Transformer with Any Resolution (arXiv:2403.18361)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (arXiv:2403.18814)
- sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
- TextCraftor: Your Text Encoder Can be Image Quality Controller (arXiv:2403.18978)
- Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models (arXiv:2404.01197)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (arXiv:2404.01258)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413)
- LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (arXiv:2404.03653)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726)
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (arXiv:2404.05674)
- Koala: Key frame-conditioned long video-LLM (arXiv:2404.04346)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- Adapting LLaMA Decoder to Vision Transformer (arXiv:2404.06773)
- BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204)
- Transferable and Principled Efficiency for Open-Vocabulary Segmentation (arXiv:2404.07448)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing (arXiv:2404.09990)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv:2404.09204)
- On Speculative Decoding for Multimodal Large Language Models (arXiv:2404.08856)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390)
- MultiBooth: Towards Generating All Your Concepts in an Image from Text (arXiv:2404.14239)
- A Multimodal Automated Interpretability Agent (arXiv:2404.14394)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning (arXiv:2404.12803)
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv:2404.13013)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (arXiv:2404.15653)
- Editable Image Elements for Controllable Synthesis (arXiv:2404.16029)
- MoDE: CLIP Data Experts via Clustering (arXiv:2404.16030)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (arXiv:2404.16790)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs (arXiv:2404.16375)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (arXiv:2404.16994)
- HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections (arXiv:2404.16845)
- BlenderAlchemy: Editing 3D Graphics with Vision-Language Models (arXiv:2404.17672)
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations (arXiv:2404.17521)
- Automatic Creative Selection with Cross-Modal Matching (arXiv:2405.00029)
- What matters when building vision-language models? (arXiv:2405.02246)
- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (arXiv:2405.07990)
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding (arXiv:2405.08344)
- Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448)
- SpeechVerse: A Large-scale Generalizable Audio Language Model (arXiv:2405.08295)
- SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models (arXiv:2405.08317)
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model (arXiv:2405.09215)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Toon3D: Seeing Cartoons from a New Perspective (arXiv:2405.10320)
- Octo: An Open-Source Generalist Robot Policy (arXiv:2405.12213)
- Imp: Highly Capable Large Multimodal Models for Mobile Devices (arXiv:2405.12107)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability (arXiv:2405.14129)
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers (arXiv:2405.13195)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (arXiv:2405.15216)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (arXiv:2405.17428)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
- Dense Connector for MLLMs (arXiv:2405.13800)
- Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation (arXiv:2405.14598)
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever (arXiv:2405.20204)
- Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities (arXiv:2405.18669)
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos (arXiv:2405.20340)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (arXiv:2405.21075)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (arXiv:2406.00888)
- Parrot: Multilingual Visual Instruction Tuning (arXiv:2406.02539)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM (arXiv:2406.02884)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments (arXiv:2406.04151)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (arXiv:2406.01014)
- Vript: A Video Is Worth Thousands of Words (arXiv:2406.06040)
- An Image is Worth 32 Tokens for Reconstruction and Generation (arXiv:2406.07550)
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising (arXiv:2406.06911)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (arXiv:2406.07476)
- What If We Recaption Billions of Web Images with LLaMA-3? (arXiv:2406.08478)
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (arXiv:2406.08407)
- Needle In A Multimodal Haystack (arXiv:2406.07230)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models (arXiv:2406.11839)
- VideoLLM-online: Online Video Large Language Model for Streaming Video (arXiv:2406.11816)
- TroL: Traversal of Layers for Large Language and Vision Models (arXiv:2406.12246)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (arXiv:2406.12275)
- Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning (arXiv:2406.12742)
- Adversarial Attacks on Multimodal Agents (arXiv:2406.12814)
- Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models (arXiv:2406.11230)
- Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models (arXiv:2406.12649)
- Understanding Hallucinations in Diffusion Models through Mode Interpolation (arXiv:2406.09358)
- CMC-Bench: Towards a New Paradigm of Visual Signal Compression (arXiv:2406.09356)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (arXiv:2406.09406)
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding (arXiv:2406.09411)
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus (arXiv:2406.08707)
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts (arXiv:2406.09162)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (arXiv:2406.08418)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (arXiv:2406.08451)
- (title not captured) (arXiv:2406.04127)
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing (arXiv:2406.06523)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (arXiv:2406.08487)
- VCR: Visual Caption Restoration (arXiv:2406.06462)
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
- OpenVLA: An Open-Source Vision-Language-Action Model (arXiv:2406.09246)
- DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion (arXiv:2406.04338)
- Hibou: A Family of Foundational Vision Transformers for Pathology (arXiv:2406.05074)
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects (arXiv:2406.10210)
- XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning (arXiv:2406.08973)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (arXiv:2406.11833)
- Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models (arXiv:2406.11831)
- From Pixels to Prose: A Large Dataset of Dense Image Captions (arXiv:2406.10328)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs (arXiv:2406.14544)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences (arXiv:2406.11069)
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens (arXiv:2406.11271)
- (title not captured) (arXiv:2406.11775)
- Unifying Multimodal Retrieval via Document Screenshot Embedding (arXiv:2406.11251)
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (arXiv:2406.14515)
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models (arXiv:2406.14035)
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (arXiv:2406.14596)
- Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report (arXiv:2406.11403)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models (arXiv:2406.16338)
- Long Context Transfer from Language to Vision (arXiv:2406.16852)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (arXiv:2406.16860)
- MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (arXiv:2406.17770)
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models (arXiv:2406.15704)
- Octo-planner: On-device Language Model for Planner-Action Agents (arXiv:2406.18082)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521)
- Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning (arXiv:2406.15334)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models (arXiv:2406.17294)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding (arXiv:2406.19389)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data (arXiv:2406.18790)
- Simulating Classroom Education with LLM-Empowered Agents (arXiv:2406.19226)
- AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models (arXiv:2406.10900)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (arXiv:2406.20095)
- EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model (arXiv:2406.20076)
- Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity (arXiv:2406.17720)
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? (arXiv:2407.01284)
- ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning (arXiv:2406.19741)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (arXiv:2407.00468)
- ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents (arXiv:2407.00114)
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study (arXiv:2407.02477)
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320)
- TokenPacker: Efficient Visual Projector for Multimodal LLM (arXiv:2407.02392)
- Unveiling Encoder-Free Vision-Language Models (arXiv:2406.11832)
- Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams (arXiv:2406.08085)
- Granular Privacy Control for Geolocation with Vision Language Models (arXiv:2407.04952)
- ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation (arXiv:2407.06135)
- Multi-Object Hallucination in Vision-Language Models (arXiv:2407.06192)
- Vision language models are blind (arXiv:2407.06581)
- VIMI: Grounding Video Generation through Multi-modal Instruction (arXiv:2407.06304)
- Video-to-Audio Generation with Hidden Alignment (arXiv:2407.07464)
- Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge (arXiv:2407.03958)
- Understanding Visual Feature Reliance through the Lens of Complexity (arXiv:2407.06076)
- Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions (arXiv:2407.06723)
- PaliGemma: A versatile 3B VLM for transfer (arXiv:2407.07726)
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models (arXiv:2407.07895)
- Do Vision and Language Models Share Concepts? A Vector Space Alignment Study (arXiv:2302.06555)
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception (arXiv:2407.08303)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective (arXiv:2407.08583)
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv:2407.07053)
- E5-V: Universal Embeddings with Multimodal Large Language Models (arXiv:2407.12580)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos (arXiv:2407.12679)
- AUITestAgent: Automatic Requirements Oriented GUI Function Testing (arXiv:2407.09018)
- ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter (arXiv:2407.11298)
- NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models (arXiv:2407.12366)
- Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study (arXiv:2406.07057)
- EVLM: An Efficient Vision-Language Model for Visual Understanding (arXiv:2407.14177)
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding (arXiv:2407.12594)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
- VideoGameBunny: Towards vision assistants for video games (arXiv:2407.15295)
- CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model (arXiv:2407.15233)
- OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person (arXiv:2407.16224)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence (arXiv:2407.16655)
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (arXiv:2407.16198)
- VILA^2: VILA Augmented VILA (arXiv:2407.17453)
- Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning (arXiv:2407.15815)
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents (arXiv:2407.17490)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache (arXiv:2407.18121)
- VSSD: Vision Mamba with Non-Casual State Space Duality (arXiv:2407.18559)
- Wolf: Captioning Everything with a World Summarization Framework (arXiv:2407.18908)
- Diffusion Feedback Helps CLIP See Better (arXiv:2407.20171)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks (arXiv:2407.19795)
- Mixture of Nested Experts: Adaptive Processing of Visual Tokens (arXiv:2407.19985)
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (arXiv:2407.21770)
- Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent (arXiv:2407.21646)
- ShieldGemma: Generative AI Content Moderation Based on Gemma (arXiv:2407.21772)
- Open-Vocabulary Audio-Visual Semantic Segmentation (arXiv:2407.21721)
- SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)
- OmniParser for Pure Vision Based GUI Agent (arXiv:2408.00203)
- Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey (arXiv:2407.21794)
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (arXiv:2408.02657)
- Language Model Can Listen While Speaking (arXiv:2408.02622)
- ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning (arXiv:2408.02210)
- Operationalizing Contextual Integrity in Privacy-Conscious Assistants (arXiv:2408.02373)
- LLaVA-OneVision: Easy Visual Task Transfer (arXiv:2408.03326)
- Diffusion Models as Data Mining Tools (arXiv:2408.02752)
- AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation (arXiv:2408.01708)
- Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks (arXiv:2408.03615)
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling (arXiv:2408.03695)
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond (arXiv:2408.03900)
- Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches (arXiv:2408.04567)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models (arXiv:2408.04594)
- Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics (arXiv:2408.04631)
- VITA: Towards Open-Source Interactive Omni Multimodal LLM (arXiv:2408.05211)
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models (arXiv:2408.04840)
- UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling (arXiv:2408.04810)
- ControlNeXt: Powerful and Efficient Control for Image and Video Generation (arXiv:2408.06070)
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (arXiv:2408.06327)
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization (arXiv:2408.05939)
- (title not captured) (arXiv:2408.07009)
- Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models (arXiv:2408.06663)
- (title not captured) (arXiv:2408.05366)
- Towards flexible perception with visual memory (arXiv:2408.08172)
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
- JPEG-LM: LLMs as Image Generators with Canonical Codec Representations (arXiv:2408.08459)
- D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning (arXiv:2408.08441)
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (arXiv:2408.10188)
- MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning (arXiv:2408.11001)
- Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data (arXiv:2408.10119)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039)
- NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency (arXiv:2408.11054)
- Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model (arXiv:2408.10764)
- Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos (arXiv:2408.10998)
- MambaEVT: Event Stream based Visual Object Tracking using State Space Model (arXiv:2408.10487)
- FocusLLM: Scaling LLM's Context by Parallel Decoding (arXiv:2408.11745)
- TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models (arXiv:2408.11318)
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models (arXiv:2408.11817)
- FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting (arXiv:2408.11706)
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
•
2408.11475
•
Published
•
16
Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification • Paper • 2408.11237 • Published • 4
Iterative Object Count Optimization for Text-to-image Diffusion Models • Paper • 2408.11721 • Published • 4
Sapiens: Foundation for Human Vision Models • Paper • 2408.12569 • Published • 86
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation • Paper • 2408.12528 • Published • 50
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications • Paper • 2408.11878 • Published • 49
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations • Paper • 2408.12590 • Published • 33
Scalable Autoregressive Image Generation with Mamba • Paper • 2408.12245 • Published • 23
Real-Time Video Generation with Pyramid Attention Broadcast • Paper • 2408.12588 • Published • 13
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models • Paper • 2408.12114 • Published • 11
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation • Paper • 2408.09787 • Published • 6
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published • 110
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? • Paper • 2408.13257 • Published • 25
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities • Paper • 2408.13239 • Published • 10
Foundation Models for Music: A Survey • Paper • 2408.14340 • Published • 38
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! • Paper • 2408.13402 • Published • 17
TVG: A Training-free Transition Video Generation Method with Diffusion Models • Paper • 2408.13413 • Published • 13
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline • Paper • 2408.15079 • Published • 51
Law of Vision Representation in MLLMs • Paper • 2408.16357 • Published • 92
CogVLM2: Visual Language Models for Image and Video Understanding • Paper • 2408.16500 • Published • 56
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling • Paper • 2408.16532 • Published • 45
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming • Paper • 2408.16725 • Published • 50
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters • Paper • 2408.17253 • Published • 35
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering • Paper • 2408.09174 • Published • 51
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges • Paper • 2409.01071 • Published • 26
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos • Paper • 2409.02095 • Published • 33
LinFusion: 1 GPU, 1 Minute, 16K Image • Paper • 2409.02097 • Published • 31
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture • Paper • 2409.02889 • Published • 54
Attention Heads of Large Language Models: A Survey • Paper • 2409.03752 • Published • 85
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation • Paper • 2409.04410 • Published • 23
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct • Paper • 2409.05840 • Published • 45
Towards a Unified View of Preference Learning for Large Language Models: A Survey • Paper • 2409.02795 • Published • 71
POINTS: Improving Your Vision-language Model with Affordable Strategies • Paper • 2409.04828 • Published • 22
Benchmarking Chinese Knowledge Rectification in Large Language Models • Paper • 2409.05806 • Published • 14
LLaMA-Omni: Seamless Speech Interaction with Large Language Models • Paper • 2409.06666 • Published • 54
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis • Paper • 2409.06135 • Published • 14
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation • Paper • 2409.06820 • Published • 58
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis • Paper • 2409.07129 • Published • 6
PiTe: Pixel-Temporal Alignment for Large Video-Language Model • Paper • 2409.07239 • Published • 11
Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models • Paper • 2409.06277 • Published • 14
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types • Paper • 2409.09269 • Published • 7
One missing piece in Vision and Language: A Survey on Comics Understanding • Paper • 2409.09502 • Published • 23
NVLM: Open Frontier-Class Multimodal LLMs • Paper • 2409.11402 • Published • 55
OmniGen: Unified Image Generation • Paper • 2409.11340 • Published • 78
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think • Paper • 2409.11355 • Published • 26
OSV: One Step is Enough for High-Quality Image to Video Generation • Paper • 2409.11367 • Published • 12
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding • Paper • 2409.03420 • Published • 23
InstantDrag: Improving Interactivity in Drag-based Image Editing • Paper • 2409.08857 • Published • 30
AudioBERT: Audio Knowledge Augmented Language Model • Paper • 2409.08199 • Published • 4
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study • Paper • 2409.08554 • Published • 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published • 68
Qwen2.5-Coder Technical Report • Paper • 2409.12186 • Published • 117
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey • Paper • 2409.11564 • Published • 18
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models • Paper • 2409.12139 • Published • 11
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution • Paper • 2409.12961 • Published • 23
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation • Paper • 2409.12576 • Published • 14
Imagine yourself: Tuning-Free Personalized Image Generation • Paper • 2409.13346 • Published • 64
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models • Paper • 2409.13592 • Published • 45
Portrait Video Editing Empowered by Multimodal Generative Priors • Paper • 2409.13591 • Published • 15
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions • Paper • 2409.15278 • Published • 21
Phantom of Latent for Large Language and Vision Models • Paper • 2409.14713 • Published • 26
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections • Paper • 2409.14677 • Published • 14
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling • Paper • 2409.16160 • Published • 28
MonoFormer: One Transformer for Both Diffusion and Autoregression • Paper • 2409.16280 • Published • 16
Seeing Faces in Things: A Model and Dataset for Pareidolia • Paper • 2409.16143 • Published • 14
Attention Prompting on Image for Large Vision-Language Models • Paper • 2409.17143 • Published • 4
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • Paper • 2409.17146 • Published • 85
MIO: A Foundation Model on Multimodal Tokens • Paper • 2409.17692 • Published • 40