Learning Vision from Models Rivals Learning Vision from Data Paper • 2312.17742 • Published Dec 28, 2023 • 15
Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models Paper • 2312.17661 • Published Dec 29, 2023 • 13
Boosting Large Language Model for Speech Synthesis: An Empirical Study Paper • 2401.00246 • Published Dec 30, 2023 • 10
A Comprehensive Study of Knowledge Editing for Large Language Models Paper • 2401.01286 • Published Jan 2, 2024 • 16
DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 178
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers Paper • 2401.01974 • Published Jan 3, 2024 • 5
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model Paper • 2401.02330 • Published Jan 4, 2024 • 14
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models Paper • 2401.03506 • Published Jan 7, 2024 • 13
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation Paper • 2401.04092 • Published Jan 8, 2024 • 20
ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video Paper • 2401.05314 • Published Jan 10, 2024 • 9
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models Paper • 2401.05252 • Published Jan 10, 2024 • 45
Improving fine-grained understanding in image-text pre-training Paper • 2401.09865 • Published Jan 18, 2024 • 15
Rethinking FID: Towards a Better Evaluation Metric for Image Generation Paper • 2401.09603 • Published Nov 30, 2023 • 15
Understanding Video Transformers via Universal Concept Discovery Paper • 2401.10831 • Published Jan 19, 2024 • 7
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads Paper • 2401.10774 • Published Jan 19, 2024 • 53
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 24
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study Paper • 2401.12789 • Published Jan 23, 2024 • 6
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models Paper • 2401.13311 • Published Jan 24, 2024 • 10
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24, 2024 • 44
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29, 2024 • 48
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities Paper • 2402.01831 • Published Feb 2, 2024 • 13
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Paper • 2402.03161 • Published Feb 5, 2024 • 14
Rethinking Interpretability in the Era of Large Language Models Paper • 2402.01761 • Published Jan 30, 2024 • 21
CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations Paper • 2402.04236 • Published Feb 6, 2024 • 7
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models Paper • 2402.03749 • Published Feb 6, 2024 • 12
Scaling Laws for Downstream Task Performance of Large Language Models Paper • 2402.04177 • Published Feb 6, 2024 • 17
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model Paper • 2402.03766 • Published Feb 6, 2024 • 12
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models Paper • 2402.05935 • Published Feb 8, 2024 • 15
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like Paper • 2402.07383 • Published Feb 12, 2024 • 13
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12, 2024 • 12
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs Paper • 2402.07872 • Published Feb 12, 2024 • 15
Lumos: Empowering Multimodal LLMs with Scene Text Recognition Paper • 2402.08017 • Published Feb 12, 2024 • 24
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data Paper • 2402.08093 • Published Feb 12, 2024 • 54
World Model on Million-Length Video And Language With RingAttention Paper • 2402.08268 • Published Feb 13, 2024 • 36
Premise Order Matters in Reasoning with Large Language Models Paper • 2402.08939 • Published Feb 14, 2024 • 24
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16, 2024 • 29
LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing Paper • 2402.10294 • Published Feb 15, 2024 • 22
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20, 2024 • 13
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20, 2024 • 21
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting Paper • 2402.13720 • Published Feb 21, 2024 • 5
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22, 2024 • 19
Sora Generates Videos with Stunning Geometrical Consistency Paper • 2402.17403 • Published Feb 27, 2024 • 16
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper • 2402.19479 • Published Feb 29, 2024 • 32
Learning and Leveraging World Models in Visual Representation Learning Paper • 2403.00504 • Published Mar 1, 2024 • 29
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies Paper • 2403.01422 • Published Mar 3, 2024 • 26