MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • 2409.20566 • Published 2 days ago • 33
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published 7 days ago • 87
RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning Paper • 2409.14674 • Published 10 days ago • 40
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published 13 days ago • 127
OSV: One Step is Enough for High-Quality Image to Video Generation Paper • 2409.11367 • Published 15 days ago • 12
LLaMA-Omni: Seamless Speech Interaction with Large Language Models Paper • 2409.06666 • Published 22 days ago • 54
Article: Getty Images Brings High-Quality, Commercially Safe Dataset to Hugging Face By andreagagliano • 26 days ago • 15
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture Paper • 2409.02889 • Published 28 days ago • 54
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency Paper • 2409.02634 • Published 28 days ago • 85
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming Paper • 2408.16725 • Published Aug 29 • 50
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28 • 83
The Mamba in the Llama: Distilling and Accelerating Hybrid Models Paper • 2408.15237 • Published Aug 27 • 36
MobileQuant: Mobile-friendly Quantization for On-device Language Models Paper • 2408.13933 • Published Aug 25 • 13
MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement Paper • 2408.14211 • Published Aug 26 • 8
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Paper • 2408.13257 • Published Aug 23 • 25
Real-Time Video Generation with Pyramid Attention Broadcast Paper • 2408.12588 • Published Aug 22 • 13
DreamCinema: Cinematic Transfer with Free Camera and 3D Character Paper • 2408.12601 • Published Aug 22 • 28
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion Paper • 2408.04785 • Published Aug 8 • 6
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 Paper • 2408.05147 • Published Aug 9 • 36
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining Paper • 2408.02657 • Published Aug 5 • 32
Gemma 2: Improving Open Language Models at a Practical Size Paper • 2408.00118 • Published Jul 31 • 73
Tora: Trajectory-oriented Diffusion Transformer for Video Generation Paper • 2407.21705 • Published Jul 31 • 25
Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning Paper • 2407.20798 • Published Jul 30 • 23
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents Paper • 2407.18901 • Published Jul 26 • 31
Wolf: Captioning Everything with a World Summarization Framework Paper • 2407.18908 • Published Jul 26 • 30
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model Paper • 2407.16982 • Published Jul 24 • 40
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence Paper • 2407.16655 • Published Jul 23 • 28
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore Paper • 2407.12854 • Published Jul 9 • 29
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians Paper • 2407.11793 • Published Jul 16 • 3
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients Paper • 2407.08296 • Published Jul 11 • 31
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation Paper • 2407.06135 • Published Jul 8 • 20
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published Jul 4 • 35
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Paper • 2406.08085 • Published Jun 12 • 13
Revealing Fine-Grained Values and Opinions in Large Language Models Paper • 2406.19238 • Published Jun 27 • 14
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention Paper • 2407.02490 • Published Jul 2 • 23
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2 • 49
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality Paper • 2406.18462 • Published Jun 26 • 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 51
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26 • 25
A Closer Look into Mixture-of-Experts in Large Language Models Paper • 2406.18219 • Published Jun 26 • 15
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24 • 55
∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials Paper • 2406.14347 • Published Jun 20 • 99
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing Paper • 2406.10601 • Published Jun 15 • 65
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts Paper • 2406.12034 • Published Jun 17 • 13