Collections
Collections including paper arxiv:2409.11340
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 98
- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
  Paper • 2406.18790 • Published • 33
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 109
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 50

- SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
  Paper • 2408.14176 • Published • 58
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 119
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 54
- OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
  Paper • 2409.01199 • Published • 10

- MambaVision: A Hybrid Mamba-Transformer Vision Backbone
  Paper • 2407.08083 • Published • 27
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 54
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
  Paper • 2408.15237 • Published • 36
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
  Paper • 2409.11355 • Published • 24