Collections
Collections including paper arxiv:2409.11340

- SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
  Paper • 2408.14176 • Published • 59
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 121
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 56
- OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
  Paper • 2409.01199 • Published • 12

- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 98
- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
  Paper • 2406.18790 • Published • 33
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 117
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 50

- MambaVision: A Hybrid Mamba-Transformer Vision Backbone
  Paper • 2407.08083 • Published • 27
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 56
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
  Paper • 2408.15237 • Published • 36
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
  Paper • 2409.11355 • Published • 28

- MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
  Paper • 2406.18790 • Published • 33
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 106
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 50
- MonoFormer/MonoFormer_ImageNet_256
  Updated • 31 • 4

- SelfEval: Leveraging the discriminative nature of generative models for evaluation
  Paper • 2311.10708 • Published • 14
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 106
- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 71
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
  Paper • 2409.11355 • Published • 28

- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
  Paper • 2311.17049 • Published
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  Paper • 2405.04434 • Published • 13
- A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
  Paper • 2303.17376 • Published
- Sigmoid Loss for Language Image Pre-Training
  Paper • 2303.15343 • Published • 4

- EdgeFusion: On-Device Text-to-Image Generation
  Paper • 2404.11925 • Published • 21
- Dynamic Typography: Bringing Words to Life
  Paper • 2404.11614 • Published • 43
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
  Paper • 2404.07987 • Published • 47
- Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
  Paper • 2404.07724 • Published • 12

- LocalMamba: Visual State Space Model with Windowed Selective Scan
  Paper • 2403.09338 • Published • 7
- GiT: Towards Generalist Vision Transformer through Universal Language Interface
  Paper • 2403.09394 • Published • 25
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
  Paper • 2402.19479 • Published • 32
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 26

- FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
  Paper • 2403.06775 • Published • 3
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  Paper • 2010.11929 • Published • 6
- Data Incubation -- Synthesizing Missing Data for Handwriting Recognition
  Paper • 2110.07040 • Published • 2
- A Mixture of Expert Approach for Low-Cost Customization of Deep Neural Networks
  Paper • 1811.00056 • Published • 2