LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? Paper • 2407.04842 • Published Jul 5 • 52
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6 • 72
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper • 2405.10300 • Published May 16 • 26
Compositional Text-to-Image Generation with Dense Blob Representations Paper • 2405.08246 • Published May 14 • 12
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion Paper • 2403.05121 • Published Mar 8 • 22
Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts Paper • 2403.08268 • Published Mar 13 • 15
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support Paper • 2401.14688 • Published Jan 26 • 13
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24 • 45
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs Paper • 2311.09257 • Published Nov 14, 2023 • 45
De-Diffusion Makes Text a Strong Cross-Modal Interface Paper • 2311.00618 • Published Nov 1, 2023 • 21
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory Paper • 2308.08089 • Published Aug 16, 2023 • 21
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning Paper • 2307.04725 • Published Jul 10, 2023 • 64
JourneyDB: A Benchmark for Generative Image Understanding Paper • 2307.00716 • Published Jul 3, 2023 • 19