Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments Paper • 2409.05865 • Published 25 days ago • 14
Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models Paper • 2408.15915 • Published Aug 28 • 19
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20 • 56
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
TacSL: A Library for Visuotactile Sensor Simulation and Learning Paper • 2408.06506 • Published Aug 12 • 7
Can Large Language Models Understand Symbolic Graphics Programs? Paper • 2408.08313 • Published Aug 15 • 6
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper • 2408.04594 • Published Aug 8 • 14
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output Paper • 2407.03320 • Published Jul 3 • 92
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2 • 49
InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation Paper • 2407.00788 • Published Jun 30 • 22
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching Paper • 2402.14167 • Published Feb 21 • 10
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping Paper • 2402.14083 • Published Feb 21 • 43
BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay Paper • 2402.14194 • Published Feb 22 • 5
CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation Paper • 2402.14795 • Published Feb 22 • 5
Agile But Safe: Learning Collision-Free High-Speed Legged Locomotion Paper • 2401.17583 • Published Jan 31 • 25
Adaptive Mobile Manipulation for Articulated Objects In the Open World Paper • 2401.14403 • Published Jan 25 • 9
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents Paper • 2401.12963 • Published Jan 23 • 12
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 88
ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models Paper • 2403.02084 • Published Mar 4 • 14
CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion Paper • 2403.05121 • Published Mar 8 • 20
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment Paper • 2403.05135 • Published Mar 8 • 42
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22 • 19
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis Paper • 2402.14797 • Published Feb 22 • 19
GeneOH Diffusion: Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion Paper • 2402.14810 • Published Feb 22 • 8
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition Paper • 2402.15504 • Published Feb 23 • 21
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models Paper • 2402.15021 • Published Feb 22 • 12
Seamless Human Motion Composition with Blended Positional Encodings Paper • 2402.15509 • Published Feb 23 • 14
RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models Paper • 2402.12908 • Published Feb 20 • 7
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Paper • 2402.10491 • Published Feb 16 • 16
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter Paper • 2402.10896 • Published Feb 16 • 14
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models Paper • 2401.13974 • Published Jan 25 • 12
Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support Paper • 2401.14688 • Published Jan 26 • 13
AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning Paper • 2402.00769 • Published Feb 1 • 20
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29 • 48
Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance Paper • 2401.15687 • Published Jan 28 • 21
Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding Paper • 2401.15708 • Published Jan 28 • 10
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation Paper • 2401.15688 • Published Jan 28 • 11
Lumiere: A Space-Time Diffusion Model for Video Generation Paper • 2401.12945 • Published Jan 23 • 86
UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion Paper • 2401.13388 • Published Jan 24 • 10
Small Language Model Meets with Reinforced Vision Vocabulary Paper • 2401.12503 • Published Jan 23 • 31
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution Paper • 2401.10404 • Published Jan 18 • 10
Scaling Face Interaction Graph Networks to Real World Scenes Paper • 2401.11985 • Published Jan 22 • 2
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs Paper • 2401.11708 • Published Jan 22 • 29
En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data Paper • 2401.01173 • Published Jan 2 • 11
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM Paper • 2401.01256 • Published Jan 2 • 19
AGG: Amortized Generative 3D Gaussians for Single Image to 3D Paper • 2401.04099 • Published Jan 8 • 8
Q-Refine: A Perceptual Quality Refiner for AI-Generated Image Paper • 2401.01117 • Published Jan 2 • 8
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation Paper • 2401.00896 • Published Dec 31, 2023 • 14
Taming Mode Collapse in Score Distillation for Text-to-3D Generation Paper • 2401.00909 • Published Dec 31, 2023 • 9
Make-A-Character: High Quality Text-to-3D Character Generation within Minutes Paper • 2312.15430 • Published Dec 24, 2023 • 28
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise Paper • 2312.12436 • Published Dec 19, 2023 • 13
Clockwork Diffusion: Efficient Generation With Model-Step Distillation Paper • 2312.08128 • Published Dec 13, 2023 • 12