Collections
Collections including paper arxiv:2409.18124
- Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
  Paper • 2409.18124 • Published • 31
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
  Paper • 2409.18125 • Published • 33
- Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
  Paper • 2410.11795 • Published • 16
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 85

- DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
  Paper • 2409.02095 • Published • 35
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
  Paper • 2409.01704 • Published • 81
- CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
  Paper • 2409.03643 • Published • 18
- UniDet3D: Multi-dataset Indoor 3D Object Detection
  Paper • 2409.04234 • Published • 7

- Controllable Text Generation for Large Language Models: A Survey
  Paper • 2408.12599 • Published • 62
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
  Paper • 2408.12590 • Published • 34
- Real-Time Video Generation with Pyramid Attention Broadcast
  Paper • 2408.12588 • Published • 15
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 56

- Depth Anything V2
  Paper • 2406.09414 • Published • 92
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
  Paper • 2406.12849 • Published • 49
- BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
  Paper • 2407.17952 • Published • 29
- Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
  Paper • 2409.18124 • Published • 31

- LocalMamba: Visual State Space Model with Windowed Selective Scan
  Paper • 2403.09338 • Published • 7
- GiT: Towards Generalist Vision Transformer through Universal Language Interface
  Paper • 2403.09394 • Published • 25
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
  Paper • 2402.19479 • Published • 32
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 26