- LinFusion: 1 GPU, 1 Minute, 16K Image
  Paper • 2409.02097 • Published • 32
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
  Paper • 2409.11406 • Published • 25
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 121
- Segment Anything with Multiple Modalities
  Paper • 2408.09085 • Published • 21

Collections including paper arxiv:2411.14347

- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
  Paper • 2410.13861 • Published • 53
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
  Paper • 2411.07975 • Published • 26
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 58
- Multimodal Autoregressive Pre-training of Large Vision Encoders
  Paper • 2411.14402 • Published • 36

- SAM 2: Segment Anything in Images and Videos
  Paper • 2408.00714 • Published • 108
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space
  Paper • 2408.07416 • Published • 6
- SMITE: Segment Me In TimE
  Paper • 2410.18538 • Published • 15
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
  Paper • 2410.23287 • Published • 17

- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 100
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 85
- DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
  Paper • 2405.19707 • Published • 5
- Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
  Paper • 2410.08049 • Published • 8

- LocalMamba: Visual State Space Model with Windowed Selective Scan
  Paper • 2403.09338 • Published • 7
- GiT: Towards Generalist Vision Transformer through Universal Language Interface
  Paper • 2403.09394 • Published • 25
- Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
  Paper • 2402.19479 • Published • 32
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
  Paper • 2405.10300 • Published • 26