OminiControl: Minimal and Universal Control for Diffusion Transformer Paper • 2411.15098 • Published 6 days ago • 38
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published 8 days ago • 17
Allegro: Open the Black Box of Commercial-Level Video Generation Model Paper • 2410.15458 • Published Oct 20 • 40
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8 • 107
Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison Paper • 1910.11006 • Published Oct 24, 2019
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning Paper • 2311.18799 • Published Nov 30, 2023 • 1
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions Paper • 2401.01827 • Published Jan 3 • 15
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Paper • 2201.12086 • Published Jan 28, 2022 • 3
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Paper • 2301.12597 • Published Jan 30, 2023 • 1
Align and Prompt: Video-and-Language Pre-training with Entity Prompts Paper • 2112.09583 • Published Dec 17, 2021