stabilityai/stable-diffusion-3-medium-diffusers Text-to-Image • Updated Jun 19 • 299k • 355
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Paper • 2306.04387 • Published Jun 7, 2023 • 8
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition Paper • 2304.04704 • Published Apr 10, 2023
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond Paper • 2310.02071 • Published Oct 3, 2023 • 4
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding Paper • 2310.19060 • Published Oct 29, 2023
DCA: Diversified Co-Attention towards Informative Live Video Commenting Paper • 1911.02739 • Published Nov 7, 2019