dipta007's Collections
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Paper
•
2401.10208
•
Published
•
1
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Paper
•
2305.11172
•
Published
•
1
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Paper
•
2302.00402
•
Published
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper
•
2308.12966
•
Published
•
7
Unified Model for Image, Video, Audio and Language Tasks
Paper
•
2307.16184
•
Published
•
14
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Paper
•
2307.13721
•
Published
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Paper
•
2309.03895
•
Published
•
13
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Paper
•
2312.14238
•
Published
•
14
MMBench: Is Your Multi-modal Model an All-around Player?
Paper
•
2307.06281
•
Published
•
5
GPT4All: An Ecosystem of Open Source Compressed Language Models
Paper
•
2311.04931
•
Published
•
20
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
•
2403.05525
•
Published
•
39
nvidia/NVLM-D-72B
Image-Text-to-Text
•
Updated
•
15.8k
•
739
Qwen/Qwen2-VL-72B-Instruct-AWQ
Image-Text-to-Text
•
Updated
•
19.5k
•
35
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text
•
Updated
•
1.67M
•
840
Qwen/Qwen2-VL-72B-Instruct
Image-Text-to-Text
•
Updated
•
79.6k
•
172
HuggingFaceM4/Idefics3-8B-Llama3
Image-Text-to-Text
•
Updated
•
12.3k
•
238
mistralai/Pixtral-12B-2409
Updated
•
508
OpenGVLab/InternVL2-8B
Image-Text-to-Text
•
Updated
•
129k
•
148
OpenGVLab/InternVL2-4B
Image-Text-to-Text
•
Updated
•
35.1k
•
42
OpenGVLab/InternVL2-Llama3-76B
Image-Text-to-Text
•
Updated
•
176k
•
205