BRAVE: Broadening the visual encoding of vision-language models • arXiv:2404.07204 • Published Apr 10, 2024
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data • arXiv:2404.15653 • Published Apr 24, 2024
Many-Shot In-Context Learning in Multimodal Foundation Models • arXiv:2405.09798 • Published May 16, 2024
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models • arXiv:2405.15738 • Published May 24, 2024
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • arXiv:2406.09415 • Published Jun 13, 2024