Papers
arxiv:2405.02246

What matters when building vision-language models?

Published on May 3 · Submitted by akhaliq on May 14
#1 Paper of the day

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Community

Here's a plain-English summary of the paper. Feedback welcome!

https://www.aimodels.fyi/papers/arxiv/what-matters-when-building-vision-language-models

What Matters Most in Vision-Language Models?

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix
I'm looking for feedback on my plans for a CLIP-like training dataset that pairs each image with a longer JSON annotation instead of a short caption. My hypothesis is that a model can extract more from the JSON fields during pretraining than it can from a caption. See an example image and JSON from my prototype:
