Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
Abstract
We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
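The pipeline described in the abstract (vision modules emit text, and a frozen LLM reasons over that text) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the module names `Tags` and `Captions`, the prompt layout, and the stub functions are all assumptions made for the example. For the actual code, see the repository linked above.

```python
from typing import Callable, Dict, List

# Assumption: a "vision module" is any callable that turns an image
# into a list of short text descriptions.
VisionModule = Callable[[object], List[str]]


def build_prompt(descriptions: Dict[str, List[str]], question: str) -> str:
    """Flatten module outputs into a plain-text prompt (hypothetical layout)."""
    lines = [f"{name}: {', '.join(items)}" for name, items in descriptions.items()]
    lines.append(f"Question: {question}")
    lines.append("Short Answer:")
    return "\n".join(lines)


def answer(
    image: object,
    question: str,
    modules: Dict[str, VisionModule],
    llm: Callable[[str], str],
) -> str:
    """Run every vision module, then let the text-only LLM answer."""
    descriptions = {name: module(image) for name, module in modules.items()}
    return llm(build_prompt(descriptions, question))


# Hypothetical stubs standing in for real vision models and a real LLM.
def stub_tagger(image: object) -> List[str]:
    return ["dog", "grass"]


def stub_captioner(image: object) -> List[str]:
    return ["a dog running on grass"]


def stub_llm(prompt: str) -> str:
    return "dog" if "dog" in prompt else "unknown"


if __name__ == "__main__":
    modules = {"Tags": stub_tagger, "Captions": stub_captioner}
    print(answer(None, "What animal is shown?", modules, stub_llm))
```

Because the LLM only ever sees text, swapping in a different off-the-shelf model means replacing the `llm` callable. Nothing in this sketch requires multimodal training.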
Community
Hey, I'm reviewing deep learning papers on Twitter daily in Hebrew under the hashtag #shorthebrewpapereviews (https://twitter.com/hashtag/shorthebrewpapereviews?src=hashtag_click). So far I've posted short reviews of deep learning papers. You are invited to follow and comment.
This paper review can be found at: https://twitter.com/MikeE_3_14/status/1675088525237051394?s=20
There are actually a lot of large language vision models out there already.
NVM this is actually pretty novel.
LENS: The Future of Computer Vision with Language Models!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/