arxiv:2407.06581

Vision language models are blind

Published on Jul 9
· Submitted by taesiri on Jul 10
#1 Paper of the day

Abstract

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks absurdly easy to humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry, and, at worst, like that of an intelligent person who is blind making educated guesses. Code is available at: https://vlmsareblind.github.io/

Community

Paper author Paper submitter

This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?

Wonderful! I'm so glad to see these flaws being pointed out in a paper! Thank you for your work on this!

I honestly find such clickbait titles on papers quite cringe, particularly when models like Claude 3.5 Sonnet perform much better than random on almost all of the tests.

Also, the comparison between AI vision and myopia makes no sense, as these images are not evaluating eyesight but the abstract capabilities of the models.

·
Paper author

Thank you for your feedback!

I agree that blindness and myopia are terms defined for human vision. How AIs "see" an image is different from how humans see.

FYI, Dhruv Batra recently also called VLMs "nearly blind":
https://x.com/DhruvBatraDB/status/1778447178262040850

Perhaps we should all use "AI blindness" to avoid misinterpretation.

Interesting read;
could this, in part, account for the substandard lip-reading capability of LLMs when applied to speech recognition...?

What's the state of the art in accurate lip-reading LLM applications? Anyone?

Regarding C1, it would be nice if more than 150 samples had been used, since they are pretty easy to generate anyway; something like 1k images per model would better filter out statistical noise. Otherwise, a difference of 1-3 points can just as well be random fluctuation (specifically looking at Gemini-1.5 vs. Sonnet-3 in C2).
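As a rough illustration of this point, a minimal back-of-the-envelope sketch, assuming per-image accuracy behaves like a simple binomial proportion (the 75% accuracy figure below is illustrative, not taken from the paper):

```python
import math

def accuracy_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for an
    observed accuracy p measured on n independent samples."""
    return z * math.sqrt(p * (1 - p) / n) * 100

# With 150 samples, an observed ~75% accuracy carries roughly a +/-7-point
# margin of error, so a 1-3 point gap between two models is well within noise.
print(f"n=150:  +/-{accuracy_margin(0.75, 150):.1f} points")
# With ~1,000 samples the margin shrinks to under +/-3 points.
print(f"n=1000: +/-{accuracy_margin(0.75, 1000):.1f} points")
```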

·
Paper author

Thank you for your feedback!

We're re-generating the images (with a larger sample size and less ambiguity) for the "counting two-line intersections" task in light of your suggestions and will update the paper.
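For readers curious what generating such samples with exact ground-truth labels might look like, here is a minimal hypothetical sketch for the simplest two-segment case. It is not the generation code used in the paper (that is linked from https://vlmsareblind.github.io/), and it ignores degenerate collinear configurations:

```python
import random

# Hypothetical sketch: draw two random segments and compute the ground-truth
# label with the standard orientation (cross-product sign) test. Collinear
# cases are ignored; they occur with probability zero for random floats.
Point = tuple[float, float]

def orientation(a: Point, b: Point, c: Point) -> int:
    """Sign of the cross product (b - a) x (c - a)."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)

def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segment p1-p2 properly crosses segment q1-q2."""
    return (orientation(p1, p2, q1) != orientation(p1, p2, q2)
            and orientation(q1, q2, p1) != orientation(q1, q2, p2))

def random_sample() -> tuple[list[Point], int]:
    """Return four random endpoints plus the ground-truth intersection count."""
    pts = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(4)]
    return pts, int(segments_cross(pts[0], pts[1], pts[2], pts[3]))

points, n_intersections = random_sample()
print(points, n_intersections)
```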

Wow, so CAPTCHAs still have some chance against AIs... This is good.
Overlapping circles, everyone! :)

This is very interesting. I recently wrote a short post on my thoughts about the comparison between VLMs and how the brain works. My conclusion was that since these models lack the recurrence and bidirectionality that the brain has, the processing of VLMs is similar to the feedforward pre-attentive processing that has also been studied in the neuroscience and psychology literature. Another way to think of it is that VLMs currently have to produce encodings for any relevant task, while our brains can interact with the visual hierarchy to produce "encodings" that are relevant for the task at hand. https://medium.com/towards-data-science/clip-llava-and-the-brain-2073dfb33d7e

I've tried some images from the vision sciences on VLMs and found that they do struggle on some of the images, but not as reliably as in your results.

·
Paper author

Thank you for your article! Interesting!

Your feedback/recurrence point is indeed very similar to my hypothesis here:
https://x.com/anh_ng8/status/1813311161754144905

IMO, a high-level problem is that the granularity of the extracted visual representations, or the model's visual attention, should depend on the prompt. Yet most open-source models first extract visual representations without using the prompt and only then fuse them with the text tokens.
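To make that concrete, here is a minimal sketch of that prompt-independent data flow; the module names and sizes are illustrative placeholders, not any particular model's architecture or API:

```python
import torch
import torch.nn as nn

class NaiveVLM(nn.Module):
    """Toy illustration: visual tokens are computed before the prompt is seen."""

    def __init__(self, vision_encoder, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a frozen ViT-style encoder
        self.projector = projector            # maps patch features to the LLM width
        self.llm = llm                        # decoder-only language model backbone

    def forward(self, image_feats, text_embeds):
        # 1) Visual tokens are extracted *without* ever seeing the prompt,
        #    so their granularity cannot adapt to the question being asked.
        visual_tokens = self.projector(self.vision_encoder(image_feats))
        # 2) Only afterwards are these fixed visual tokens concatenated with
        #    the text tokens and handed to the language model.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))

# Tiny stand-in modules just to show the data flow end to end.
model = NaiveVLM(
    vision_encoder=nn.Linear(32, 64),   # "patch features" -> vision features
    projector=nn.Linear(64, 128),       # vision features  -> LLM embedding width
    llm=nn.Linear(128, 128),            # placeholder for the language model
)
image_feats = torch.randn(1, 16, 32)    # batch with 16 "patches"
text_embeds = torch.randn(1, 8, 128)    # 8 prompt tokens already embedded
print(model(image_feats, text_embeds).shape)  # torch.Size([1, 24, 128])
```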

Very interesting. This article profoundly counters the fervent arms race of multimodal model development. What kind of multimodal model do we truly need?

Recently, we also published a paper (Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model) addressing similar issues. We proposed the concept of abstract images: while current multimodal models perform well on conventional semantically related images, their understanding of abstract images, such as clocks, maps, layouts, and flowcharts, remains very rudimentary. Therefore, we constructed a large abstract image benchmark through self-instruct and code, and evaluated the current multimodal models. Our results are similar to those in this article, showing that even the most advanced multimodal models fail at some very simple tasks.

https://arxiv.org/abs/2407.07053
Code: https://github.com/zwq2018/Multi-modal-Self-instruct
Our Leaderboard: https://multi-modal-self-instruct.github.io/
Our Dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct

·
Paper author

Thank you for sharing!

This is very interesting and relevant work! :)

Interesting paper. VLMs have a long way to go and I'm glad there's on-the-record research documenting this now.

Unfortunately, it is TOTALLY unnecessary to use real-life human disabilities to critique a VLM negatively - it doesn't add any descriptive power and reinforces ableist stereotypes.

·
Paper author

Thank you for your feedback!

We had simply thought human vision <> computer vision, and human blindness <> computer blindness.
Regardless, we've now removed many mentions of "myopia" and "blindness" from the paper.

