arxiv:2408.00765

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

Published on Aug 1
· Submitted by whyu on Aug 2

Abstract

MM-Vet, with open-ended vision-language questions targeting the evaluation of integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability called "image-text sequence understanding" that evaluates models' ability to process VL sequences. We also expand the evaluation set while maintaining the high quality of its samples. Benchmarking large multimodal models with MM-Vet v2, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
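The overall scores quoted above can be understood as simple aggregates of per-sample judge grades. The sketch below assumes an MM-Vet-style protocol in which an LLM judge assigns each open-ended answer a score in [0, 1]; the function name and the sample numbers are illustrative, not taken from the paper's codebase.

```python
# Hypothetical sketch of MM-Vet-style score aggregation: each open-ended
# answer receives a judge grade in [0, 1]; the benchmark score is the
# mean grade scaled to 0-100.
def benchmark_score(per_sample_scores):
    """Average per-sample judge scores and scale to a 0-100 benchmark score."""
    return round(100 * sum(per_sample_scores) / len(per_sample_scores), 1)

# Illustrative per-sample grades only (not the paper's actual data).
sample_grades = [0.7, 0.8, 0.65, 0.75]
print(benchmark_score(sample_grades))  # -> 72.5
```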

Community

Paper author Paper submitter

This example makes me doubt whether large models like GPT-4o truly have intelligence.

Q: Which iron ball will land first, A or B?
GPT-4o: Both will land at the same time.
I: ???

iron_balls.jpg


It depends on your definition of intelligence. They certainly do not have the type of intelligence that you're alluding to. Nor should they. That type of intelligence will require a new groundbreaking architecture. All they do is sample from a probability distribution acquired during training. That they appear to mimic intelligence is simply down to the fact that they captured some of the "intelligence" inherent in the training data. Like a parrot, copying a person. It doesn't matter how good it is at impersonating a human, it will never understand the noises that it makes.

In the example above, during training the model learned that two items dropped will hit the ground at the same time. The image shows two items, therefore when dropped they should hit the ground at the same time. That's probability in action. It's doing what it was designed to do.
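The "sampling from a probability distribution" point can be made concrete with a minimal next-token sketch. The toy vocabulary and logits below are invented for illustration; they are not from any real model.

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution
    # (subtract the max for numerical stability).
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy continuations of "two dropped objects land at the ...".
# Training data makes "same time" by far the most likely token sequence.
vocab = ["same time", "different times", "random order"]
logits = [4.0, 1.0, 0.5]
probs = softmax(logits)

# Sampling favors the high-probability continuation regardless of
# whether it is physically correct for the specific image at hand.
random.seed(0)
choice = random.choices(vocab, weights=probs)[0]
```

This is the comment's argument in miniature: the model emits the statistically dominant answer learned from text, not a physical analysis of the scene.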

Paper author Paper submitter
edited Aug 5

Another interesting example is to test whether the model understands the "mirror" concept.

v2_500_0.jpg

Q: How many feet do these animals have in total?
GPT-4o:
The image shows:

  • Two rabbits: Each rabbit has 4 feet, so together they have 8 feet.
  • Two adult chickens: Each chicken has 2 feet, so together they have 4 feet.
  • One chick: The chick has 2 feet.

In total:

8 (rabbits) + 4 (adult chickens) + 2 (chick) = 14 feet

So, the animals have 14 feet in total.

GT (ground truth): 10 — some of the animals are mirror reflections and should not be counted.
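The miscount boils down to summing feet over every apparent animal instead of only the real ones. The sketch below uses one assumed scene composition consistent with a ground truth of 10 (one rabbit being a reflection) — the thread does not spell out the exact composition, so treat it as hypothetical.

```python
# Feet per animal kind.
FEET = {"rabbit": 4, "chicken": 2, "chick": 2}

def total_feet(animals):
    """Sum feet over animals, skipping mirror reflections."""
    return sum(FEET[kind] for kind, is_reflection in animals if not is_reflection)

# Assumed reading of the scene (hypothetical): one rabbit is a reflection.
scene = [
    ("rabbit", False), ("rabbit", True),
    ("chicken", False), ("chicken", False),
    ("chick", False),
]

print(total_feet(scene))                           # excludes reflections -> 10
print(total_feet([(k, False) for k, _ in scene]))  # GPT-4o's naive count -> 14
```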


