MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Abstract
MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding", which evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of the evaluation samples while further expanding the size of the evaluation set. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
Community
Code & data: https://github.com/yuweihao/MM-Vet
Online evaluator: https://huggingface.co/spaces/whyu/MM-Vet-v2_Evaluator
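For readers who want a feel for how the reported numbers are produced: the evaluator linked above grades each open-ended answer with an LLM judge that assigns a per-sample score between 0 and 1, and the reported numbers are averages of those scores scaled to 100. Below is a minimal aggregation sketch under that assumption; the file names (`mm-vet-v2.json`, `scores.json`) and the `capabilities` field are illustrative, not the benchmark's exact schema.

```python
import json
from collections import defaultdict

# Minimal sketch: aggregate per-sample scores (0-1, e.g. from an LLM grader)
# into an overall score and per-capability scores, MM-Vet style.
# File names and the "capabilities" field are assumptions for illustration.
with open("mm-vet-v2.json") as f:
    meta = json.load(f)        # {sample_id: {"question": ..., "capabilities": [...]}}
with open("scores.json") as f:
    scores = json.load(f)      # {sample_id: score in [0, 1]}

per_capability = defaultdict(list)
for sample_id, score in scores.items():
    for cap in meta[sample_id]["capabilities"]:
        per_capability[cap].append(score)

print(f"total: {100 * sum(scores.values()) / len(scores):.1f}")
for cap, vals in sorted(per_capability.items()):
    print(f"{cap}: {100 * sum(vals) / len(vals):.1f}")
```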
This example makes me doubt whether large models like GPT-4o truly have intelligence.
Q: Which iron ball will land first, A or B?
GPT-4o: Both will land at the same time.
Me: ???
It depends on your definition of intelligence. They certainly do not have the type of intelligence that you're alluding to. Nor should they. That type of intelligence will require a new groundbreaking architecture. All they do is sample from a probability distribution acquired during training. That they appear to mimic intelligence is simply down to the fact that they captured some of the "intelligence" inherent in the training data. Like a parrot, copying a person. It doesn't matter how good it is at impersonating a human, it will never understand the noises that it makes.
In the example above, during training the model learned that two items dropped will hit the ground at the same time. The image shows two items, therefore when dropped they should hit the ground at the same time. That's probability in action. It's doing what it was designed to do.
Another interesting example tests whether the model understands the "mirror" concept.
Q: How many feet do these animals have in total?
GPT-4o:
The image shows:
- Two rabbits: Each rabbit has 4 feet, so together they have 8 feet.
- Two adult chickens: Each chicken has 2 feet, so together they have 4 feet.
- One chick: The chick has 2 feet.
In total:
8 (rabbits) + 4 (adult chickens) + 2 (chick) = 14 feet
So, the animals have 14 feet in total.
GT: 10
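For what it's worth, the gap between GPT-4o's 14 and the ground truth of 10 is consistent with the mirror reading of the scene. A quick check, under the assumption (not stated in the thread) that one of the two rabbits GPT-4o counted is only a reflection:

```python
# Assumption: one "rabbit" is a mirror reflection and should not be counted.
real_rabbits, chickens, chicks = 1, 2, 1     # real animals only (assumed breakdown)
total_feet = real_rabbits * 4 + chickens * 2 + chicks * 2
print(total_feet)                            # 10, matching the ground truth
```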
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images (2024)
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs (2024)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (2024)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (2024)