MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Abstract
MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding", which evaluates models' ability to process VL sequences. Furthermore, we maintain the high quality of the evaluation samples while further expanding the size of the evaluation set. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.
Community
Code & data: https://github.com/yuweihao/MM-Vet
Online evaluator: https://huggingface.co/spaces/whyu/MM-Vet-v2_Evaluator
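For readers who want a feel for how the reported numbers are produced: the evaluator linked above grades each open-ended answer with an LLM judge that assigns a per-sample score between 0 and 1, and the reported numbers are averages of those scores scaled to 100. Below is a minimal aggregation sketch under that assumption; the file names (`mm-vet-v2.json`, `scores.json`) and the `capabilities` field are illustrative, not the benchmark's exact schema.

```python
import json
from collections import defaultdict

# Minimal sketch: aggregate per-sample scores (0-1, e.g. from an LLM grader)
# into an overall score and per-capability scores, MM-Vet style.
# File names and the "capabilities" field are assumptions for illustration.
with open("mm-vet-v2.json") as f:
    meta = json.load(f)        # {sample_id: {"question": ..., "capabilities": [...]}}
with open("scores.json") as f:
    scores = json.load(f)      # {sample_id: score in [0, 1]}

per_capability = defaultdict(list)
for sample_id, score in scores.items():
    for cap in meta[sample_id]["capabilities"]:
        per_capability[cap].append(score)

print(f"total: {100 * sum(scores.values()) / len(scores):.1f}")
for cap, vals in sorted(per_capability.items()):
    print(f"{cap}: {100 * sum(vals) / len(vals):.1f}")
```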
This example makes me doubt whether large models like GPT-4o truly have intelligence.
Q: Which iron ball will land first, A or B?
GPT-4o: Both will land at the same time.
Me: ???
It depends on your definition of intelligence. They certainly do not have the type of intelligence that you're alluding to. Nor should they. That type of intelligence will require a new groundbreaking architecture. All they do is sample from a probability distribution acquired during training. That they appear to mimic intelligence is simply down to the fact that they captured some of the "intelligence" inherent in the training data. Like a parrot, copying a person. It doesn't matter how good it is at impersonating a human, it will never understand the noises that it makes.
In the example above, during training the model learned that two items dropped will hit the ground at the same time. The image shows two items, therefore when dropped they should hit the ground at the same time. That's probability in action. It's doing what it was designed to do.
Another interesting example tests whether the model understands the "mirror" concept.
Q: How many feet do these animals have in total?
GPT-4o:
The image shows:
- Two rabbits: Each rabbit has 4 feet, so together they have 8 feet.
- Two adult chickens: Each chicken has 2 feet, so together they have 4 feet.
- One chick: The chick has 2 feet.
In total:
8 (rabbits) + 4 (adult chickens) + 2 (chick) = 14 feet
So, the animals have 14 feet in total.
GT: 10
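For what it's worth, the gap between GPT-4o's 14 and the ground truth of 10 is consistent with the mirror reading of the scene. A quick check, under the assumption (not stated in the thread) that one of the two rabbits GPT-4o counted is only a reflection:

```python
# Assumption: one "rabbit" is a mirror reflection and should not be counted.
real_rabbits, chickens, chicks = 1, 2, 1     # real animals only (assumed breakdown)
total_feet = real_rabbits * 4 + chickens * 2 + chicks * 2
print(total_feet)                            # 10, matching the ground truth
```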
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images (2024)
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs (2024)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (2024)
- MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2024)
- MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (2024)