Papers
arxiv:2411.15296

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Published on Nov 22 · Submitted by yifanzhang114 on Nov 27

Abstract

As a prominent direction toward Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increasing attention from both industry and academia. Building upon pre-trained LLMs, this family of models develops impressive multimodal perception and reasoning capabilities, such as writing code from a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance for improving models. Distinct from the traditional train-eval-test paradigm that only targets a single task such as image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the benchmark types, organized by the capabilities they evaluate, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner, composed of judge, metric, and toolkit; 4) the outlook for the next generation of benchmarks. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.
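To make the "judge, metric, toolkit" decomposition mentioned in the abstract concrete, the sketch below shows a minimal benchmark evaluation loop that combines a rule-based metric (option matching for multiple-choice items) with a model-as-judge score for open-ended items. It is only an illustration of the general pattern, not an interface from the paper or any specific toolkit: the function names (`query_mllm`, `query_judge`) and the sample format are hypothetical placeholders.

```python
# Minimal sketch of an MLLM benchmark evaluation loop.
# query_mllm and query_judge are hypothetical callables supplied by the user,
# not APIs defined by the paper or by any existing evaluation toolkit.

import re
from typing import Callable


def extract_choice(response: str) -> str | None:
    """Pull a single option letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None


def evaluate(samples: list[dict],
             query_mllm: Callable[[str, str], str],
             query_judge: Callable[[str, str, str], float]) -> dict:
    """samples: [{"image": path, "question": str, "answer": str, "type": "mcq" | "open"}]"""
    correct, judged_scores = 0, []
    for s in samples:
        response = query_mllm(s["image"], s["question"])
        if s["type"] == "mcq":
            # Rule-based metric: exact match on the extracted option letter.
            if extract_choice(response) == s["answer"]:
                correct += 1
        else:
            # Model-as-judge: a scorer LLM rates the open-ended answer in [0, 1].
            judged_scores.append(query_judge(s["question"], s["answer"], response))
    mcq_total = sum(1 for s in samples if s["type"] == "mcq")
    return {
        "mcq_accuracy": correct / mcq_total if mcq_total else None,
        "open_ended_judge_score": (sum(judged_scores) / len(judged_scores)
                                   if judged_scores else None),
    }
```

In practice, the survey's "toolkit" aspect covers frameworks that automate exactly this kind of loop across many benchmarks and models, so that answer extraction, judging, and metric aggregation stay consistent between papers.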

Community

Paper author Paper submitter

As MLLMs become increasingly important in artificial intelligence, how to effectively evaluate their capabilities has become a key issue. Traditional evaluation methods mostly focus on a single task, while the diversity and complexity of MLLMs call for a more comprehensive evaluation framework. Several teams working on multimodal large models, including the MME team (representative works: MME, Video-MME, and MME-RealWorld), the MMBench team (MMBench, MMBench-Video, etc.), and the LLaVA team (LLaVA-NeXT, LLaVA-OV, etc.), recently put together this comprehensive evaluation survey to fill the gap and provide researchers with a systematic guide.

Models citing this paper 0

No model linking this paper

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 2