arxiv:2411.13281

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Published on Nov 20 · Submitted by teowu on Nov 21

#3 Paper of the day


Authors: Ziyang Luo, Haoning Wu

Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as multiple-choice questions in benchmarks like VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and because human annotation for video tasks is prohibitively costly and slow, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' from a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while remaining scalable. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
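For readers unfamiliar with arena-style rating, the sketch below shows how pairwise battle outcomes can be turned into ratings. It uses the textbook online Elo update, not the paper's modified ELO Rating System, whose details the abstract does not give. The K-factor of 32, the starting rating of 1000, the function names, and the `model_x`/`model_y`/`model_z` battles are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    score_a: float,          # 1.0 = A wins, 0.5 = tie, 0.0 = B wins
    k: float = 32.0,         # assumed K-factor, not taken from the paper
    initial: float = 1000.0, # assumed starting rating, not taken from the paper
) -> None:
    """Apply one standard (unmodified) Elo update in place."""
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    # The two adjustments are equal and opposite, so total rating is conserved.
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb - k * (score_a - ea)


# Hypothetical judged battles: (model A, model B, score for A).
battles: List[Tuple[str, str, float]] = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_z", 0.5),
    ("model_y", "model_z", 0.0),
]

ratings: Dict[str, float] = {}
for a, b, s in battles:
    update_elo(ratings, a, b, s)

for name, rating in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rating:.1f}")
```

In an automated arena, the `score_a` values would come from the judge model's verdict on each battle rather than from human votes.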


Community

teowu

Paper author Paper submitter about 12 hours ago

Visit our project page: https://videoautoarena.github.io

Code and dataset coming soon!

teowu

Paper author Paper submitter about 12 hours ago

We chose the top-10 models (with their smaller-size variants) on Video-MME (cutoff: 15 Oct 2024) as arena players. Their Arena Elo results are below, and they suggest a larger gap between models on user-facing video analysis than on video MCQs.

Model	Size	Frames	ELO	Win Rate (%)	ELO (8s, 15s)	ELO (15s, 60s)	ELO (180s, 600s)	ELO (900s, 3600s)
GPT-4o	-	64	1505.7	89.2	1447.9	1449.6	1575.3	1552.2
GPT-4o-mini	-	64	1323.3	76.9	1293.3	1343.3	1327.8	1349.3
Gemini-1.5-Pro	-	64	1187.0	65.1	1247.7	1171.8	1263.6	1291.6
Gemini-1.5-Flash	-	64	1149.5	62.1	1081.6	1131.3	1140.1	1260.4
Aria	8×3.5B	64	1120.0	59.5	1147.5	1273.8	1110.7	1111.4
Qwen2-VL	72B	64	886.5	35.6	985.5	928.2	829.6	826.6
Qwen2-VL	7B	64	875.6	34.9	969.3	859.3	850.3	829.2
LLaVA-Video	72B	64	836.6	30.3	796.9	850.1	827.9	782.5
LLaVA-Video	7B	64	765.6	23.5	672.4	736.1	759.1	721.8
LLaVA-OneVision	72B	64	763.7	23.1	731.5	710.6	759.3	741.8
LLaVA-OneVision	7B	64	586.5	9.9	626.7	545.8	556.3	533.2
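
One way to read the per-duration columns is to compare each model's rating on the longest videos with its rating on the shortest. The sketch below does this for a handful of rows copied from the table above. The dictionary layout and the `long_minus_short` helper are illustrative, not part of the released code. Because Elo is relative to the other arena players, a positive difference indicates a stronger relative standing on long videos, not an absolute gain.

```python
from typing import Dict, List

# Duration buckets in the order they appear in the table.
BUCKETS: List[str] = ["(8s, 15s)", "(15s, 60s)", "(180s, 600s)", "(900s, 3600s)"]

# Per-bucket Elo values copied from a subset of the table rows above.
ELO_BY_DURATION: Dict[str, List[float]] = {
    "GPT-4o": [1447.9, 1449.6, 1575.3, 1552.2],
    "Gemini-1.5-Flash": [1081.6, 1131.3, 1140.1, 1260.4],
    "Aria (8x3.5B)": [1147.5, 1273.8, 1110.7, 1111.4],
    "Qwen2-VL (72B)": [985.5, 928.2, 829.6, 826.6],
    "LLaVA-OneVision (7B)": [626.7, 545.8, 556.3, 533.2],
}


def long_minus_short(per_bucket: List[float]) -> float:
    """Elo in the longest-duration bucket minus Elo in the shortest one."""
    return per_bucket[-1] - per_bucket[0]


deltas: Dict[str, float] = {
    model: round(long_minus_short(values), 1)
    for model, values in ELO_BY_DURATION.items()
}

# Print models from largest relative gain on long videos to largest drop.
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:22s} {delta:+.1f}")
```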