arxiv:2411.13281

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Published on Nov 20
· Submitted by teowu on Nov 21
#3 Paper of the day
Abstract

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods such as the multiple-choice questions used in benchmarks like VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by the LMSYS Chatbot Arena framework and designed to automatically assess LMMs' video analysis abilities. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework that incorporates a modified Elo rating system for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a gold standard from a carefully curated subset of human annotations, demonstrating that our arena aligns strongly with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy that progressively increases question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, in which human annotators label winners in a subset of VideoAutoArena battles and GPT-4o is used as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
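As background for the arena-style scoring mentioned above, the sketch below shows the standard Elo update for a single pairwise battle. It is only an illustration: the paper uses a modified Elo rating system whose details are given in the paper, and the function names here are purely illustrative rather than taken from any released code.

```python
# A minimal sketch of a standard Elo update for one pairwise battle,
# included only as background: the paper uses a *modified* Elo rating
# system, and every name below is illustrative.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a 1500-rated model beats a 1300-rated model.
print(elo_update(1500.0, 1300.0, score_a=1.0))  # winner gains ~8 points, loser drops ~8
```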

Community

Paper author Paper submitter

Visit our project page at https://videoautoarena.github.io!

Code and dataset coming soon!

Paper author Paper submitter

We choose the top-10 models on Video-MME (cutoff: 15 Oct 2024), together with their smaller-size variants, as arena players. Their Arena Elo results are shown below; the last four columns break Elo down by video duration group. The results suggest a larger gap between models on user-facing video analysis than on video MCQs.

| Models | Size | Frames | ELO | Win Rate (%) | ELO (8s-15s) | ELO (15s-60s) | ELO (180s-600s) | ELO (900s-3600s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | - | 64 | 1505.7 | 89.2 | 1447.9 | 1449.6 | 1575.3 | 1552.2 |
| GPT-4o-mini | - | 64 | 1323.3 | 76.9 | 1293.3 | 1343.3 | 1327.8 | 1349.3 |
| Gemini-1.5-Pro | - | 64 | 1187.0 | 65.1 | 1247.7 | 1171.8 | 1263.6 | 1291.6 |
| Gemini-1.5-Flash | - | 64 | 1149.5 | 62.1 | 1081.6 | 1131.3 | 1140.1 | 1260.4 |
| Aria | 8×3.5B | 64 | 1120.0 | 59.5 | 1147.5 | 1273.8 | 1110.7 | 1111.4 |
| Qwen2-VL | 72B | 64 | 886.5 | 35.6 | 985.5 | 928.2 | 829.6 | 826.6 |
| Qwen2-VL | 7B | 64 | 875.6 | 34.9 | 969.3 | 859.3 | 850.3 | 829.2 |
| LLaVA-Video | 72B | 64 | 836.6 | 30.3 | 796.9 | 850.1 | 827.9 | 782.5 |
| LLaVA-Video | 7B | 64 | 765.6 | 23.5 | 672.4 | 736.1 | 759.1 | 721.8 |
| LLaVA-OneVision | 72B | 64 | 763.7 | 23.1 | 731.5 | 710.6 | 759.3 | 741.8 |
| LLaVA-OneVision | 7B | 64 | 586.5 | 9.9 | 626.7 | 545.8 | 556.3 | 533.2 |
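To give a rough sense of what these gaps mean, the snippet below converts Elo differences from the table into expected win probabilities under the standard Elo formula. The arena's modified rating system may not map exactly to these numbers, so treat them as an illustration only.

```python
# Rough reading of the Elo gaps in the table above, using the standard
# Elo win-probability formula; illustration only.

def win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(f"GPT-4o vs GPT-4o-mini:        {win_prob(1505.7, 1323.3):.2f}")   # ~0.74
print(f"GPT-4o vs LLaVA-OneVision-7B: {win_prob(1505.7, 586.5):.3f}")    # ~0.995
```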
