arxiv:2410.19100

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Published on Oct 24 · Submitted by ljang0 on Oct 29

Abstract

Videos are often used to learn or extract the information needed to complete tasks in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, focusing instead on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development of long-context video agents.
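
The abstract's two task families (skill retention and factual retention) imply a simple per-category success-rate evaluation. Below is a minimal, hypothetical Python sketch of how such a task record and aggregation could look; the class, field names, and file paths are illustrative assumptions, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical task record for a VideoWA-style rollout log.
# Field names are assumptions for illustration, not the benchmark's schema.
@dataclass
class VideoWATask:
    task_id: str
    category: Literal["skill_retention", "factual_retention"]
    video_path: str   # tutorial video given to the agent
    instruction: str  # natural-language task description
    success: bool     # outcome of the agent's rollout on this task


def success_rate(tasks: list[VideoWATask], category: str) -> float:
    """Fraction of tasks in the given category completed successfully."""
    subset = [t for t in tasks if t.category == category]
    return sum(t.success for t in subset) / len(subset) if subset else 0.0


if __name__ == "__main__":
    # Toy log: two factual-retention tasks (one solved) and one skill-retention task.
    log = [
        VideoWATask("t1", "factual_retention", "videos/demo_a.mp4",
                    "Find the value mentioned in the tutorial and enter it.", True),
        VideoWATask("t2", "factual_retention", "videos/demo_b.mp4",
                    "Order the item shown in the demonstration.", False),
        VideoWATask("t3", "skill_retention", "videos/demo_c.mp4",
                    "Repeat the workflow shown in the tutorial.", False),
    ]
    print(f"factual retention success: {success_rate(log, 'factual_retention'):.1%}")
```

In this framing, a reported number such as the 13.3% factual-retention success would simply be this per-category rate computed over the benchmark's factual-retention subset.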

Community

Paper submitter

New agent benchmarks for OS and website control, along with long-context capabilities, are pushing the frontier of what LLMs and VLMs can achieve. Can long-context multimodal models take in relevant video input and also perform agentic actions?

What if your web agent could watch you perform your tasks and learn to complete them according to your personal preferences and behaviors? Really excited to introduce VideoWebArena, a new benchmark for evaluating long-context multimodal agents on video-based web tasks.

📃Paper: https://arxiv.org/abs/2410.19100
🌐Website: https://videowebarena.github.io
💻Code: https://github.com/ljang0/videowebarena

