HumanEval-V

AI & ML interests

Visual-Centric Coding Tasks for Large Multimodal Models

HumanEval-V: A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating LMMs through Coding Tasks

📄 Paper · 🏠 Home Page · 💻 GitHub Repository · 🏆 Leaderboard · 🤗 Dataset · 🤗 Dataset Viewer

HumanEval-V is a novel, lightweight benchmark designed to evaluate the visual understanding and reasoning capabilities of Large Multimodal Models (LMMs) through coding tasks. The dataset comprises 108 entry-level Python programming challenges adapted from platforms such as CodeForces and Stack Overflow. Each task includes visual context that is indispensable to solving the problem, requiring models to perceive the image, reason about it, and generate a correct Python solution.
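
For reference, a minimal sketch of loading the benchmark with the 🤗 `datasets` library is shown below; the repository ID, split name, and field names are assumptions and should be checked against the dataset card before use.

```python
from datasets import load_dataset

# Hypothetical repository ID and split -- verify against the dataset card.
ds = load_dataset("HumanEval-V/HumanEval-V-Benchmark", split="test")

sample = ds[0]
print(sample.keys())  # inspect the available fields
# Assumed fields: a task image providing the visual context, a function
# signature to complete, and handcrafted test cases for execution-based scoring.
```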

Key features:

  • Visual coding tasks that require understanding images to solve.
  • Entry-level difficulty, making it ideal for assessing the baseline performance of foundational LMMs.
  • Handcrafted test cases that evaluate code correctness with the execution-based pass@k metric (a sketch of the estimator follows this list).
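
The following is a general sketch of the widely used unbiased pass@k estimator (introduced alongside the original HumanEval benchmark), not necessarily the exact evaluation script shipped with HumanEval-V.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per task,
    c of which pass all handcrafted test cases."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per task, 5 of them passing -> estimate pass@1 and pass@10
print(pass_at_k(20, 5, 1), pass_at_k(20, 5, 10))
```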
