MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Abstract
The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering a different degree of information content across modalities, yielding 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the quality of the intermediate CoT reasoning of MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io
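As a rough illustration of the step-level idea described in the abstract, the sketch below scores an answer from per-step verdicts instead of a single True/False judgment. It is not the benchmark's actual scoring code: the `Step` type, the `cot_score` function, and the `alpha` weight that blends step accuracy with final-answer correctness are all hypothetical, and the step extraction and judging (done by GPT-4(V) in the paper) are assumed to have already happened.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One key reasoning step extracted from a model's answer (hypothetical type)."""
    text: str      # the extracted step
    correct: bool  # the judge's verdict for this step


def cot_score(steps: List[Step], final_correct: bool, alpha: float = 0.5) -> float:
    """Blend mean step accuracy with final-answer correctness.

    alpha is an assumed weight, not a value taken from the abstract:
    alpha = 0 reduces to plain True/False scoring of the final answer,
    alpha = 1 scores the reasoning steps only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    final = 1.0 if final_correct else 0.0
    if not steps:
        # No steps were extracted, so only the final answer can be scored.
        return final
    step_mean = sum(1.0 for s in steps if s.correct) / len(steps)
    return alpha * step_mean + (1.0 - alpha) * final


# Example: half of the reasoning is sound, but the final answer is wrong.
example = [
    Step("Identify the triangle as right-angled", True),
    Step("Read the leg lengths from the diagram", True),
    Step("Apply the Pythagorean theorem", False),
    Step("Simplify the square root", False),
]
partial_credit = cot_score(example, final_correct=False)
```

Under a scheme like this, an answer with sound intermediate reasoning but a wrong final result still earns partial credit, which a binary True/False judgment would hide.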
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning (2024)
- SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark (2024)
- BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models (2024)
- GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving (2024)
- MULTI: Multimodal Understanding Leaderboard with Text and Images (2024)
Thanks for sharing dataset! https://huggingface.co/datasets/AI4Math/MathVerse
Can't wait to see the leaderboard hosted on the Hub!
Can AI Really See Math? Exploring MathVerse and Multi-Modal Models
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/