arxiv:2412.06559

ProcessBench: Identifying Process Errors in Mathematical Reasoning

Published on Dec 9 · Submitted by chujiezheng on Dec 10

#2 Paper of the day


Authors:

Chujie Zheng, Zhenru Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

Abstract

As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
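
The task in the abstract (find the earliest erroneous step, or conclude that all steps are correct) can be sketched as a small scoring routine. This is a minimal illustration, not the benchmark's official evaluation code. It assumes each label is the 0-based index of the earliest erroneous step, with -1 meaning every step is correct, and it summarizes accuracy on the two subsets with their harmonic mean. The function name `score_predictions` and this encoding are assumptions made for the example.

```python
from typing import Dict, Sequence


def score_predictions(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Score earliest-error predictions.

    Assumed encoding (not stated in the abstract): each value is the 0-based
    index of the earliest erroneous step, or -1 if every step is correct.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    # Split the cases into solutions that contain an error and fully correct ones.
    erroneous = [(y, p) for y, p in zip(labels, preds) if y != -1]
    correct = [(y, p) for y, p in zip(labels, preds) if y == -1]

    # A prediction counts only if it matches the annotated location exactly.
    acc_error = sum(y == p for y, p in erroneous) / len(erroneous) if erroneous else 0.0
    acc_correct = sum(y == p for y, p in correct) / len(correct) if correct else 0.0

    # Harmonic mean of the two accuracies (an assumed summary metric).
    total = acc_error + acc_correct
    f1 = 2 * acc_error * acc_correct / total if total > 0 else 0.0
    return {"acc_error": acc_error, "acc_correct": acc_correct, "f1": f1}


# Toy example: two erroneous solutions and two fully correct ones.
result = score_predictions(labels=[2, -1, 0, -1], preds=[2, -1, 1, 3])
```

Scoring the two subsets separately keeps a model that always answers "all steps are correct" (or one that always flags an error) from looking strong on aggregate accuracy alone.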


Community

chujiezheng

Paper author Paper submitter 2 days ago

We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.

chujiezheng

Paper author 1 day ago

Data: https://huggingface.co/datasets/Qwen/ProcessBench
Evaluation code: https://github.com/QwenLM/ProcessBench

Tigerph

2 days ago

Here are some intriguing conclusions from a few experiments:

PRMs whose training data is constructed with MCTS currently may not perform as well as a PRM trained directly on the PRM800K dataset.
The more challenging the dataset, the higher the proportion of cases where the answer is correct but the process leading to it is flawed. In datasets of Omni-MATH-level difficulty, this occurs in over 50% of instances. Therefore, relying solely on answer matching as the reward rule might lead to scaling issues in the future.
Surprisingly, the reasoning model QwQ-32B-Preview, which was not designed for the critic role and has not been trained on related data, performs exceptionally well as a critic, surpassing all known PRMs to date.
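
The observation about correct answers reached through a flawed process can be made concrete with a small calculation. The sketch below assumes each record carries a boolean `final_answer_correct` and a `label` that is -1 when no step is erroneous. It also assumes the proportion is taken over solutions whose final answer is correct. The field names, the choice of denominator, and the toy records are all assumptions for illustration, not data from the benchmark.

```python
from typing import Any, Iterable, Mapping


def flawed_process_rate(records: Iterable[Mapping[str, Any]]) -> float:
    """Fraction of final-answer-correct solutions that still contain an erroneous step."""
    # Keep only solutions whose final answer matches the reference.
    correct_answer = [r for r in records if r["final_answer_correct"]]
    if not correct_answer:
        return 0.0
    # A label other than -1 means some step was annotated as erroneous.
    flawed = sum(r["label"] != -1 for r in correct_answer)
    return flawed / len(correct_answer)


# Hypothetical records, invented for this example.
toy_records = [
    {"final_answer_correct": True, "label": -1},   # right answer, sound process
    {"final_answer_correct": True, "label": 3},    # right answer, flawed step 3
    {"final_answer_correct": True, "label": 0},    # right answer, flawed step 0
    {"final_answer_correct": False, "label": 1},   # wrong answer, excluded
]
rate = flawed_process_rate(toy_records)
```

A reward rule based only on answer matching would score the second and third toy records as fully correct, which is the failure mode the comment warns about.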