Abstract
This paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks (2024)
- Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought (2024)
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models (2024)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning (2024)
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model (2024)
In simple language, the process is very sophisticated prompting. The target SLM generates multiple candidate reasoning trajectories by running Monte Carlo Tree Search over human-like reasoning actions, with a reward signal guiding which steps to expand and scoring each completed trajectory.
A second SLM of similar capability then acts as a discriminator: it is shown each candidate trajectory (partially masked) and asked to complete the reasoning on its own.
Candidates whose answers both models agree on are treated as mutually consistent, and the highest-scoring one among them is returned as the final solution. This lets the system converge on a more accurate answer without any fine-tuning.
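To make the agreement step concrete, here is a minimal Python sketch of that mutual-consistency selection. This is an illustration of the idea, not the authors' implementation (see the repos linked in this thread for real code): `generate_trajectories` and `discriminator_completes` are hypothetical placeholders for the target SLM's MCTS rollouts and the second SLM's completion call.

```python
# Minimal sketch of rStar-style mutual-consistency selection (not the authors' code).
# Assumption: `generate_trajectories` stands in for the target SLM's MCTS rollout phase,
# and `discriminator_completes` stands in for the second SLM that is shown a partially
# revealed trajectory and asked to finish the reasoning. Both are hypothetical callables
# you would back with your own model inference.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    steps: List[str]   # intermediate reasoning steps from the MCTS rollout
    answer: str        # final answer extracted from the last step
    reward: float      # accumulated MCTS reward for this trajectory


def select_mutually_consistent(
    question: str,
    generate_trajectories: Callable[[str], List[Trajectory]],
    discriminator_completes: Callable[[str, List[str]], str],
    mask_ratio: float = 0.5,
) -> Optional[Trajectory]:
    """Return the highest-reward trajectory whose answer the discriminator reproduces.

    For each candidate, the discriminator SLM sees the question plus only a prefix of
    the reasoning steps (the rest is masked) and completes the reasoning itself. If its
    final answer matches the candidate's answer, the two SLMs agree and the trajectory
    is kept as mutually consistent.
    """
    candidates = generate_trajectories(question)  # MCTS-generated reasoning paths
    agreed: List[Trajectory] = []

    for traj in candidates:
        keep = max(1, int(len(traj.steps) * mask_ratio))
        prefix = traj.steps[:keep]                        # reveal only a prefix of steps
        completion = discriminator_completes(question, prefix)
        if completion.strip() == traj.answer.strip():     # answers agree -> mutually consistent
            agreed.append(traj)

    if not agreed:
        return None
    return max(agreed, key=lambda t: t.reward)            # break ties by MCTS reward


if __name__ == "__main__":
    # Toy demo with stub models: the "generator" proposes two trajectories and the
    # "discriminator" always answers "4", so only the correct trajectory survives.
    stub_gen = lambda q: [
        Trajectory(steps=["2 + 2 = 5", "Answer: 5"], answer="5", reward=0.9),
        Trajectory(steps=["2 + 2 = 4", "Answer: 4"], answer="4", reward=0.7),
    ]
    stub_disc = lambda q, prefix: "4"
    best = select_mutually_consistent("What is 2 + 2?", stub_gen, stub_disc)
    print(best.answer if best else "no mutually consistent trajectory")
```

In the full method, the candidates come from MCTS rollouts over the rich set of human-like reasoning actions described in the abstract; this sketch only covers the final agreement-and-selection step.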
Can you resend the link to your GitHub code? The above link doesn't work.
I have done an open-source implementation of the technique in the optillm repo; you can see it here: https://github.com/codelion/optillm/blob/main/rstar.py