Abstract
This paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks (2024)
- Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought (2024)
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models (2024)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning (2024)
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model (2024)
In simple language, the process is very sophisticated prompting. The target SLM generates multiple candidate reasoning trajectories by running Monte Carlo Tree Search over human-like reasoning actions, with a reward signal guiding which steps to expand and scoring each completed trajectory.
A second SLM of similar capability then acts as a discriminator: it is shown each candidate trajectory (partially masked) and asked to complete the reasoning on its own.
Candidates whose answers both models agree on are treated as mutually consistent, and the highest-scoring one among them is returned as the final solution. This lets the system converge on a more accurate answer without any fine-tuning.
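To make the agreement step concrete, here is a minimal Python sketch of that mutual-consistency selection. This is an illustration of the idea, not the authors' implementation (see the repos linked in this thread for real code): `generate_trajectories` and `discriminator_completes` are hypothetical placeholders for the target SLM's MCTS rollouts and the second SLM's completion call.

```python
# Minimal sketch of rStar-style mutual-consistency selection (not the authors' code).
# Assumption: `generate_trajectories` stands in for the target SLM's MCTS rollout phase,
# and `discriminator_completes` stands in for the second SLM that is shown a partially
# revealed trajectory and asked to finish the reasoning. Both are hypothetical callables
# you would back with your own model inference.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    steps: List[str]   # intermediate reasoning steps from the MCTS rollout
    answer: str        # final answer extracted from the last step
    reward: float      # accumulated MCTS reward for this trajectory


def select_mutually_consistent(
    question: str,
    generate_trajectories: Callable[[str], List[Trajectory]],
    discriminator_completes: Callable[[str, List[str]], str],
    mask_ratio: float = 0.5,
) -> Optional[Trajectory]:
    """Return the highest-reward trajectory whose answer the discriminator reproduces.

    For each candidate, the discriminator SLM sees the question plus only a prefix of
    the reasoning steps (the rest is masked) and completes the reasoning itself. If its
    final answer matches the candidate's answer, the two SLMs agree and the trajectory
    is kept as mutually consistent.
    """
    candidates = generate_trajectories(question)  # MCTS-generated reasoning paths
    agreed: List[Trajectory] = []

    for traj in candidates:
        keep = max(1, int(len(traj.steps) * mask_ratio))
        prefix = traj.steps[:keep]                        # reveal only a prefix of steps
        completion = discriminator_completes(question, prefix)
        if completion.strip() == traj.answer.strip():     # answers agree -> mutually consistent
            agreed.append(traj)

    if not agreed:
        return None
    return max(agreed, key=lambda t: t.reward)            # break ties by MCTS reward


if __name__ == "__main__":
    # Toy demo with stub models: the "generator" proposes two trajectories and the
    # "discriminator" always answers "4", so only the correct trajectory survives.
    stub_gen = lambda q: [
        Trajectory(steps=["2 + 2 = 5", "Answer: 5"], answer="5", reward=0.9),
        Trajectory(steps=["2 + 2 = 4", "Answer: 4"], answer="4", reward=0.7),
    ]
    stub_disc = lambda q, prefix: "4"
    best = select_mutually_consistent("What is 2 + 2?", stub_gen, stub_disc)
    print(best.answer if best else "no mutually consistent trajectory")
```

In the full method, the candidates come from MCTS rollouts over the rich set of human-like reasoning actions described in the abstract; this sketch only covers the final agreement-and-selection step.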
Can you resend the link to your GitHub code? The above link doesn't work.
I have done an open-source implementation of the technique in the optillm repo; you can see it here: https://github.com/codelion/optillm/blob/main/rstar.py