Papers
arxiv:2305.14718

Improving Language Models with Advantage-based Offline Policy Gradients

Published on May 24, 2023

Abstract

Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using the LM's internal sequence-level value estimate, A-LoL filters out negative-advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable. We demonstrate the effectiveness of A-LoL and its variants on a set of four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated more safe and helpful than baselines by human evaluators. Additionally, in the remaining three tasks, A-LoL could optimize multiple distinct reward functions even when using noisy or suboptimal training data. We also release our experimental code at https://github.com/abaheti95/LoL-RL.
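To make the abstract's description more concrete, below is a minimal sketch of an advantage-based offline policy-gradient update in the spirit of A-LoL: each training sequence is treated as a single action, its advantage is the sequence reward minus the LM's own value estimate, and negative-advantage examples are filtered out before the advantage-weighted likelihood loss is computed. The function name `a_lol_style_loss`, the tensor shapes, and the exact weighting details are illustrative assumptions based only on the abstract, not the authors' implementation; see the linked GitHub repository for the official code.

```python
# Simplified sketch of an A-LoL-style offline policy-gradient loss (assumptions noted above).
import torch


def a_lol_style_loss(seq_logprobs: torch.Tensor,
                     rewards: torch.Tensor,
                     value_estimates: torch.Tensor) -> torch.Tensor:
    """seq_logprobs: log pi_theta(y|x) summed over output tokens, shape (batch,)
    rewards: sequence-level rewards from a classifier or scoring function, shape (batch,)
    value_estimates: the LM's own sequence-level value predictions, shape (batch,)
    """
    # Sequence-level advantage: how much better the data point is than the LM expected.
    advantages = rewards - value_estimates
    # Filter out negative-advantage (low-quality) data points.
    keep = (advantages > 0).float()
    # Advantage-weighted negative log-likelihood on the retained sequences.
    weighted_nll = -(advantages.detach() * keep * seq_logprobs)
    # Average over retained examples; clamp avoids division by zero if everything is filtered.
    return weighted_nll.sum() / keep.sum().clamp(min=1.0)


# Hypothetical usage with dummy tensors (illustration only):
if __name__ == "__main__":
    logp = torch.tensor([-12.3, -8.1, -20.5], requires_grad=True)
    rewards = torch.tensor([0.9, 0.2, 0.7])
    values = torch.tensor([0.5, 0.6, 0.4])
    loss = a_lol_style_loss(logp, rewards, values)
    loss.backward()
    print(loss.item())
```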

Community

Hi authors,
Interesting paper!
Are there any models on the LMSYS Chatbot Arena trained with this technique?

Also, are there any recent studies that compare this work further against PPO with larger models?

Paper author

Hi @karthik-ganesan-nexusflow ,
Thank you for taking an interest in our work. We haven't had an opportunity to test our method with large-scale datasets and models yet. I would like to extend this work into an offline + online method in my subsequent project and then systematically compare it with an online PPO baseline.
We would love to hear if you try experimenting with A-LoL; the code is available on GitHub.


Models citing this paper 4


Collections including this paper 2