arxiv:2406.11817

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Published on Jun 17

Submitted by zhangysk on Jun 21


Authors: Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Han-Sen Zhong

Abstract

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard, and the OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
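The abstract does not state the loss itself, so the following is only a minimal sketch of what a length-regularized DPO loss for a single preference pair could look like. It assumes the penalized reward has the form r(x, y) - alpha * |y|, which adds a term alpha * (|y_w| - |y_l|) to the usual DPO margin. The function name, argument names, and the default values of `beta` and `alpha` are illustrative assumptions, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lr_dpo_loss(
    logp_w: float,      # policy log-prob of the chosen response y_w
    logp_l: float,      # policy log-prob of the rejected response y_l
    ref_logp_w: float,  # reference-model log-prob of y_w
    ref_logp_l: float,  # reference-model log-prob of y_l
    len_w: int,         # token length of y_w
    len_l: int,         # token length of y_l
    beta: float = 0.1,    # assumed KL strength (hypothetical default)
    alpha: float = 0.01,  # assumed length-penalty weight (hypothetical default)
) -> float:
    # Standard DPO margin: difference of implicit rewards beta * log(pi / pi_ref).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumed length term: under a reward of the form r - alpha * |y|,
    # the margin gains alpha * (|y_w| - |y_l|). With alpha = 0 this is plain DPO.
    margin += alpha * (len_w - len_l)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# A pair whose chosen response is 100 tokens longer than the rejected one.
plain = lr_dpo_loss(-5.0, -6.0, -5.0, -6.0, len_w=200, len_l=100, alpha=0.0)
regularized = lr_dpo_loss(-5.0, -6.0, -5.0, -6.0, len_w=200, len_l=100, alpha=0.01)
```

Under this assumed form, a pair in which the chosen response is longer already receives a larger margin from the length term, so it pushes the policy less than it would under plain DPO. In an iterative setup, the same loss would be reapplied each round to fresh pairs labeled by the reward model.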


Community

zhangysk (paper submitter), Jun 21

The paper presents a case study demonstrating that iterative length-regularized DPO can enhance a 7B model to achieve a 50.5% length-controlled win rate on AlpacaEval 2.0 without substantially increasing response length.