This is the SFT checkpoint used for the RLHFlow/Online-RLHF project.
The model is trained from meta-llama/Meta-Llama-3-8B for 1 epoch on a mixture of diverse, high-quality open-source data; the detailed training parameters are given in the report. It has not been trained with RLHF and can serve as a good starting point for RLHF research.
We use the ToRA script to evaluate GSM8K and MATH, EvalPlus for HumanEval, and lm-evaluation-harness for the other benchmarks. The model is evaluated in the zero-shot setting, so the results here may differ slightly from those reported in the technical report.
Model | Size | Method | LC AlpacaEval | MT-Bench | GSM-8K | MMLU | HumanEval | TruthfulQA | ARC | MBPP |
---|---|---|---|---|---|---|---|---|---|---|
LLaMA-3-8B-it | 8B | RS+DPO+PPO | 22.9 | 8.16 | 79.6 | 66.0 | 61.6 | 43.9 | 59.5 | 61.1 |
Ours (SFT baseline) | 8B | SFT | 10.2 | 7.69 | 74.2 | 30.0 | 64.6 | 63.4 | 53.5 | 58.6 |
Ours (Iterative RLHF) | 8B | Iterative DPO | 37.2 | 8.46 | 80.7 | 65.3 | 64.6 | 60.4 | 64.3 | 60.8 |
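To make the comparison between rows easier to read, the per-benchmark differences can be computed directly from the table. The snippet below is a minimal sketch: the scores are copied from the table above, while the names `BENCHMARKS`, `SCORES`, and `score_deltas` are illustrative and not part of the project code. Note that MT-Bench is scored on a 1 to 10 scale, so its difference is not directly comparable to the other columns.

```python
# Scores copied from the table above; the order follows the table header.
BENCHMARKS = ["LC AlpacaEval", "MT-Bench", "GSM-8K", "MMLU",
              "HumanEval", "TruthfulQA", "ARC", "MBPP"]

SCORES = {
    "LLaMA-3-8B-it":         [22.9, 8.16, 79.6, 66.0, 61.6, 43.9, 59.5, 61.1],
    "Ours (SFT baseline)":   [10.2, 7.69, 74.2, 30.0, 64.6, 63.4, 53.5, 58.6],
    "Ours (Iterative RLHF)": [37.2, 8.46, 80.7, 65.3, 64.6, 60.4, 64.3, 60.8],
}


def score_deltas(model, baseline):
    """Return {benchmark: model score minus baseline score}, rounded to 2 decimals."""
    return {
        name: round(m - b, 2)
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


if __name__ == "__main__":
    # Change from the SFT baseline to the iterative DPO model on each benchmark.
    deltas = score_deltas("Ours (Iterative RLHF)", "Ours (SFT baseline)")
    for name, delta in deltas.items():
        print(f"{name:>14}: {delta:+.2f}")
```

Passing `"LLaMA-3-8B-it"` as the baseline gives the same comparison against the instruction-tuned reference model.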
Please cite our technical report if you find our model useful for your research or product.
@misc{dong2024rlhf,
title={RLHF Workflow: From Reward Modeling to Online RLHF},
author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
year={2024},
eprint={2405.07863},
archivePrefix={arXiv},
primaryClass={cs.LG}
}