Haoxiang-Wang committed on
Commit
8c929ad
1 Parent(s): 40b73bd

Update README.md

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
@@ -3,6 +3,10 @@ license: llama3
 ---
 # LLaMA3-iterative-DPO-final
 
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/Online-RLHF
+
 ## Introduction
 We release an unofficial checkpoint of a state-of-the-art instruct model of its class, **LLaMA3-iterative-DPO-final**.
 On all three widely-used instruct model benchmarks: **Alpaca-Eval-V2**, **MT-Bench**, **Chat-Arena-Hard**, our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-sourced models (e.g., Mixtral-8x7B-it),