RLHFlow
/

LLaMA3-iterative-DPO-final

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

Haoxiang-Wang commited on 23 days ago

Commit

8c929ad

•

1 Parent(s): 40b73bd

Update README.md

Files changed (1) hide show

README.md +4 -0

README.md CHANGED Viewed

@@ -3,6 +3,10 @@ license: llama3
 ---
 # LLaMA3-iterative-DPO-final
 ## Introduction
 We release an unofficial checkpoint of a state-of-the-art instruct model of its class, **LLaMA3-iterative-DPO-final**.
 On all three widely-used instruct model benchmarks: **Alpaca-Eval-V2**, **MT-Bench**, **Chat-Arena-Hard**, our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-sourced models (e.g., Mixtral-8x7B-it),

 ---
 # LLaMA3-iterative-DPO-final
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/Online-RLHF
 ## Introduction
 We release an unofficial checkpoint of a state-of-the-art instruct model of its class, **LLaMA3-iterative-DPO-final**.
 On all three widely-used instruct model benchmarks: **Alpaca-Eval-V2**, **MT-Bench**, **Chat-Arena-Hard**, our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-sourced models (e.g., Mixtral-8x7B-it),