---
license: llama3
---

# LLaMA3-iterative-DPO-final
* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
* **Code**: https://github.com/RLHFlow/Online-RLHF
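
The released checkpoint is a LLaMA-3-family instruct model, so prompting it follows the Llama-3 chat format. As a minimal sketch (the template below is the standard Llama-3 one, written out by hand for illustration; in practice `tokenizer.apply_chat_template` from `transformers` produces it for you):

```python
# Sketch: build a Llama-3-style chat prompt by hand. This mirrors what
# tokenizer.apply_chat_template emits for the Llama-3 family; it assumes
# this checkpoint uses the standard Llama-3 template.
def build_llama3_prompt(messages):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # A trailing assistant header cues the model to generate its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt([{"role": "user", "content": "What is DPO?"}])
print(prompt)
```

Feeding this string to the model (or, equivalently, passing the `messages` list through the tokenizer's chat template) is all that is needed to query the checkpoint in chat mode.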
## Introduction
We release an unofficial checkpoint of a state-of-the-art instruct model of its class, **LLaMA3-iterative-DPO-final**.
On all three widely used instruct-model benchmarks (**Alpaca-Eval-V2**, **MT-Bench**, and **Chat-Arena-Hard**), our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-source models (e.g., Mixtral-8x7B-it),