Commit 883c72b
Parent(s): d3a830f
Update README.md (#6)
- Update README.md (10474ee8795b9a5e365c890d84d4ec978d49c9cd)
Co-authored-by: Haoxiang Wang <[email protected]>
README.md CHANGED
@@ -2,6 +2,10 @@
 license: cc-by-nc-4.0
 ---
 
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+
 This reward function can be used for RLHF, including PPO, iterative SFT, iterative DPO.
 
 The license is derived from `PKU-Alignment/PKU-SafeRLHF-30K`.
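As context for the README's claim that this reward function can be used for RLHF (PPO, iterative SFT, iterative DPO), below is a minimal sketch of scoring one conversation with a sequence-classification reward model via `transformers`. The model id is a hypothetical placeholder, not this repository's actual id, and the exact loading details may differ from the authors' pipeline.

```python
# Minimal sketch: score a (prompt, response) pair with a reward model.
# "your-org/your-reward-model" is a placeholder, NOT this repo's model id.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/your-reward-model"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
model.eval()

chat = [
    {"role": "user", "content": "How do I stay safe online?"},
    {"role": "assistant", "content": "Use strong, unique passwords and enable 2FA."},
]

# Format the conversation with the model's chat template, then score it.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.4f}")
```

In an RLHF loop this scalar would serve as the reward signal: PPO maximizes it directly, while iterative SFT/DPO use it to rank or filter sampled responses before the next training round.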