Update README.md #6
by Haoxiang-Wang · opened
README.md
CHANGED
@@ -2,6 +2,10 @@
 license: cc-by-nc-4.0
 ---
 
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+
 This reward function can be used for RLHF, including PPO, iterative SFT, iterative DPO.
 
 The license is derived from `PKU-Alignment/PKU-SafeRLHF-30K`.
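The README states that this reward function can be used for RLHF pipelines such as PPO, iterative SFT, and iterative DPO. As a minimal sketch of one such use (best-of-n selection, the scoring step behind rejection sampling / iterative SFT), the snippet below uses a stand-in `reward` function; in practice the score would come from the actual reward model (e.g. loaded via `transformers`), and both function names here are illustrative, not part of the released code.

```python
def reward(prompt: str, response: str) -> float:
    """Stand-in scorer: the real reward model returns a scalar per
    (prompt, response) pair. Here: fraction of distinct words, as a dummy."""
    words = response.split()
    return len(set(words)) / len(words) if words else 0.0

def best_of_n(prompt: str, responses: list[str]) -> str:
    """Pick the candidate the reward model scores highest.
    This selection step is how a reward model feeds iterative SFT/DPO data."""
    return max(responses, key=lambda r: reward(prompt, r))
```

With a real reward model, `reward` would tokenize the chat-formatted pair and return the model's scalar head output; the selection logic stays the same.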