hendrydong Haoxiang-Wang committed on
Commit
883c72b
1 Parent(s): d3a830f

Update README.md (#6)


- Update README.md (10474ee8795b9a5e365c890d84d4ec978d49c9cd)


Co-authored-by: Haoxiang Wang <[email protected]>

Files changed (1)
  1. README.md +4 -0
README.md CHANGED
@@ -2,6 +2,10 @@
 license: cc-by-nc-4.0
 ---
 
+* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
+* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
+* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+
 This reward function can be used for RLHF, including PPO, iterative SFT, iterative DPO.
 
 The license is derived from `PKU-Alignment/PKU-SafeRLHF-30K`.