Commit 3651a22 by Ray2333 (parent: fd65808)

Update README.md

Files changed (1):
  1. README.md +5 -2
README.md CHANGED
@@ -24,11 +24,14 @@ The Unified-Feedback dataset contains diverse preference data from prior open-source
 * openbmb/UltraFeedback
 * argilla/ultrafeedback-binarized-preferences-cleaned
 * berkeley-nest/Nectar.
-
 
+## Training Code and Blog
+
+We merged the training script into https://github.com/WeiXiongUST/RLHF-Reward-Modeling, which is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces background knowledge on reward modeling and shares practical experience from our experiments.
+
 
 ## Evaluation
-We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), which demonstrates that this model is the **current best 7B reward model** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.
+We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), which shows that it is close to the **current best 7B reward model** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.
 
 | Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
 |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
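
The added "Training Code and Blog" section points to a trl-based training script. The sketch below shows how preference-pair reward modeling is typically set up with trl's `RewardTrainer`; it is a minimal illustration, not the authors' script from the linked repository. The base model, dataset, column names, and hyperparameters are assumptions, and the exact `trl` API varies somewhat across versions.

```python
# Minimal sketch of trl-style reward modeling (NOT the authors' exact script).
# Assumptions: a 7B instruct backbone, a preference dataset with plain-text
# "chosen"/"rejected" columns, and a recent trl version exposing RewardConfig.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder preference data; Unified-Feedback uses its own schema, so in
# practice a small mapping step to "chosen"/"rejected" text columns is needed.
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:1%]")

def tokenize_pair(example):
    # RewardTrainer expects tokenized chosen/rejected pairs under these keys.
    chosen = tokenizer(example["chosen"], truncation=True, max_length=1024)
    rejected = tokenizer(example["rejected"], truncation=True, max_length=1024)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pair)

config = RewardConfig(
    output_dir="reward-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    max_length=1024,
)
trainer = RewardTrainer(model=model, args=config, tokenizer=tokenizer, train_dataset=dataset)
trainer.train()
```

The actual script in the linked repository may differ (e.g. bf16, gradient checkpointing, or a different data schema), so treat this only as a starting point.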
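
For the Evaluation section, scoring a (prompt, response) pair with a sequence-classification reward model generally looks like the sketch below. The model ID is a placeholder and the prompt formatting via a chat template is an assumption; check the model card for the intended usage.

```python
# Minimal sketch of scoring one (prompt, response) pair with a reward model.
# Assumptions: the checkpoint loads via AutoModelForSequenceClassification and
# the tokenizer ships a chat template; the model ID below is a placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<this-reward-model>"  # substitute the actual Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Explain why the sky is blue."
response = "Shorter (blue) wavelengths of sunlight scatter off air molecules more strongly than longer ones."

# Format the conversation the same way the model saw it during training.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}],
    tokenize=False,
)
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()  # higher = more preferred response
print(f"reward score: {score:.3f}")
```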