Update README.md
README.md
CHANGED
@@ -24,11 +24,14 @@ The Unified-Feedback dataset contains diverse preference data from prior open-so
* openbmb/UltraFeedback
* argilla/ultrafeedback-binarized-preferences-cleaned
* berkeley-nest/Nectar.

+## Training Code and Blog
+
+We have merged the training script into https://github.com/WeiXiongUST/RLHF-Reward-Modeling; it is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces the basics of reward modeling and shares lessons from our experiments.
+

## Evaluation
-We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), which demonstrates that this model is
+We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), which demonstrates that it is close to the **current best 7B reward model** and outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.

| Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
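For readers who want a feel for the trl-based training referenced in the added "Training Code and Blog" section, the sketch below shows pairwise reward-model training with trl's `RewardTrainer`. The base model, dataset id, column names, and hyperparameters are illustrative placeholders, not the settings used for this model; the actual training script is in the RLHF-Reward-Modeling repository linked above.

```python
# Minimal sketch of pairwise reward-model training with trl's RewardTrainer.
# Base model, dataset id, and column names are placeholders; see the linked
# RLHF-Reward-Modeling repository for the real script and configuration.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "mistralai/Mistral-7B-v0.1"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# A reward model is a sequence classifier with a single scalar output head.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder preference dataset with plain-text "chosen" / "rejected" columns.
raw = load_dataset("your-org/preference-data", split="train")

def tokenize_pair(example):
    # RewardTrainer expects tokenized chosen/rejected pairs under these keys.
    chosen = tokenizer(example["chosen"], truncation=True, max_length=2048)
    rejected = tokenizer(example["rejected"], truncation=True, max_length=2048)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

train_dataset = raw.map(tokenize_pair, remove_columns=raw.column_names)

trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,  # newer trl versions take processing_class instead
    args=RewardConfig(
        output_dir="reward-model",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        max_length=2048,
    ),
    train_dataset=train_dataset,
)
trainer.train()  # optimizes -log(sigmoid(r_chosen - r_rejected))
```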