Update README.md
README.md CHANGED
@@ -9,6 +9,8 @@ fine-tuning: true
 tags:
 - nvidia
 - llama2
+datasets:
+- Anthropic/hh-rlhf
 ---
 
 # Llama2-13B-RLHF-RM
@@ -24,8 +26,4 @@ Llama2-13B-RLHF-RM is trained with NVIDIA [NeMo Aligner](https://github.com/NVID
 
 ## Usage:
 
-Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
-
-
-
-
+Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).