Commit 6b4575e (parent: cc62a92) by jiaqiz: Update README.md

Files changed (1): README.md (+3 -5)
README.md CHANGED

@@ -9,6 +9,8 @@ fine-tuning: true
 tags:
 - nvidia
 - llama2
+datasets:
+- Anthropic/hh-rlhf
 ---
 
 # Llama2-13B-RLHF-RM
@@ -24,8 +26,4 @@ Llama2-13B-RLHF-RM is trained with NVIDIA [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner)
 
 ## Usage:
 
-Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
-
-
-
-
+Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
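
The added Usage paragraph describes how this reward model fits into PPO training with NeMo Aligner. As a purely illustrative sketch of that role (not NeMo Aligner's actual API; the real commands are in the linked RLHF training user guide), the loop below uses placeholder callables for the actor, the reward model, and the PPO update:

```python
# Illustrative sketch only: where a reward model's scalar score enters a
# PPO-style RLHF iteration. All callables below are hypothetical placeholders,
# not NeMo Aligner functions.
from typing import Callable, List

def rlhf_ppo_step(
    prompts: List[str],
    policy_generate: Callable[[str], str],             # actor: draft a response per prompt
    reward_model_score: Callable[[str, str], float],    # e.g. a reward model like Llama2-13B-RLHF-RM
    ppo_update: Callable[[List[str], List[str], List[float]], None],
) -> float:
    """Run one PPO iteration driven by reward-model scores; returns the mean reward."""
    responses = [policy_generate(p) for p in prompts]
    # The reward model assigns a scalar helpfulness score to each (prompt, response)
    # pair; PPO pushes the actor toward higher scores, which is why a strong reward
    # model matters for limiting reward hacking.
    rewards = [reward_model_score(p, r) for p, r in zip(prompts, responses)]
    ppo_update(prompts, responses, rewards)
    return sum(rewards) / len(rewards)
```

In practice the actor, critic, and reward model are separate served models in the NeMo Aligner PPO setup; the user guide linked above covers the actual launch commands and configuration.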