Commit 6b4575e (parent: cc62a92) by jiaqiz: Update README.md

Files changed (1): README.md (+3 -5)
README.md CHANGED

@@ -9,6 +9,8 @@ fine-tuning: true
 tags:
 - nvidia
 - llama2
+datasets:
+- Anthropic/hh-rlhf
 ---
 
 # Llama2-13B-RLHF-RM
@@ -24,8 +26,4 @@ Llama2-13B-RLHF-RM is trained with NVIDIA [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner)
 
 ## Usage:
 
-Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
-
-
-
-
+Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
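
The added Usage paragraph describes how this reward model fits into PPO training with NeMo Aligner. As a purely illustrative sketch of that role (not NeMo Aligner's actual API; the real commands are in the linked RLHF training user guide), the loop below uses placeholder callables for the actor, the reward model, and the PPO update:

```python
# Illustrative sketch only: where a reward model's scalar score enters a
# PPO-style RLHF iteration. All callables below are hypothetical placeholders,
# not NeMo Aligner functions.
from typing import Callable, List

def rlhf_ppo_step(
    prompts: List[str],
    policy_generate: Callable[[str], str],             # actor: draft a response per prompt
    reward_model_score: Callable[[str, str], float],    # e.g. a reward model like Llama2-13B-RLHF-RM
    ppo_update: Callable[[List[str], List[str], List[float]], None],
) -> float:
    """Run one PPO iteration driven by reward-model scores; returns the mean reward."""
    responses = [policy_generate(p) for p in prompts]
    # The reward model assigns a scalar helpfulness score to each (prompt, response)
    # pair; PPO pushes the actor toward higher scores, which is why a strong reward
    # model matters for limiting reward hacking.
    rewards = [reward_model_score(p, r) for p, r in zip(prompts, responses)]
    ppo_update(prompts, responses, rewards)
    return sum(rewards) / len(rewards)
```

In practice the actor, critic, and reward model are separate served models in the NeMo Aligner PPO setup; the user guide linked above covers the actual launch commands and configuration.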