nvidia/NV-Llama2-13B-RLHF-RM

zhilinw committed on Mar 9

Commit 8acb5d9 • 1 Parent(s): 131cdc4

Update README.md

README.md +4 -1

@@ -11,6 +11,7 @@ tags:
 - llama2
 datasets:
 - Anthropic/hh-rlhf
+- nvidia/sft_datablend_v1
 ---
 # Llama2-13B-RLHF-RM
@@ -20,10 +21,12 @@ datasets:
 ## Description:
 Llama2-13B-RLHF-RM is a 13 billion parameter language model (with context of up to 4,096 tokens) used as the Reward Model in training [NV-Llama2-70B-RLHF-Chat](https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat), which achieves 7.59 on MT-Bench and demonstrates strong performance on academic benchmarks.
-Starting from [Llama2-13B base model](https://huggingface.co/meta-llama/Llama-2-13b), it is first instruction-tuned with a combination of public and proprietary data and then trained on [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) with reward modeling objective. Given a conversation with multiple turns between user and assistant, it assigns a preference score on the last assistant turn.
+Starting from [Llama2-13B base model](https://huggingface.co/meta-llama/Llama-2-13b), it is first instruction-tuned with [NVIDIA SFT Datablend v1](https://huggingface.co/datasets/nvidia/sft_datablend_v1) [^1] and then trained on [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) with reward modeling objective. Given a conversation with multiple turns between user and assistant, it assigns a preference score on the last assistant turn.
 Llama2-13B-RLHF-RM is trained with NVIDIA [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner), a scalable toolkit for performant and efficient model alignment. NeMo-Aligner is built using the [NeMo Framework](https://github.com/NVIDIA/NeMo) which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross compatible with the NeMo ecosystem, allowing for inference deployment and further customization.
+[^1]: as well as ~5k proprietary datapoints that we are unable to release due to data vendor restrictions
 ## Usage:
 Training a reward model is an essential component of Reinforcement Learning from Human Feedback (RLHF). By developing a strong reward model, we can mitigate the risks of reward hacking and ensure that the actor is incentivized to produce helpful responses. We are open-sourcing this reward model so that users can seamlessly integrate it with Proximal Policy Optimization (PPO) training using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner). For detailed instructions on how to conduct the training, please refer to our [RLHF training user guide](https://github.com/NVIDIA/NeMo-Aligner/blob/main/docs/user-guide/RLHF.rst).
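
The README says the model is trained on HH-RLHF "with reward modeling objective" but does not spell the objective out. As a rough illustration only, the sketch below assumes the commonly used pairwise (Bradley-Terry) ranking loss, where the score of the preferred response is pushed above the score of the rejected one. The function name `pairwise_rm_loss` and its arguments are hypothetical; this is not NeMo-Aligner code and may differ from what was actually used.

```python
import math


def pairwise_rm_loss(chosen_scores, rejected_scores):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs.

    Assumption: a Bradley-Terry style objective; the README does not
    confirm the exact loss used to train this reward model.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_scores, rejected_scores):
        margin = r_chosen - r_rejected
        # -log(sigmoid(margin)) == softplus(-margin), computed stably
        # so large positive or negative margins do not overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_scores)


# Hypothetical scalar scores for the last assistant turn of each conversation.
loss = pairwise_rm_loss([1.2, 0.3], [0.4, 0.9])
print(f"pairwise loss: {loss:.4f}")
```

The loss shrinks as the chosen response is scored further above the rejected one, which matches the description of a model that assigns a preference score to the last assistant turn.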