Update README.md
README.md CHANGED
@@ -24,7 +24,29 @@ For the same prompt, a response with higher reward score has higher quality than
Llama-3.1-Nemotron-70B-Reward-HF has been converted from [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might differ slightly from [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) as evaluated in NeMo-Aligner, on which the evaluation results below are based.
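A minimal sketch of how the converted checkpoint might be used to score a (prompt, response) pair with HuggingFace Transformers; the chat-template call and the single-token logit used as the reward readout are assumptions for illustration, not details taken from this README:

```python
# Hedged sketch: assumes the reward is exposed as a logit of the first generated
# token, as is common for reward models converted from NeMo-Aligner checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# The pair to be judged: the assistant turn is the response being scored.
messages = [
    {"role": "user", "content": "How many r in strawberry?"},
    {"role": "assistant", "content": "There are 3 r's in strawberry."},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True,
    )

# Higher score = higher predicted response quality for the same prompt.
reward = out.scores[0][0][0].item()
print(reward)
```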
-Try
+Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface and simply signing up gets you 100k free API calls to this model.
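A minimal sketch of calling the hosted endpoint through its OpenAI-compatible interface; the base URL, model identifier, and the exact form of the returned score are assumptions for illustration, not details taken from this README:

```python
# Hedged sketch: the endpoint URL and model id below are assumed; check the
# build.nvidia.com page for the exact values after signing up for an API key.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # assumed NVIDIA endpoint
    api_key=os.environ["NVIDIA_API_KEY"],             # key from build.nvidia.com signup
)

# A reward model scores an existing (prompt, response) pair, so the assistant
# turn to be judged is included in the request rather than generated.
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-70b-reward",     # assumed model identifier
    messages=[
        {"role": "user", "content": "How many r in strawberry?"},
        {"role": "assistant", "content": "There are 3 r's in strawberry."},
    ],
)
print(completion.choices[0].message.content)          # reward score for the pair
```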
+
+Using this reward model for RLHF (specifically, REINFORCE), we were able to tune a Llama-3.1-70B-Instruct model to reach [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6, [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).
+
+As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
+
+See details in our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly answer the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
+
+```
+A sweet question!
+Let’s count the “R”s in “strawberry”:
+1. S
+2. T
+3. R
+4. A
+5. W
+6. B
+7. E
+8. R
+9. R
+10. Y
+There are **3 “R”s** in the word “strawberry”.
+```
## Terms of use
@@ -34,7 +56,7 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions
## RewardBench Primary Dataset LeaderBoard
-As of
+As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench among the models below, with strong performance in the Chat, Safety and Reasoning categories.

| Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
@@ -107,6 +129,16 @@ E-Mail: [Zhilin Wang](mailto:[email protected])
If you find this model useful, please cite the following works
```bibtex
+@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
+      title={HelpSteer2-Preference: Complementing Ratings with Preferences},
+      author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
+      year={2024},
+      eprint={2410.01257},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2410.01257},
+}
+
@misc{wang2024helpsteer2,
      title={HelpSteer2: Open-source dataset for training top-performing reward models},
      author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
@@ -119,6 +151,7 @@ If you find this model useful, please cite the following works
## Reference(s):
+* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [HelpSteer](https://arxiv.org/abs/2311.09528)
* [SteerLM method](https://arxiv.org/abs/2310.05344)