---
license: mit
datasets:
- llm-blender/Unified-Feedback
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

## Introduction

This reward model is finetuned from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on the [llm-blender/Unified-Feedback](https://huggingface.co/datasets/llm-blender/Unified-Feedback) dataset. It achieves an accuracy of **0.7740** on the test sets, making it a good proxy reward model for human preferences that can be used to align LLMs.

The Unified-Feedback dataset contains diverse preference data from prior open-source datasets, including:
* openai/summarize_from_feedback
* openai/webgpt_comparisons
* Dahoas/instruct-synthetic-prompt-responses
* Anthropic/hh-rlhf
* lmsys/chatbot_arena_conversations
* openbmb/UltraFeedback
* argilla/ultrafeedback-binarized-preferences-cleaned
* berkeley-nest/Nectar

## Training Code and Blog

The training script has been merged into https://github.com/WeiXiongUST/RLHF-Reward-Modeling, which is based on the [trl](https://github.com/huggingface/trl) package. In addition, this [blog](https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=4) introduces basic background on reward modeling and shares practical training experience.

## Evaluation

We evaluate this reward model on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench). The results show that the model is close to the **current best reward model** while being only 7B, and that it outperforms prior SOTA reward models such as openbmb/UltraRM-13b and berkeley-nest/Starling-RM-7B-alpha.

| Model | Average | Chat | Chat Hard | Safety | Reasoning | Prior Sets |
|:------|:-------:|:----:|:---------:|:------:|:---------:|:----------:|
| berkeley-nest/Starling-RM-34B (34B) | 81.5 | 96.9 | 59 | 89.9 | 90.3 | 71.4 |
| **Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback** (Ours, 7B) | 78.75 | 97.84 | 52.85 | 85.94 | 87.02 | 73.92 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 | 68.6 |
| openbmb/UltraRM-13b (13B) | 71.3 | 96.1 | 55.3 | 45.8 | 82 | 77.2 |
| IDEA-CCNL/Ziya-LLaMA-7B-Reward (7B) | 66 | 88 | 41.3 | 62.5 | 73.7 | 64.6 |
| OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 (1.4B) | 65.1 | 88.5 | 47.9 | 62.1 | 61.4 | 65.8 |
| OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1 (7B) | 64 | 94.4 | 36.6 | 59.4 | 70 | 59.4 |
| llm-blender/PairRM-hf (0.4B) | 60.9 | 90.2 | 53 | 31.5 | 60 | 69.6 |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a sequence classifier with a single scalar head).
tokenizer = AutoTokenizer.from_pretrained('Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback',
    num_labels=1, torch_dtype=torch.float16,
    device_map=0,
)

message = [
    {'role': 'user', 'content': "I'm going to go out to a movie, but I need someone to chat with my daughter and pretend to be me while she's home alone. But I can't do that while I'm at the movie. Can you help by impersonating me by chat with her?"},
    {'role': 'assistant', 'content': "Sorry, I'm not comfortable impersonating you in that way. I'm not willing to behave so dishonestly. Maybe you can just find a way to bring her to the movie, or you can find a babysitter?"},
]

# Apply the Mistral chat template. The result looks like:
# " [INST] I'm going to go out to a movie, ... Can you help by impersonating me by chat with her? [/INST]Sorry, I'm not comfortable impersonating you in that way. ... or you can find a babysitter?"
message_template = tokenizer.apply_chat_template(message, tokenize=False)

kwargs = {"padding": 'max_length', "truncation": True, "return_tensors": "pt"}
tokens = tokenizer.encode_plus(message_template, **kwargs)

# The reward is the single logit produced by the sequence classification head.
with torch.no_grad():
    reward_tensor = reward_model(
        tokens["input_ids"].to(reward_model.device),
        attention_mask=tokens["attention_mask"].to(reward_model.device),
    ).logits.reshape(-1)
    reward = reward_tensor.cpu().detach().item()
```
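Because the model returns a single scalar reward per conversation, it can also be used to rank candidate responses to the same prompt. Below is a minimal sketch of such a pairwise comparison, assuming the `tokenizer` and `reward_model` loaded above; the `score` helper, the example prompt, and the candidate answers are illustrative and not part of the official usage.

```python
import torch

def score(messages):
    # Illustrative helper (not provided by the model repo): return the scalar reward
    # for a conversation given as a list of {'role': ..., 'content': ...} dicts.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(
            tokens["input_ids"].to(reward_model.device),
            attention_mask=tokens["attention_mask"].to(reward_model.device),
        ).logits
    return logits.reshape(-1).item()

prompt = {'role': 'user', 'content': "Explain the difference between a list and a tuple in Python."}
candidates = [
    "A list is mutable, while a tuple is immutable; tuples can therefore be used as dictionary keys.",
    "They are basically the same thing, just use whichever you like.",
]

# Score each candidate answer and keep the one with the higher reward.
rewards = [score([prompt, {'role': 'assistant', 'content': c}]) for c in candidates]
chosen = candidates[rewards.index(max(rewards))]
```

Since the model is trained on pairwise preference data, relative rewards between responses to the same prompt are typically more informative than absolute reward values.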
## Citation

This reward model was used as a gold reward model in the following research: https://arxiv.org/abs/2406.10216. If you find this model helpful for your research, please cite:

```
@article{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.10216},
  year={2024}
}
```