---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- rlhf
- alignment
- simulation
- computational social science
---


# Model Card for HH-RLHF-SFT

![model image](https://agwarbliu.s3.amazonaws.com/logo.png)

![model image](https://agwarbliu.s3.amazonaws.com/model_select_sft.png)


**An efficient, effective, and stable alternative to RLHF!**

**Instead of training an additional reward model that is likely to be gamed, we directly train the model on the social games!** 🕹️ 🎲 🎮

Full details on simulation and training can be found [here](https://github.com/agi-templar/Stable-Alignment).

# Training Procedure

This model is the second step of the Stable Alignment project: a model obtained by supervised fine-tuning on the [Anthropic HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), using only the 'accepted' options (the dataset's `chosen` field).
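To illustrate what training only on the preferred option can look like in practice, here is a minimal sketch that splits one HH-RLHF transcript into its dialogue context and its final assistant reply. It assumes the dataset's usual `\n\nHuman:` / `\n\nAssistant:` turn markers; the helper name `split_chosen` and the example row are hypothetical, and this is not necessarily the exact preprocessing used for this model.

```python
from typing import Tuple

# Turn marker used in HH-RLHF transcripts (assumed format).
ASSISTANT_TAG = "\n\nAssistant:"


def split_chosen(transcript: str) -> Tuple[str, str]:
    """Split a transcript into (dialogue context, final assistant reply).

    Hypothetical helper: the context keeps everything up to and including
    the last assistant marker, the reply is the text that follows it.
    """
    cut = transcript.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("transcript has no assistant turn")
    end = cut + len(ASSISTANT_TAG)
    return transcript[:end], transcript[end:].strip()


# Made-up row in the shape of one dataset record; only `chosen` is used.
example = {
    "chosen": "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: I don't know.",
}

context, reply = split_chosen(example["chosen"])
print(repr(context))
print(repr(reply))
```

Real rows can be obtained with `datasets.load_dataset("Anthropic/hh-rlhf")`; the `rejected` field is simply ignored for supervised fine-tuning.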

We use the [Alpaca fine-tuning script](https://github.com/tatsu-lab/stanford_alpaca) to train this model.
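The Alpaca script reads a JSON list of records with `instruction`, `input`, and `output` fields and wraps each one in a fixed prompt template. The sketch below shows one way a dialogue could be mapped into that shape. Placing the dialogue context in `instruction` with an empty `input` is an assumption made for illustration, and the helper names `to_alpaca_record` and `build_prompt` are hypothetical; the linked repository is the authoritative source for the actual data format.

```python
import json

# Alpaca's template for records without an `input` field.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def to_alpaca_record(context: str, reply: str) -> dict:
    """Hypothetical mapping of a dialogue into an Alpaca-style record."""
    return {"instruction": context.strip(), "input": "", "output": reply.strip()}


def build_prompt(record: dict) -> str:
    """Format the text the model sees before it generates `output`."""
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Made-up example, not taken from the dataset.
record = to_alpaca_record(
    "Human: How do I boil an egg?",
    "Simmer it for about ten minutes.",
)
prompt = build_prompt(record)
serialized = json.dumps([record], indent=2)  # file layout the script expects

print(prompt)
print(serialized)
```

If the model was indeed trained with this template, building inference prompts with the same `build_prompt` formatting should keep them consistent with training.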


# Bias, Risks, and Limitations

Although this project aims to better align current LMs with social norms, inappropriate content and inherent biases in the training data will still impair the alignment of the model.

The model should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.

# Citation

Please cite our paper if you use the data or code in this repo:

```bibtex
@misc{liu2023sociallyaligned,
      title={Training Socially Aligned Language Models in Simulated Human Society},
      author={Ruibo Liu and Ruixin Yang and Chenyan Jia and Ge Zhang and Denny Zhou and Andrew M. Dai and Diyi Yang and Soroush Vosoughi},
      year={2023},
      eprint={2305.16960},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```