agi-css committed on
Commit
6f56fc0
1 Parent(s): f4ad8e7

Create README.md

Files changed (1)
README.md +41 -0
README.md ADDED
---
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- rlhf
- alignment
- simulation
- computational social science
---

# Model Card for So(cially)-Good LM

![model image](https://agwarbliu.s3.amazonaws.com/logo.png)

![model image](https://agwarbliu.s3.amazonaws.com/model_select_sft.png)

**An Efficient, Effective, and Stable Alternative to RLHF!**

**Instead of training an additional reward model that is likely to be gamed, we directly train the model on the social games!** 🕹️ 🎲 🎮

Full details on simulation and training can be found [here](https://github.com/agi-templar/Stable-Alignment).
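
To try the model, it can be loaded through the standard `transformers` text-generation API. Below is a minimal usage sketch; the repository id is a placeholder (this card does not state the actual id), and the prompt format follows the HH-RLHF dialogue style:

```python
# Minimal usage sketch. The repo id below is a PLACEHOLDER, not confirmed
# by this card; substitute the actual model id before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agi-css/socially-good-lm"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Human: How should I respond to a rude coworker?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```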

# Training Procedure

This model is the second step of the Stable Alignment project: it is supervised fine-tuned on the [Anthropic HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), using only the 'accepted' options.
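
To illustrate that filtering step, here is a minimal data-preparation sketch: the dataset's `chosen` field holds the accepted dialogue, and the split into prompt and response below is an assumption for illustration, not the project's actual preprocessing code.

```python
# Sketch: keep only the accepted ("chosen") side of each HH-RLHF pair for SFT.
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def to_sft_example(example):
    # Each "chosen" record is a full dialogue; the text after the last
    # "Assistant:" turn is the accepted response, everything before is the prompt.
    prompt, sep, response = example["chosen"].rpartition("\n\nAssistant:")
    return {"prompt": prompt + sep, "response": response.strip()}

sft_data = dataset.map(to_sft_example, remove_columns=["chosen", "rejected"])
```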

We use the [Alpaca fine-tuning script](https://github.com/tatsu-lab/stanford_alpaca) to train this model.
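
For orientation only, the sketch below shows the shape of the supervised objective that script implements, using the plain Hugging Face `Trainer` rather than the Alpaca code itself; the base checkpoint and all hyperparameters are illustrative assumptions, not the values used for this model.

```python
# Conceptual SFT sketch (continuing from `sft_data` above). The base model
# and all hyperparameters are ASSUMPTIONS for illustration; the actual
# training used the Stanford Alpaca script linked above.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "huggyllama/llama-7b"  # assumed Llama-style base; placeholder
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

def tokenize(example):
    # Standard causal-LM SFT: prompt and accepted response as one sequence.
    text = example["prompt"] + " " + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = sft_data.map(tokenize, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-output",
        num_train_epochs=3,              # illustrative
        per_device_train_batch_size=4,   # illustrative
        learning_rate=2e-5,              # illustrative
        bf16=True,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```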

# Bias, Risks, and Limitations

Although this project aims to better align current LMs with social norms, inappropriate content and inherent biases in the training data can still impair the alignment of the model.

The model should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.