---
license: cc-by-sa-4.0
---

This model is the RLHF version of `HuggingFaceH4/mistral-7b-sft-beta`, trained without any external responses.

We apply the GSHF algorithm to the SFT baseline. The only external signals are (1) a reward model and (2) AI-generated prompts.


**We obtain a 35.95% win rate (34.79% length-controlled win rate) on Alpaca Eval v2.** The win rate of the base model is only 4.63%.

On MT-Bench, the model scores about 7.5, while the base model scores only 5.3.

These results demonstrate the significant potential of iterative RLHF to teach an LLM to deliver appropriate and well-structured responses, even without any external responses.


## Model Details

We perform 3 iterations of the GSHF algorithm on `HuggingFaceH4/mistral-7b-sft-beta`, with responses labeled by a reward model and prompts generated by ChatGPT using self-instruct-style prompt augmentation.
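
The exact training code is not part of this card, but the minimal sketch below illustrates the core of one GSHF-style iteration under some assumptions: for each prompt, several candidate responses are sampled from the current policy, scored by a reward model, and the best and worst candidates form a preference pair for the next round of DPO-style training. The reward model named here is a public stand-in for illustration only, and the sampling hyperparameters are placeholders, not the exact recipe used for this model.

```python
# Sketch of one GSHF-style iteration: sample candidates, score with a reward
# model, keep the best/worst pair as preference data for the next round.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

policy = pipeline(
    "text-generation",
    model="HuggingFaceH4/mistral-7b-sft-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Stand-in reward model with a scalar output head; NOT necessarily the RM used here.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

def reward(prompt, response):
    """Scalar reward for a (prompt, response) pair."""
    inputs = rm_tokenizer(prompt, response, return_tensors="pt")
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

def preference_pair(prompt, n=8):
    """Sample n responses and return the highest- and lowest-scoring ones."""
    messages = [{"role": "user", "content": prompt}]
    text = policy.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = policy(
        text,
        max_new_tokens=512,
        do_sample=True,
        temperature=1.0,
        num_return_sequences=n,
        return_full_text=False,
    )
    ranked = sorted((o["generated_text"] for o in outputs), key=lambda r: reward(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

pair = preference_pair("What are the top 5 yoga poses for stress relief?")
# The collected pairs feed a DPO-style update of the policy, and the loop
# repeats for the next iteration.
```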

We use 60K AI-generated prompts in the training process.

Example prompts are shown below:
```json
{"prompt": "Why is gold considered a good reserve asset for central banks?"}
{"prompt": "What are the top 5 yoga poses for stress relief?"}
{"prompt": "Craft a blog title about the health implications of eating avocados daily based on their caloric value."}
{"prompt": "Design a simple HTML chat interface that simulates a conversation between a user and a bot, displaying two messages from each."}
{"prompt": "List 10 names from different cultures that embody the meanings of peace, harmony, or compassion."}
```
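
As a rough illustration of self-instruct-style prompt augmentation, the sketch below asks ChatGPT to write new prompts in the style of a few seed prompts. The model name, instruction wording, and response parsing are assumptions for illustration, not the exact pipeline used to build the 60K-prompt set.

```python
# Self-instruct-style prompt augmentation sketch (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_prompts = [
    "Why is gold considered a good reserve asset for central banks?",
    "What are the top 5 yoga poses for stress relief?",
]

def augment(seeds, n_new=5):
    """Ask ChatGPT for new prompts in the style of the seed prompts."""
    seed_block = "\n".join(f"- {s}" for s in seeds)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: the exact ChatGPT variant is not documented
        messages=[{
            "role": "user",
            "content": (
                f"Here are some example user prompts:\n{seed_block}\n\n"
                f"Write {n_new} new, diverse prompts of a similar style, one per line."
            ),
        }],
        temperature=1.0,
    )
    lines = completion.choices[0].message.content.splitlines()
    return [line.lstrip("-0123456789. ").strip() for line in lines if line.strip()]

print(augment(seed_prompts))
```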

## Uses

The usage and chat template format follow those of the SFT model `HuggingFaceH4/mistral-7b-sft-beta`.

```python
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="sfairXC/FsfairX-Zephyr-Chat-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user",   "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
```




## Evaluation

The evaluation results on Alpaca Eval v2 are provided below:

| Model       | Win Rate (%) | LC Win Rate (%) | Avg Length |
|-------------|--------------|-----------------|------------|
| Base        | 4.63     | 8.01        | 916        |
| Iteration 1 | 13.26    | 20.81       | 1205       |
| Iteration 2 | 23.57    | 27.63       | 1623       |
| Iteration 3 | 35.95    | 34.79       | 2275       |


## Citation 



If you find this model helpful, please cite the following papers.

```bibtex
@article{dong2023raft,
  title={{RAFT}: Reward rAnked FineTuning for Generative Foundation Model Alignment},
  author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  journal={arXiv preprint arXiv:2304.06767},
  year={2023}
}

@misc{xiong2024iterative,
      title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint}, 
      author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
      year={2024},
      eprint={2312.11456},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```