	---
	base_model: UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3
	datasets:
	- openbmb/UltraFeedback
	language:
	- en
	license: apache-2.0
	pipeline_tag: text-generation
	tags:
	- autoquant
	- UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3
	- gptq
	---
	Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

	# Mistral7B-PairRM-SPPO-Iter3

	This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 3, starting from [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). We used the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, split into 3 parts for the 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.

	This is the model reported in the paper, with K=5 (5 responses generated per iteration). The Arena-Hard evaluation results are included on this model page.
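
The K=5 setting can be illustrated with a small sketch. It assumes, following the description in the SPPO paper, that each sampled response's win probability against the current policy is estimated by averaging its pairwise preference probabilities over the K samples (including itself, with self-preference 0.5), and that the regression target for the log-density ratio is `eta * (p - 0.5)`. The function names and the preference matrix are hypothetical and are not taken from the authors' code; the default `eta` is the value listed under the training hyperparameters.

```python
from typing import List


def estimate_win_rates(pairwise: List[List[float]]) -> List[float]:
    """Row-average a K x K matrix where pairwise[i][j] ~ P(y_i beats y_j | x)."""
    k = len(pairwise)
    if any(len(row) != k for row in pairwise):
        raise ValueError("pairwise must be a square K x K matrix")
    return [sum(row) / k for row in pairwise]


def sppo_targets(win_rates: List[float], eta: float = 1000.0) -> List[float]:
    """Assumed target for log(pi_new(y|x) / pi_old(y|x)): eta * (win_rate - 1/2)."""
    return [eta * (p - 0.5) for p in win_rates]


# Made-up preference probabilities for K=5 responses to one prompt.
# pairwise[i][j] + pairwise[j][i] == 1 and the diagonal is 0.5.
pairwise = [
    [0.5, 0.7, 0.6, 0.8, 0.9],
    [0.3, 0.5, 0.4, 0.6, 0.7],
    [0.4, 0.6, 0.5, 0.7, 0.8],
    [0.2, 0.4, 0.3, 0.5, 0.6],
    [0.1, 0.3, 0.2, 0.4, 0.5],
]

win_rates = estimate_win_rates(pairwise)
targets = sppo_targets(win_rates)

# Index of the response with the highest estimated win rate.
best = max(range(len(win_rates)), key=win_rates.__getitem__)
```

Responses with an estimated win rate above 0.5 get a positive target (their likelihood is pushed up), and those below 0.5 get a negative one.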

	## Links to Other Models
	- [Mistral7B-PairRM-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1)
	- [Mistral7B-PairRM-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2)
	- [Mistral7B-PairRM-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3)
	- [Mistral7B-PairRM-SPPO](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO)



	### Model Description

	- Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
	- Language(s) (NLP): Primarily English
	- License: Apache-2.0
	- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2


	## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)


	| Model | LC. Win Rate | Win Rate | Avg. Length |
	|-------------------------------------------|:------------:|:--------:|:-----------:|
	| Mistral7B-PairRM-SPPO Iter 1 | 24.79 | 23.51 | 1855 |
	| Mistral7B-PairRM-SPPO Iter 2 | 26.89 | 27.62 | 2019 |
	| Mistral7B-PairRM-SPPO Iter 3 | 28.53 | 31.02 | 2163 |
	| Mistral7B-PairRM-SPPO Iter 1 (best-of-16) | 28.71 | 27.77 | 1901 |
	| Mistral7B-PairRM-SPPO Iter 2 (best-of-16) | 31.23 | 32.12 | 2035 |
	| Mistral7B-PairRM-SPPO Iter 3 (best-of-16) | 32.13 | 34.94 | 2174 |

	## [Arena-Hard Evaluation Results](https://github.com/lm-sys/arena-hard)

	| Model | Score | 95% CI | Average # Tokens |
	|----------|-----------|--------------|-----------|
	| Mistral7B-PairRM-SPPO-Iter3 | 23.3 | (-1.8, 1.8) | 578 |
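
The 95% CI column can be converted to absolute bounds with a short sketch. It assumes the two numbers are offsets relative to the score, which is how Arena-Hard results are usually read. The helper name is hypothetical.

```python
from typing import Tuple


def absolute_interval(score: float, offsets: Tuple[float, float]) -> Tuple[float, float]:
    """Turn (lower_offset, upper_offset) around a score into absolute bounds."""
    low_off, high_off = offsets
    # Round to one decimal, the precision used in the table.
    return (round(score + low_off, 1), round(score + high_off, 1))


# Values copied from the Arena-Hard row for Mistral7B-PairRM-SPPO-Iter3.
bounds = absolute_interval(23.3, (-1.8, 1.8))
print(bounds)
```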

	## [Open LLM Leaderboard Evaluation Results](https://github.com/EleutherAI/lm-evaluation-harness)

	Results are reported using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.1.

	| Model | arc_challenge | truthfulqa_mc2 | winogrande | gsm8k | hellaswag | mmlu | average |
	|--------|---------------|----------------|------------|-------|-----------|-------|---------|
	| Mistral7B-PairRM-SPPO Iter 1 | 65.02 | 69.4 | 77.82 | 43.82 | 85.11 | 58.84 | 66.67 |
	| Mistral7B-PairRM-SPPO Iter 2 | 65.53 | 69.55 | 77.03 | 44.35 | 85.29 | 58.72 | 66.75 |
	| Mistral7B-PairRM-SPPO Iter 3 | 65.36 | 69.97 | 76.8 | 42.68 | 85.16 | 58.45 | 66.4 |
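
As a sanity check, the `average` column can be recomputed from the six task scores. The sketch assumes the average is the unweighted mean of the task columns, which the table does not state explicitly. The variable names are illustrative.

```python
from statistics import mean

# Per-task scores copied from the table above, in column order:
# arc_challenge, truthfulqa_mc2, winogrande, gsm8k, hellaswag, mmlu.
scores = {
    "Iter 1": [65.02, 69.4, 77.82, 43.82, 85.11, 58.84],
    "Iter 2": [65.53, 69.55, 77.03, 44.35, 85.29, 58.72],
    "Iter 3": [65.36, 69.97, 76.8, 42.68, 85.16, 58.45],
}
reported = {"Iter 1": 66.67, "Iter 2": 66.75, "Iter 3": 66.4}

recomputed = {name: mean(vals) for name, vals in scores.items()}

# Largest absolute difference between recomputed and reported averages.
max_gap = max(abs(recomputed[name] - reported[name]) for name in scores)
print(recomputed, max_gap)
```

Under that assumption, the recomputed means agree with the reported column to within rounding.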

	## [MT-Bench Evaluation Results](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)

	| Model | 1st Turn | 2nd Turn | Average |
	|--------|----------|----------|---------|
	| Mistral7B-PairRM-SPPO Iter 1 | 7.63 | 6.79 | 7.21 |
	| Mistral7B-PairRM-SPPO Iter 2 | 7.90 | 7.08 | 7.49 |
	| Mistral7B-PairRM-SPPO Iter 3 | 7.84 | 7.34 | 7.59 |

	### Training hyperparameters
	The following hyperparameters were used during training:

	- learning_rate: 5e-07
	- eta: 1000
	- per_device_train_batch_size: 8
	- gradient_accumulation_steps: 1
	- seed: 42
	- distributed_type: deepspeed_zero3
	- num_devices: 8
	- optimizer: RMSProp
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_train_epochs: 18.0 (stop at epoch=1.0)
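
Two quantities implied by these settings can be worked out with a rough sketch: the global batch size, and the learning rate produced by a linear schedule with warmup. The schedule below is a simplified stand-in (linear warmup to the peak, then linear decay to zero), not the authors' training code, and `steps_per_epoch = 100` is a made-up number. It also assumes the schedule spans all 18 configured epochs even though training stops at epoch 1.0.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Number of examples consumed per optimizer step across all devices."""
    return per_device * grad_accum * num_devices


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float = 5e-7, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


# per_device_train_batch_size=8, gradient_accumulation_steps=1, num_devices=8
global_batch = effective_batch_size(8, 1, 8)

# Hypothetical run length: 100 optimizer steps per epoch, 18 configured epochs.
steps_per_epoch = 100
total_steps = 18 * steps_per_epoch
lr_at_stop = linear_schedule_lr(1 * steps_per_epoch, total_steps)
print(global_batch, lr_at_stop)
```

If the schedule does span 18 epochs, warmup covers the first 1.8 epochs, so a run stopped at epoch 1.0 would end while still warming up, at roughly 56% of the peak learning rate.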




	## Citation
	```
	@misc{wu2024self,
	title={Self-Play Preference Optimization for Language Model Alignment},
	author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
	year={2024},
	eprint={2405.00675},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```