Adding Evaluation Results

6a8bccb verified about 2 months ago

7.61 kB

	---
	language:
	- en
	license: apache-2.0
	datasets:
	- openbmb/UltraFeedback
	pipeline_tag: text-generation
	model-index:
	- name: Mistral7B-PairRM-SPPO-Iter1
	results:
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: IFEval (0-Shot)
	type: HuggingFaceH4/ifeval
	args:
	num_few_shot: 0
	metrics:
	- type: inst_level_strict_acc and prompt_level_strict_acc
	value: 50.47
	name: strict accuracy
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: BBH (3-Shot)
	type: BBH
	args:
	num_few_shot: 3
	metrics:
	- type: acc_norm
	value: 22.93
	name: normalized accuracy
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: MATH Lvl 5 (4-Shot)
	type: hendrycks/competition_math
	args:
	num_few_shot: 4
	metrics:
	- type: exact_match
	value: 2.19
	name: exact match
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: GPQA (0-shot)
	type: Idavidrein/gpqa
	args:
	num_few_shot: 0
	metrics:
	- type: acc_norm
	value: 4.47
	name: acc_norm
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: MuSR (0-shot)
	type: TAUR-Lab/MuSR
	args:
	num_few_shot: 0
	metrics:
	- type: acc_norm
	value: 8.3
	name: acc_norm
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: MMLU-PRO (5-shot)
	type: TIGER-Lab/MMLU-Pro
	config: main
	split: test
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 18.84
	name: accuracy
	source:
	url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
	name: Open LLM Leaderboard
	---
	Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

	# Mistral7B-PairRM-SPPO-Iter1

	This model was developed using [Self-Play Preference Optimization](https://arxiv.org/abs/2405.00675) at iteration 1, based on the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) architecture as starting point. We utilized the prompt sets from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, splited to 3 parts for 3 iterations by [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset). All responses used are synthetic.

	This is the model reported in the paper , with K=5 (generate 5 responses per iteration). We attached the Arena-Hard eval results in this model page.

	## Links to Other Models
	- [Mistral7B-PairRM-SPPO-Iter1](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1)
	- [Mistral7B-PairRM-SPPO-Iter2](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2)
	- [Mistral7B-PairRM-SPPO-Iter3](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3)
	- [Mistral7B-PairRM-SPPO](https://huggingface.co/UCLA-AGI/Mistral7B-PairRM-SPPO)

	### Model Description

	- Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
	- Language(s) (NLP): Primarily English
	- License: Apache-2.0
	- Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2


	## [AlpacaEval Leaderboard Evaluation Results](https://tatsu-lab.github.io/alpaca_eval/)


	\| Model \| LC. Win Rate \| Win Rate \| Avg. Length \|
	\|-------------------------------------------\|:------------:\|:--------:\|:-----------:\|
	\| Mistral7B-PairRM-SPPO Iter 1 \| 24.79 \| 23.51 \| 1855 \|
	\| Mistral7B-PairRM-SPPO Iter 2 \| 26.89 \| 27.62 \| 2019 \|
	\| Mistral7B-PairRM-SPPO Iter 3 \| 28.53 \| 31.02 \| 2163 \|
	\| Mistral7B-PairRM-SPPO Iter 1 (best-of-16) \| 28.71 \| 27.77 \| 1901 \|
	\| Mistral7B-PairRM-SPPO Iter 2 (best-of-16) \| 31.23 \| 32.12 \| 2035 \|
	\| Mistral7B-PairRM-SPPO Iter 3 (best-of-16) \| 32.13 \| 34.94 \| 2174 \|

	## [Arena-Hard Evaluation Results](https://github.com/lm-sys/arena-hard)

	Model \| Score \| 95% CI \| average \# Tokens \|
	\|----------\|-----------\|--------------\|-----------\|
	Mistral7B-PairRM-SPPO-Iter3\| 23.3 \| (-1.8, 1.8)\|578\|

	## [Open LLM Leaderboard Evaluation Results](https://github.com/EleutherAI/lm-evaluation-harness)

	Results are reported by using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.1

	\| \| arc_challenge \| truthfulqa_mc2 \| winogrande \| gsm8k \| hellaswag \| mmlu \| average \|
	\|--------\|---------------\|----------------\|------------\|-------\|-----------\|-------\|---------\|
	\| Mistral7B-PairRM-SPPO Iter 1 \| 65.02 \| 69.4 \| 77.82 \| 43.82 \| 85.11 \| 58.84 \| 66.67 \|
	\| Mistral7B-PairRM-SPPO Iter 2 \| 65.53 \| 69.55 \| 77.03 \| 44.35 \| 85.29 \| 58.72 \| 66.75 \|
	\| Mistral7B-PairRM-SPPO Iter 3 \| 65.36 \| 69.97 \| 76.8 \| 42.68 \| 85.16 \| 58.45 \| 66.4 \|
	## [MT-Bench Evaluation Results](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)

	\| \| 1st Turn \| 2nd Turn \| Average \|
	\|--------\|----------\|----------\|---------\|
	\| Mistral7B-PairRM-SPPO Iter 1 \| 7.63 \| 6.79 \| 7.21 \|
	\| Mistral7B-PairRM-SPPO Iter 2 \| 7.90 \| 7.08 \| 7.49 \|
	\| Mistral7B-PairRM-SPPO Iter 3 \| 7.84 \| 7.34 \| 7.59 \|

	### Training hyperparameters
	The following hyperparameters were used during training:

	- learning_rate: 5e-07
	- eta: 1000
	- per_device_train_batch_size: 8
	- gradient_accumulation_steps: 1
	- seed: 42
	- distributed_type: deepspeed_zero3
	- num_devices: 8
	- optimizer: RMSProp
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_train_epochs: 18.0 (stop at epoch=1.0)




	## Citation
	```
	@misc{wu2024self,
	title={Self-Play Preference Optimization for Language Model Alignment},
	author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
	year={2024},
	eprint={2405.00675},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_UCLA-AGI__Mistral7B-PairRM-SPPO-Iter1)

	\| Metric \|Value\|
	\|-------------------\|----:\|
	\|Avg. \|17.87\|
	\|IFEval (0-Shot) \|50.47\|
	\|BBH (3-Shot) \|22.93\|
	\|MATH Lvl 5 (4-Shot)\| 2.19\|
	\|GPQA (0-shot) \| 4.47\|
	\|MuSR (0-shot) \| 8.30\|
	\|MMLU-PRO (5-shot) \|18.84\|