leaderboard-pr-bot's picture
Adding Evaluation Results
6a8bccb verified
|
raw
history blame
7.61 kB
metadata
language:
  - en
license: apache-2.0
datasets:
  - openbmb/UltraFeedback
pipeline_tag: text-generation
model-index:
  - name: Mistral7B-PairRM-SPPO-Iter1
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 50.47
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 22.93
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 2.19
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 4.47
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.3
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 18.84
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
          name: Open LLM Leaderboard

Self-Play Preference Optimization for Language Model Alignment (https://arxiv.org/abs/2405.00675)

Mistral7B-PairRM-SPPO-Iter1

This model was developed using Self-Play Preference Optimization at iteration 1, based on the mistralai/Mistral-7B-Instruct-v0.2 architecture as starting point. We utilized the prompt sets from the openbmb/UltraFeedback dataset, splited to 3 parts for 3 iterations by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset. All responses used are synthetic.

This is the model reported in the paper , with K=5 (generate 5 responses per iteration). We attached the Arena-Hard eval results in this model page.

Links to Other Models

Model Description

  • Model type: A 7B parameter GPT-like model fine-tuned on synthetic datasets.
  • Language(s) (NLP): Primarily English
  • License: Apache-2.0
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2

AlpacaEval Leaderboard Evaluation Results

Model LC. Win Rate Win Rate Avg. Length
Mistral7B-PairRM-SPPO Iter 1 24.79 23.51 1855
Mistral7B-PairRM-SPPO Iter 2 26.89 27.62 2019
Mistral7B-PairRM-SPPO Iter 3 28.53 31.02 2163
Mistral7B-PairRM-SPPO Iter 1 (best-of-16) 28.71 27.77 1901
Mistral7B-PairRM-SPPO Iter 2 (best-of-16) 31.23 32.12 2035
Mistral7B-PairRM-SPPO Iter 3 (best-of-16) 32.13 34.94 2174

Arena-Hard Evaluation Results

Model Score 95% CI average # Tokens
Mistral7B-PairRM-SPPO-Iter3 23.3 (-1.8, 1.8) 578

Open LLM Leaderboard Evaluation Results

Results are reported by using lm-evaluation-harness v0.4.1

arc_challenge truthfulqa_mc2 winogrande gsm8k hellaswag mmlu average
Mistral7B-PairRM-SPPO Iter 1 65.02 69.4 77.82 43.82 85.11 58.84 66.67
Mistral7B-PairRM-SPPO Iter 2 65.53 69.55 77.03 44.35 85.29 58.72 66.75
Mistral7B-PairRM-SPPO Iter 3 65.36 69.97 76.8 42.68 85.16 58.45 66.4

MT-Bench Evaluation Results

1st Turn 2nd Turn Average
Mistral7B-PairRM-SPPO Iter 1 7.63 6.79 7.21
Mistral7B-PairRM-SPPO Iter 2 7.90 7.08 7.49
Mistral7B-PairRM-SPPO Iter 3 7.84 7.34 7.59

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • eta: 1000
  • per_device_train_batch_size: 8
  • gradient_accumulation_steps: 1
  • seed: 42
  • distributed_type: deepspeed_zero3
  • num_devices: 8
  • optimizer: RMSProp
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_train_epochs: 18.0 (stop at epoch=1.0)

Citation

@misc{wu2024self,
      title={Self-Play Preference Optimization for Language Model Alignment}, 
      author={Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan},
      year={2024},
      eprint={2405.00675},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 17.87
IFEval (0-Shot) 50.47
BBH (3-Shot) 22.93
MATH Lvl 5 (4-Shot) 2.19
GPQA (0-shot) 4.47
MuSR (0-shot) 8.30
MMLU-PRO (5-shot) 18.84