---
language:
  - en
license: apache-2.0
tags:
  - alignment-handbook
  - generated_from_trainer
base_model: microsoft/phi-2
pipeline_tag: text-generation
model-index:
  - name: spin-phi2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 63.57
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 75.57
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 57.93
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 46.22
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 73.48
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 53.3
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2
          name: Open LLM Leaderboard
---

# outputs

This model is a fine-tuned version of microsoft/phi-2, trained with SPIN on the ultrachat_200k dataset.
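A minimal usage sketch for loading the model with Transformers (the repo id `amu/spin-phi2` is taken from the leaderboard links in the metadata above; the prompt format and generation settings are illustrative only):

```python
# Minimal usage sketch. The repo id "amu/spin-phi2" comes from the leaderboard
# links in the metadata above; the prompt format and generation settings here
# are illustrative, not the ones used for training or evaluation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amu/spin-phi2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Instruct: Explain self-play fine-tuning (SPIN) in one paragraph.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```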

## What's new

I think SPIN can be applied not only to an SFT model but also to a pretrained model. Therefore, I applied SPIN to the pretrained model microsoft/phi-2 and obtained higher scores than the original pretrained model. You can check the Open LLM Leaderboard.

I think the best paradigm for training a conversational large language model (LLM) is: pretrain -> dpo(spin) -> sft -> dpo(spin).
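For intuition, here is a rough, simplified sketch of the SPIN idea (not the actual training code used for this model): in each round, the current model generates its own responses to the dataset prompts, and a DPO-style loss then prefers the dataset response over the model's own generation, with the previous iterate acting as the reference/opponent model.

```python
# Simplified sketch of the SPIN objective (illustrative, not the training code
# used for this model). SPIN reuses a DPO-style loss in which the dataset
# ("real") response is treated as chosen and the current model's own generation
# ("synthetic") as rejected; the previous iterate serves as the reference model.
import torch
import torch.nn.functional as F

def spin_loss(policy_logp_real, policy_logp_synth,
              ref_logp_real, ref_logp_synth, beta=0.1):
    """DPO-style objective: prefer the real response over the self-generated one."""
    margin = (policy_logp_real - ref_logp_real) - (policy_logp_synth - ref_logp_synth)
    return -F.logsigmoid(beta * margin).mean()

# Quick check with dummy summed log-probabilities for a batch of 4 response pairs.
dummy = lambda: torch.randn(4)
print(spin_loss(dummy(), dummy(), dummy(), dummy()))
```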

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

- learning_rate: 5e-07
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
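As a rough illustration, the listed values map onto a Transformers `TrainingArguments`-style configuration along these lines. This is only a sketch: the actual run used a SPIN/DPO-style trainer, and anything not listed above (such as mixed precision or the output directory) is an assumption.

```python
# Rough mapping of the listed hyperparameters onto transformers TrainingArguments.
# Sketch only: the actual run used a SPIN/DPO-style trainer, and values not listed
# above (e.g. output_dir, mixed precision) are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=5e-7,
    per_device_train_batch_size=1,   # train_batch_size: 1
    per_device_eval_batch_size=1,    # eval_batch_size: 1
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
)

# Effective train batch size: 1 per device x 8 GPUs x 8 accumulation steps = 64,
# matching total_train_batch_size above. The Adam betas/epsilon listed match the
# transformers defaults (0.9, 0.999, 1e-8).
```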

### Framework versions

- Transformers 4.37.0
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.15.2

# Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=amu/spin-phi2).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 61.68 |
| AI2 Reasoning Challenge (25-Shot) | 63.57 |
| HellaSwag (10-Shot)               | 75.57 |
| MMLU (5-Shot)                     | 57.93 |
| TruthfulQA (0-shot)               | 46.22 |
| Winogrande (5-shot)               | 73.48 |
| GSM8k (5-shot)                    | 53.30 |
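For reference, the Avg. row is just the unweighted mean of the six benchmark scores:

```python
# Sanity check: the "Avg." row is the unweighted mean of the six benchmark scores.
scores = [63.57, 75.57, 57.93, 46.22, 73.48, 53.30]
print(round(sum(scores) / len(scores), 2))  # 61.68
```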