metadata

license: mit
library_name: peft
tags:
  - trl
  - dpo
  - generated_from_trainer
base_model: microsoft/phi-2
model-index:
  - name: phi2-lora-quantized-distilabel-intel-orca-dpo-pairs
    results: []

phi2-lora-quantized-distilabel-intel-orca-dpo-pairs

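This model is a LoRA adapter (PEFT) for microsoft/phi-2, fine-tuned with DPO using TRL. The training dataset is not recorded in this card, although the model name points to a distilabel Intel Orca DPO pairs dataset. It achieves the following results on the evaluation set: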

Loss: 0.5173
Rewards/chosen: -0.0019
Rewards/rejected: -0.7725
Rewards/accuracies: 0.7816
Rewards/margins: 0.7706
Logps/rejected: -233.5226
Logps/chosen: -214.1249
Logits/rejected: 0.3181
Logits/chosen: 0.2015
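
The reward figures above are related to each other. As a minimal sketch, assuming Rewards/margins is the mean chosen reward minus the mean rejected reward (the usual reading of these DPO metrics, though not stated in this card), the reported margin can be recomputed from the other two values:

```python
# Final evaluation values copied from the list above.
rewards_chosen = -0.0019
rewards_rejected = -0.7725
reported_margin = 0.7706

# Assumption: margin = chosen reward - rejected reward.
computed_margin = rewards_chosen - rewards_rejected

print(f"computed margin: {computed_margin:.4f}")
print(f"reported margin: {reported_margin:.4f}")
```

Under that assumption, almost all of the margin comes from the rejected reward moving down; the chosen reward stays close to zero.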

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 2
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 16
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 20
num_epochs: 1
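
The batch size and scheduler settings above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `num_devices = 1` is an assumption (the card does not state the device count, but it is consistent with 2 x 16 = 32), `linear_lr` is a hypothetical helper that mimics a linear warmup followed by linear decay, and `total_steps=150` is a placeholder because the card does not record the total number of optimizer steps.

```python
train_batch_size = 2
gradient_accumulation_steps = 16
num_devices = 1  # assumption: single device, not stated in the card

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)


def linear_lr(step, total_steps, base_lr=1e-05, warmup_steps=20):
    """Hypothetical helper: learning rate at a given optimizer step.

    Ramps linearly from 0 to base_lr over warmup_steps, then decays
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=150 is a placeholder value chosen for illustration.
for step in (0, 10, 20, 85, 150):
    print(step, linear_lr(step, total_steps=150))
```

With these settings the learning rate peaks at 1e-05 at step 20, the end of warmup, and falls from there for the rest of the epoch.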

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6887	0.14	20	0.6767	0.0030	-0.0331	0.6341	0.0361	-226.1282	-214.0752	0.2238	0.1343
0.6472	0.27	40	0.6171	0.0141	-0.1710	0.7639	0.1852	-227.5079	-213.9642	0.2464	0.1508
0.5759	0.41	60	0.5584	0.0123	-0.4023	0.7808	0.4146	-229.8206	-213.9829	0.2774	0.1736
0.5260	0.54	80	0.5326	0.0036	-0.5790	0.7816	0.5826	-231.5877	-214.0700	0.2983	0.1884
0.4963	0.68	100	0.5225	0.0020	-0.6964	0.7825	0.6984	-232.7611	-214.0853	0.3131	0.1986
0.4977	0.81	120	0.5188	-0.0025	-0.7533	0.7816	0.7508	-233.3300	-214.1302	0.3162	0.2002
0.4818	0.95	140	0.5173	-0.0019	-0.7725	0.7816	0.7706	-233.5226	-214.1249	0.3181	0.2015
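
To read the trend in the table, a short script helps. This sketch copies the Step, Validation Loss, and Rewards/accuracies columns by hand into a list (the variable names are illustrative, not part of the training code) and reports which logged step has the lowest validation loss and which has the highest reward accuracy.

```python
# (step, validation_loss, rewards_accuracy) copied from the table above.
rows = [
    (20, 0.6767, 0.6341),
    (40, 0.6171, 0.7639),
    (60, 0.5584, 0.7808),
    (80, 0.5326, 0.7816),
    (100, 0.5225, 0.7825),
    (120, 0.5188, 0.7816),
    (140, 0.5173, 0.7816),
]

# Does validation loss fall at every logged evaluation?
losses = [loss for _, loss, _ in rows]
loss_always_decreases = all(a > b for a, b in zip(losses, losses[1:]))

# Logged steps with the lowest validation loss and the highest accuracy.
best_loss_step = min(rows, key=lambda r: r[1])[0]
best_acc_step = max(rows, key=lambda r: r[2])[0]

print("validation loss decreases at every eval:", loss_always_decreases)
print("lowest validation loss at step:", best_loss_step)
print("highest reward accuracy at step:", best_acc_step)
```

Validation loss keeps improving through step 140, while reward accuracy levels off at roughly 0.78 from step 60 onward.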

Framework versions

PEFT 0.7.1
Transformers 4.37.1
Pytorch 2.1.0+cu118
Datasets 2.16.1
Tokenizers 0.15.1