phi-2-dpo-renew1

This model is a DPO (direct preference optimization) fine-tuned version of lole25/phi-2-sft-lora-ultrachat, trained on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5780
  • Rewards/chosen: -0.8278
  • Rewards/rejected: -1.2811
  • Rewards/accuracies: 0.6305
  • Rewards/margins: 0.4532
  • Logps/rejected: -371.9221
  • Logps/chosen: -360.3287
  • Logits/rejected: -0.0200
  • Logits/chosen: -0.0541
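
This checkpoint is a PEFT (LoRA) adapter rather than a full set of model weights, so it is loaded on top of a base model. Below is a minimal loading sketch using `transformers` and `peft`; the `microsoft/phi-2` base and the `DUAL-GPO/phi-2-dpo-renew1` repository id come from this card, while the dtype, device placement, prompt, and generation settings are illustrative assumptions.

```python
# Minimal loading sketch (library versions as listed under "Framework versions").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "microsoft/phi-2"               # base model named in this card
adapter_id = "DUAL-GPO/phi-2-dpo-renew1"  # this adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,   # assumption; pick a dtype your hardware supports
    device_map="auto",
    trust_remote_code=True,       # may be needed for phi-2 on older transformers versions
)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the DPO-tuned adapter
model.eval()

prompt = "Explain the difference between supervised fine-tuning and DPO."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If an adapter-free checkpoint is preferred for deployment, `model.merge_and_unload()` can fold the LoRA weights into the base model after loading.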

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed
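
The card lists HuggingFaceH4/ultrafeedback_binarized as the training dataset. The sketch below loads its preference splits for inspection; the split and column names reflect the publicly documented layout of that dataset and should be double-checked against its dataset card.

```python
# Hedged sketch: peek at the preference pairs used for DPO-style training.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
train_prefs = ds["train_prefs"]   # preference splits per the public dataset card
test_prefs = ds["test_prefs"]

print(train_prefs.column_names)   # expected to include "prompt", "chosen", "rejected"
example = train_prefs[0]
print(example["prompt"])          # the user prompt
print(example["chosen"][-1])      # preferred assistant turn ("chosen"/"rejected" are message lists)
```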

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto TrainingArguments follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
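
As referenced above, these settings map directly onto `transformers.TrainingArguments`. The sketch below shows that mapping; the DPO-specific pieces (TRL's `DPOTrainer`, the beta value, the LoRA/PEFT config, and the dataset preprocessing) are not documented in this card and appear only as hedged placeholders in comments.

```python
# Hedged sketch: the listed hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-dpo-renew1",
    learning_rate=5e-6,               # learning_rate
    per_device_train_batch_size=4,    # train_batch_size
    per_device_eval_batch_size=4,     # eval_batch_size
    gradient_accumulation_steps=4,    # 4 x 4 accumulation matches the total_train_batch_size of 16
    seed=42,                          # seed
    lr_scheduler_type="cosine",       # lr_scheduler_type
    warmup_ratio=0.1,                 # lr_scheduler_warmup_ratio
    num_train_epochs=1,               # num_epochs
    adam_beta1=0.9,                   # optimizer: Adam betas and epsilon
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    evaluation_strategy="steps",      # matches the eval-every-100-steps cadence in the results table
    eval_steps=100,
    logging_steps=100,
)

# A TRL-style DPO run would then wire these into the trainer, roughly:
#   trainer = DPOTrainer(model, ref_model=None, args=training_args, beta=...,
#                        train_dataset=..., eval_dataset=..., tokenizer=tokenizer,
#                        peft_config=lora_config)
# beta, the LoRA config, and the datasets are assumptions here, not facts from this card.
```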

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6925 | 0.03 | 100 | 0.6928 | 0.0001 | -0.0008 | 0.4950 | 0.0008 | -243.8912 | -277.5416 | 1.0654 | 0.9728 |
| 0.6903 | 0.05 | 200 | 0.6900 | 0.0049 | -0.0015 | 0.5830 | 0.0064 | -243.9661 | -277.0526 | 1.0659 | 0.9732 |
| 0.682 | 0.08 | 300 | 0.6801 | 0.0215 | -0.0064 | 0.6055 | 0.0280 | -244.4588 | -275.3941 | 1.0974 | 1.0023 |
| 0.6574 | 0.1 | 400 | 0.6623 | -0.0453 | -0.1180 | 0.6055 | 0.0727 | -255.6189 | -282.0750 | 1.0541 | 0.9585 |
| 0.6262 | 0.13 | 500 | 0.6407 | -0.3256 | -0.4857 | 0.6045 | 0.1601 | -292.3858 | -310.1027 | 0.7972 | 0.7187 |
| 0.6441 | 0.16 | 600 | 0.6310 | -0.4984 | -0.7357 | 0.6040 | 0.2373 | -317.3828 | -327.3852 | 0.5041 | 0.4434 |
| 0.6238 | 0.18 | 700 | 0.6180 | -0.5136 | -0.7730 | 0.6175 | 0.2594 | -321.1137 | -328.9063 | 0.4768 | 0.4140 |
| 0.6022 | 0.21 | 800 | 0.6146 | -0.5608 | -0.8568 | 0.6095 | 0.2960 | -329.4937 | -333.6271 | 0.3469 | 0.2920 |
| 0.5893 | 0.24 | 900 | 0.6059 | -0.6665 | -1.0014 | 0.6170 | 0.3349 | -343.9540 | -344.1970 | 0.3136 | 0.2576 |
| 0.6435 | 0.26 | 1000 | 0.6007 | -0.5361 | -0.8713 | 0.6295 | 0.3352 | -330.9463 | -331.1562 | 0.3378 | 0.2766 |
| 0.5626 | 0.29 | 1100 | 0.5971 | -0.6841 | -1.0299 | 0.6195 | 0.3458 | -346.8068 | -345.9583 | 0.3416 | 0.2879 |
| 0.5319 | 0.31 | 1200 | 0.5971 | -0.8852 | -1.2896 | 0.6280 | 0.4044 | -372.7756 | -366.0687 | 0.1914 | 0.1477 |
| 0.5818 | 0.34 | 1300 | 0.5949 | -0.7178 | -1.1027 | 0.6315 | 0.3849 | -354.0860 | -349.3257 | 0.2165 | 0.1688 |
| 0.5981 | 0.37 | 1400 | 0.5936 | -0.6617 | -1.0257 | 0.6290 | 0.3641 | -346.3885 | -343.7120 | 0.1974 | 0.1465 |
| 0.5843 | 0.39 | 1500 | 0.5905 | -0.8861 | -1.3031 | 0.6335 | 0.4171 | -374.1299 | -366.1545 | 0.1004 | 0.0587 |
| 0.6283 | 0.42 | 1600 | 0.5882 | -0.7845 | -1.1706 | 0.6305 | 0.3860 | -360.8746 | -356.0013 | 0.2242 | 0.1738 |
| 0.5892 | 0.44 | 1700 | 0.5891 | -0.6741 | -1.0616 | 0.6310 | 0.3875 | -349.9719 | -344.9546 | 0.1718 | 0.1259 |
| 0.5821 | 0.47 | 1800 | 0.5856 | -0.8949 | -1.3353 | 0.6315 | 0.4404 | -377.3439 | -367.0341 | 0.1199 | 0.0761 |
| 0.6072 | 0.5 | 1900 | 0.5861 | -0.7180 | -1.1339 | 0.6270 | 0.4159 | -357.2063 | -349.3515 | 0.1237 | 0.0773 |
| 0.6338 | 0.52 | 2000 | 0.5852 | -0.7155 | -1.1277 | 0.6340 | 0.4122 | -356.5852 | -349.0984 | 0.0087 | -0.0301 |
| 0.5582 | 0.55 | 2100 | 0.5860 | -0.7383 | -1.1682 | 0.6340 | 0.4300 | -360.6402 | -351.3726 | -0.0229 | -0.0595 |
| 0.6103 | 0.58 | 2200 | 0.5821 | -0.9235 | -1.3855 | 0.6345 | 0.4620 | -382.3635 | -369.8921 | -0.0714 | -0.1065 |
| 0.5636 | 0.6 | 2300 | 0.5836 | -0.7656 | -1.2038 | 0.6335 | 0.4382 | -364.1970 | -354.1104 | -0.0481 | -0.0841 |
| 0.5846 | 0.63 | 2400 | 0.5804 | -0.8773 | -1.3343 | 0.6335 | 0.4570 | -377.2508 | -365.2781 | -0.0871 | -0.1200 |
| 0.5799 | 0.65 | 2500 | 0.5834 | -0.8420 | -1.3045 | 0.6340 | 0.4625 | -374.2641 | -361.7435 | -0.0576 | -0.0922 |
| 0.5565 | 0.68 | 2600 | 0.5810 | -0.8009 | -1.2549 | 0.6345 | 0.4540 | -369.3044 | -357.6355 | -0.0285 | -0.0643 |
| 0.5614 | 0.71 | 2700 | 0.5782 | -0.9522 | -1.4183 | 0.6325 | 0.4661 | -385.6433 | -372.7677 | -0.0358 | -0.0698 |
| 0.608 | 0.73 | 2800 | 0.5776 | -0.9378 | -1.3994 | 0.6360 | 0.4616 | -383.7585 | -371.3293 | -0.0229 | -0.0571 |
| 0.588 | 0.76 | 2900 | 0.5795 | -0.8330 | -1.2891 | 0.6345 | 0.4560 | -372.7224 | -360.8503 | -0.0442 | -0.0792 |
| 0.5324 | 0.79 | 3000 | 0.5807 | -0.7714 | -1.2134 | 0.6340 | 0.4420 | -365.1566 | -354.6904 | -0.0298 | -0.0648 |
| 0.6036 | 0.81 | 3100 | 0.5817 | -0.7454 | -1.1839 | 0.6360 | 0.4385 | -362.2076 | -352.0881 | -0.0359 | -0.0710 |
| 0.615 | 0.84 | 3200 | 0.5806 | -0.7630 | -1.2065 | 0.6330 | 0.4435 | -364.4670 | -353.8469 | -0.0295 | -0.0645 |
| 0.6211 | 0.86 | 3300 | 0.5794 | -0.7767 | -1.2207 | 0.6335 | 0.4439 | -365.8820 | -355.2186 | -0.0240 | -0.0585 |
| 0.535 | 0.89 | 3400 | 0.5777 | -0.8399 | -1.2929 | 0.6320 | 0.4530 | -373.1028 | -361.5366 | -0.0225 | -0.0558 |
| 0.5322 | 0.92 | 3500 | 0.5779 | -0.8260 | -1.2781 | 0.6335 | 0.4522 | -371.6272 | -360.1418 | -0.0210 | -0.0546 |
| 0.5527 | 0.94 | 3600 | 0.5780 | -0.8254 | -1.2779 | 0.6315 | 0.4525 | -371.6083 | -360.0847 | -0.0229 | -0.0565 |
| 0.5769 | 0.97 | 3700 | 0.5780 | -0.8286 | -1.2816 | 0.6315 | 0.4530 | -371.9745 | -360.4062 | -0.0225 | -0.0562 |
| 0.635 | 0.99 | 3800 | 0.5780 | -0.8268 | -1.2798 | 0.6300 | 0.4530 | -371.7967 | -360.2288 | -0.0237 | -0.0573 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2

Model tree for DUAL-GPO/phi-2-dpo-renew1

  • Base model: microsoft/phi-2
  • This repository is a PEFT adapter on the base model.

Dataset used to train DUAL-GPO/phi-2-dpo-renew1

  • HuggingFaceH4/ultrafeedback_binarized