metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: dpo-selective-buffer-safeipo
    results: []

dpo-selective-buffer-safeipo

This model was trained from scratch on the None dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Rewards/safe Rewards	Rewards/unsafe Rewards	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
13096.7359	0.16	300	4529.6733	-0.3957	-0.4772	0.6584	0.0815	-0.3930	-0.3956	-140.1830	-170.0027	-2.1815	-2.3195
11584.7875	0.32	600	4406.7134	-0.8083	-0.8819	0.6338	0.0736	-0.8028	-0.8050	-180.6571	-211.2575	-1.7938	-1.9934
10862.3484	0.48	900	4377.5635	-0.8828	-0.9530	0.6196	0.0701	-0.8775	-0.8778	-187.7609	-218.7140	-1.7468	-1.9377
11671.4219	0.65	1200	4346.4053	-0.9811	-1.0509	0.6158	0.0699	-0.9764	-0.9768	-197.5588	-228.5369	-1.6740	-1.8665
10202.4125	0.81	1500	4320.9878	-0.9655	-1.0271	0.6023	0.0617	-0.9611	-0.9618	-195.1794	-226.9775	-1.7645	-1.9420
11785.8336	0.97	1800	4320.8208	-0.9417	-1.0065	0.6027	0.0648	-0.9369	-0.9373	-193.1151	-224.6014	-1.7745	-1.9550