qwen_orpo_entropy_0_01

This model is a fine-tuned version of trl-lib/qwen1.5-0.5b-sft on the yakazimir/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.5589
Sft Loss: 3.3163
Rewards/chosen: -3.1855
Rewards/rejected: -4.1739
Rewards/accuracies: 0.7226
Rewards/margins: 0.9884
Logps/rejected: -4.1739
Logps/chosen: -3.1855
Logits/rejected: 0.1645
Logits/chosen: 0.0521
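
The reward figures above can be cross-checked against each other. The short sketch below assumes the margin is simply the chosen reward minus the rejected reward; the card does not state this, but the reported numbers are consistent with it. In this card the Rewards/* values are also identical to the corresponding Logps/* values.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -3.1855
rewards_rejected = -4.1739
reported_margin = 0.9884

# Assumption: margin = chosen reward - rejected reward (not stated in the card).
margin = rewards_chosen - rewards_rejected

print(f"computed margin: {margin:.4f}")
assert abs(margin - reported_margin) < 1e-4
```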

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 16
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0
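
To make the batch-size and schedule settings concrete, here is a minimal sketch in plain Python. It multiplies the listed batch size by the accumulation steps (which already gives the listed total of 32; no device count appears in the card) and evaluates a linear-warmup, cosine-decay learning rate. The total step count is an estimate inferred from the last row of the results table, and the `lr_at` helper is a hypothetical illustration of the usual schedule shape, not the training code used for this model.

```python
import math

# Values copied from the hyperparameter list above.
PEAK_LR = 1e-06
TRAIN_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 16
WARMUP_RATIO = 0.1
NUM_EPOCHS = 3.0

# Assumption: estimated from the results table (step 5600 at epoch 2.9972);
# the exact number of optimizer steps is not stated in the card.
STEPS_PER_EPOCH = 5600 / 2.9972
TOTAL_STEPS = round(STEPS_PER_EPOCH * NUM_EPOCHS)
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("sequences per optimizer step:", TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS)
print("estimated total steps:", TOTAL_STEPS, "| warmup steps:", WARMUP_STEPS)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>5}: lr = {lr_at(step):.3e}")
```

Under these assumptions the learning rate peaks at 1e-06 after roughly the first tenth of training and decays to zero by the final step.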

Training results

Training Loss	Epoch	Step	Validation Loss	Sft Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7198	0.2141	400	0.7213	1.4421	-1.5116	-1.6721	0.5556	0.1605	-1.6721	-1.5116	0.3831	0.2918
0.6256	0.4282	800	0.6211	2.0645	-2.1064	-2.5196	0.6654	0.4133	-2.5196	-2.1064	0.4328	0.3408
0.6247	0.6422	1200	0.5880	2.6367	-2.5144	-3.0951	0.6966	0.5807	-3.0951	-2.5144	0.4051	0.3038
0.5355	0.8563	1600	0.5751	2.5635	-2.4305	-2.9974	0.7062	0.5669	-2.9974	-2.4305	0.4192	0.3133
0.6075	1.0704	2000	0.5675	2.6770	-2.5347	-3.1956	0.7166	0.6609	-3.1956	-2.5347	0.3536	0.2455
0.5886	1.2845	2400	0.5600	2.9406	-2.8008	-3.5986	0.7292	0.7978	-3.5986	-2.8008	0.2408	0.1351
0.5549	1.4986	2800	0.5573	2.8692	-2.7229	-3.5062	0.7248	0.7833	-3.5062	-2.7229	0.2546	0.1468
0.5785	1.7127	3200	0.5549	2.8827	-2.7303	-3.5085	0.7240	0.7782	-3.5085	-2.7303	0.2599	0.1531
0.5649	1.9267	3600	0.5509	2.9742	-2.8066	-3.6363	0.7240	0.8296	-3.6363	-2.8066	0.2062	0.0982
0.4683	2.1408	4000	0.5601	3.3501	-3.1588	-4.1100	0.7196	0.9512	-4.1100	-3.1588	0.1350	0.0257
0.491	2.3549	4400	0.5604	3.3569	-3.2270	-4.2111	0.7203	0.9841	-4.2111	-3.2270	0.2088	0.0922
0.4967	2.5690	4800	0.5589	3.2861	-3.1626	-4.1414	0.7226	0.9787	-4.1414	-3.1626	0.1660	0.0539
0.439	2.7831	5200	0.5584	3.3040	-3.1772	-4.1641	0.7211	0.9869	-4.1641	-3.1772	0.1462	0.0352
0.4704	2.9972	5600	0.5590	3.3163	-3.1855	-4.1739	0.7226	0.9884	-4.1739	-3.1855	0.1645	0.0521
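
The table can also be scanned for the best checkpoints by each metric. The sketch below copies three columns (step, validation loss, rewards/accuracies) from the rows above and reports where each is best. Whether intermediate checkpoints were saved or published is not stated in the card, so this is only a reading of the logged numbers.

```python
# (step, validation loss, rewards/accuracies) copied from the table above.
rows = [
    (400, 0.7213, 0.5556),
    (800, 0.6211, 0.6654),
    (1200, 0.5880, 0.6966),
    (1600, 0.5751, 0.7062),
    (2000, 0.5675, 0.7166),
    (2400, 0.5600, 0.7292),
    (2800, 0.5573, 0.7248),
    (3200, 0.5549, 0.7240),
    (3600, 0.5509, 0.7240),
    (4000, 0.5601, 0.7196),
    (4400, 0.5604, 0.7203),
    (4800, 0.5589, 0.7226),
    (5200, 0.5584, 0.7211),
    (5600, 0.5590, 0.7226),
]

# Lowest validation loss and highest preference accuracy across logged steps.
best_loss_step, best_loss, _ = min(rows, key=lambda r: r[1])
best_acc_step, _, best_acc = max(rows, key=lambda r: r[2])

print(f"lowest validation loss: {best_loss} at step {best_loss_step}")
print(f"highest rewards/accuracies: {best_acc} at step {best_acc_step}")
```

In these logs the validation loss is lowest at step 3600 (0.5509) and accuracy is highest at step 2400 (0.7292), while the training loss keeps falling through the third epoch.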

Framework versions

Transformers 4.44.2
Pytorch 2.2.2+cu121
Datasets 2.18.0
Tokenizers 0.19.1
