Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V6

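This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not specified. At the last logged evaluation (step 675, epoch 2.7108), it achieves the following results on the evaluation set:

Loss: 1.0184
Rewards/chosen: -1.6627
Rewards/rejected: -1.4611
Rewards/accuracies: 0.5
Rewards/margins: -0.2016
Logps/rejected: -142.2372
Logps/chosen: -159.6465
Logits/rejected: -0.2970
Logits/chosen: -0.3265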

Model description

More information needed

The hyperparameters used during training are not listed. Training results, evaluated every 75 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6746	0.3012	75	0.6658	0.0862	0.0321	0.75	0.0541	-127.3055	-142.1577	0.1821	0.1663
0.5925	0.6024	150	0.6506	0.1218	0.0304	0.5833	0.0914	-127.3224	-141.8020	0.1565	0.1401
0.7335	0.9036	225	0.7279	-0.0626	-0.0395	0.5	-0.0231	-128.0216	-143.6459	0.1275	0.1103
0.6498	1.2048	300	0.7880	-0.2917	-0.2254	0.4167	-0.0663	-129.8807	-145.9371	0.0678	0.0485
0.386	1.5060	375	0.7303	-0.2014	-0.2339	0.5	0.0325	-129.9658	-145.0339	0.0325	0.0140
0.2307	1.8072	450	0.8159	-0.5206	-0.4793	0.5	-0.0412	-132.4201	-148.2257	-0.0582	-0.0797
0.1034	2.1084	525	0.9133	-1.0254	-0.8918	0.4167	-0.1335	-136.5451	-153.2736	-0.2025	-0.2290
0.284	2.4096	600	1.0153	-1.5972	-1.3870	0.4167	-0.2102	-141.4962	-158.9917	-0.2790	-0.3083
0.0599	2.7108	675	1.0184	-1.6627	-1.4611	0.5	-0.2016	-142.2372	-159.6465	-0.2970	-0.3265
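
The table can be sanity-checked with a short script. The sketch below is not part of the original training code; it copies the evaluation columns from the rows above and assumes that Rewards/margins is meant to equal Rewards/chosen minus Rewards/rejected (the names `ROWS`, `margin_mismatches`, and `best_step` are illustrative). It also reports the step with the lowest validation loss, which is one simple way to choose a checkpoint when validation loss rises while training loss keeps falling, as it does here after step 150.

```python
# Evaluation columns copied from the training results table:
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (75, 0.6658, 0.0862, 0.0321, 0.0541),
    (150, 0.6506, 0.1218, 0.0304, 0.0914),
    (225, 0.7279, -0.0626, -0.0395, -0.0231),
    (300, 0.7880, -0.2917, -0.2254, -0.0663),
    (375, 0.7303, -0.2014, -0.2339, 0.0325),
    (450, 0.8159, -0.5206, -0.4793, -0.0412),
    (525, 0.9133, -1.0254, -0.8918, -0.1335),
    (600, 1.0153, -1.5972, -1.3870, -0.2102),
    (675, 1.0184, -1.6627, -1.4611, -0.2016),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where margin != chosen - rejected beyond rounding error.

    The tolerance allows for the table values being rounded to 4 decimals.
    """
    return [
        step
        for step, _loss, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_step(rows):
    """Return (step, validation_loss) for the row with the lowest validation loss."""
    step, loss, *_ = min(rows, key=lambda row: row[1])
    return step, loss


if __name__ == "__main__":
    print(margin_mismatches(ROWS))  # [] : every margin matches chosen - rejected
    print(best_step(ROWS))          # (150, 0.6506)
```

By this check the margins are consistent with the reward columns, and the lowest validation loss occurs at step 150; from step 225 onward the validation loss exceeds the step-75 value while the training loss continues to drop.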