Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V4

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the None dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6179	0.3027	79	0.7115	-0.1031	-0.0593	0.25	-0.0438	-164.1966	-138.2057	0.5429	0.5748
0.6065	0.6054	158	0.7348	-0.0751	0.0129	0.25	-0.0879	-163.4753	-137.9259	0.5242	0.5565
0.621	0.9080	237	0.7932	-0.0433	0.1366	0.5	-0.1800	-162.2375	-137.6083	0.4932	0.5259
0.4714	1.2107	316	0.7928	-0.6963	-0.5927	0.5	-0.1037	-169.5308	-144.1387	0.4698	0.5037
0.3829	1.5134	395	0.8637	-1.6604	-1.5528	0.3333	-0.1075	-179.1323	-153.7787	0.3664	0.4026
0.3589	1.8161	474	0.9222	-1.4397	-1.1360	0.25	-0.3037	-174.9637	-151.5720	0.3400	0.3770
0.2138	2.1188	553	0.9860	-1.9991	-1.6486	0.3333	-0.3505	-180.0903	-157.1666	0.2605	0.2992
0.0437	2.4215	632	1.1781	-3.1628	-2.7961	0.4167	-0.3666	-191.5652	-168.8030	0.1441	0.1838
0.1667	2.7241	711	1.2125	-3.3104	-2.9319	0.4167	-0.3786	-192.9225	-170.2794	0.1199	0.1595