Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V4

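This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The fine-tuning dataset is not specified. Summary evaluation results are not listed; the per-step evaluation metrics recorded during training are given in the table at the end of this card.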

Model description

More information needed

The hyperparameters used during training are not listed in this card.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7058	0.3016	76	0.6991	0.0563	0.0630	0.5833	-0.0067	-76.4141	-83.8712	0.4583	0.4571
0.7778	0.6032	152	0.6695	-0.3797	-0.4650	0.5833	0.0853	-81.6941	-88.2312	0.4087	0.4065
1.1444	0.9048	228	0.6343	-0.5747	-0.8035	0.6667	0.2288	-85.0798	-90.1819	0.3921	0.3901
0.3356	1.2063	304	0.5906	-0.6785	-1.0749	0.75	0.3964	-87.7931	-91.2192	0.3726	0.3707
0.2763	1.5079	380	0.5523	-1.2776	-1.8289	0.6667	0.5513	-95.3333	-97.2104	0.2582	0.2534
0.3627	1.8095	456	0.6087	-0.9428	-1.2234	0.6667	0.2806	-89.2781	-93.8623	0.2429	0.2369
0.2197	2.1111	532	0.4800	-1.5304	-2.2029	0.75	0.6724	-99.0731	-99.7390	0.0887	0.0802
0.1679	2.4127	608	0.4563	-2.1014	-2.9919	0.6667	0.8905	-106.9635	-105.4488	-0.0385	-0.0475
0.2841	2.7143	684	0.4478	-2.1727	-3.1141	0.75	0.9413	-108.1851	-106.1621	-0.0537	-0.0626
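
The table can be sanity-checked with a few lines of Python. The sketch below assumes the usual DPO logging convention, in which Rewards/margins is the mean chosen reward minus the mean rejected reward; the card itself does not state this. The names `ROWS`, `max_margin_gap`, and `best_step` are illustrative and not part of any library, and the values are copied from the table above.

```python
# Each row, copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (76, 0.6991, 0.0563, 0.0630, -0.0067),
    (152, 0.6695, -0.3797, -0.4650, 0.0853),
    (228, 0.6343, -0.5747, -0.8035, 0.2288),
    (304, 0.5906, -0.6785, -1.0749, 0.3964),
    (380, 0.5523, -1.2776, -1.8289, 0.5513),
    (456, 0.6087, -0.9428, -1.2234, 0.2806),
    (532, 0.4800, -1.5304, -2.2029, 0.6724),
    (608, 0.4563, -2.1014, -2.9919, 0.8905),
    (684, 0.4478, -2.1727, -3.1141, 0.9413),
]


def max_margin_gap(rows):
    """Largest |margin - (chosen - rejected)| over all rows.

    Assumption: margins follow the common DPO convention
    margin = chosen reward - rejected reward.
    """
    return max(abs(margin - (chosen - rejected))
               for _, _, chosen, rejected, margin in rows)


def best_step(rows):
    """Step of the evaluation with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print(f"max margin gap: {max_margin_gap(ROWS):.4f}")
    print(f"lowest validation loss at step {best_step(ROWS)}")
```

Under that assumption the reported margins agree with chosen minus rejected to within rounding of the four-decimal values, and the lowest validation loss falls on the last logged evaluation step.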