Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V1.0

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. Per-step results on the evaluation set are listed in the training results table below.

Model description

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6798	0.3021	50	0.7124	-0.0006	0.0284	0.5	-0.0290	-107.9596	-102.4519	-0.1526	-0.1158
0.622	0.6042	100	0.6927	0.0459	0.0462	0.375	-0.0004	-107.7811	-101.9873	-0.1568	-0.1192
0.6031	0.9063	150	0.6815	0.0644	0.0476	0.375	0.0168	-107.7677	-101.8019	-0.2034	-0.1651
0.3606	1.2085	200	0.8146	-0.7471	-0.7226	0.625	-0.0245	-115.4695	-109.9166	-0.3703	-0.3356
0.3387	1.5106	250	0.6641	-0.3875	-0.5323	0.625	0.1448	-113.5663	-106.3212	-0.3313	-0.2957
0.1549	1.8127	300	0.6263	-0.8093	-1.0444	0.625	0.2351	-118.6870	-110.5388	-0.3892	-0.3537
0.0958	2.1148	350	0.7394	-1.7451	-2.0072	0.5	0.2621	-128.3158	-119.8970	-0.5348	-0.5043
0.0193	2.4169	400	0.9249	-2.0984	-2.2555	0.5	0.1571	-130.7979	-123.4299	-0.6495	-0.6219
0.3616	2.7190	450	0.9877	-2.3284	-2.4506	0.5	0.1222	-132.7494	-125.7301	-0.6876	-0.6598
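
The table can be sanity-checked with a short script. The sketch below copies the step, validation loss, and reward columns from the rows above and checks two things: that each reported Rewards/margins value matches Rewards/chosen minus Rewards/rejected up to rounding, and which logged step has the lowest validation loss. The helper names (`margin_gaps`, `best_step`, `zero_margin_loss`) are illustrative and not part of any library. The zero-margin baseline assumes the standard sigmoid DPO loss, which this card does not confirm.

```python
import math

# Columns copied from the training results table:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (50, 0.7124, -0.0006, 0.0284, -0.0290),
    (100, 0.6927, 0.0459, 0.0462, -0.0004),
    (150, 0.6815, 0.0644, 0.0476, 0.0168),
    (200, 0.8146, -0.7471, -0.7226, -0.0245),
    (250, 0.6641, -0.3875, -0.5323, 0.1448),
    (300, 0.6263, -0.8093, -1.0444, 0.2351),
    (350, 0.7394, -1.7451, -2.0072, 0.2621),
    (400, 0.9249, -2.0984, -2.2555, 0.1571),
    (450, 0.9877, -2.3284, -2.4506, 0.1222),
]


def margin_gaps(rows):
    """Return |(chosen - rejected) - reported margin| for each row."""
    return [
        abs((chosen - rejected) - margin)
        for _, _, chosen, rejected, margin in rows
    ]


def best_step(rows):
    """Return (step, validation_loss) for the lowest validation loss."""
    step, loss, *_ = min(rows, key=lambda row: row[1])
    return step, loss


def zero_margin_loss():
    """Loss at zero reward margin, -log(sigmoid(0)) = ln 2.

    Assumption: the run used the standard sigmoid DPO loss.
    """
    return -math.log(1.0 / (1.0 + math.exp(0.0)))


if __name__ == "__main__":
    print("largest margin mismatch:", max(margin_gaps(ROWS)))
    print("best (step, validation loss):", best_step(ROWS))
    print("zero-margin baseline loss:", round(zero_margin_loss(), 4))
```

On these rows the lowest validation loss is 0.6263 at step 300. Validation loss rises at every later logged step while training loss drops well below it for most of that stretch, which may indicate overfitting. If intermediate checkpoints were saved, the one near step 300 could be worth comparing against the final weights.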