Llama-2-7b-dpo-10k

This model is a version of meta-llama/Llama-2-7b-hf fine-tuned with DPO (Direct Preference Optimization) on an unknown dataset. It achieves the following results on the evaluation set (a sketch of how the reward metrics are derived follows the list):

  • Loss: 0.7215
  • Rewards/real: 5.3782
  • Rewards/generated: 4.9113
  • Rewards/accuracies: 0.6923
  • Rewards/margins: 0.4668
  • Logps/generated: -113.1980
  • Logps/real: -125.7774
  • Logits/generated: -1.1385
  • Logits/real: -1.0466
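
The Rewards/* entries are the implicit rewards of DPO: β times the gap between the policy's and the reference model's log-probability of a completion, computed for the preferred ("real") and model-generated completions. A minimal sketch of how these metrics relate, assuming the standard DPO formulation; the β value is hypothetical, since the card does not report the one actually used:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # hypothetical; the card does not report the beta actually used

def dpo_metrics(policy_logps_real, ref_logps_real,
                policy_logps_gen, ref_logps_gen):
    """Implicit DPO rewards and the metrics listed above, given summed
    per-sequence log-probabilities (the Logps/* quantities) from the
    policy and the frozen reference model."""
    rewards_real = beta * (policy_logps_real - ref_logps_real)
    rewards_generated = beta * (policy_logps_gen - ref_logps_gen)
    margins = rewards_real - rewards_generated    # Rewards/margins
    accuracies = (margins > 0).float().mean()     # Rewards/accuracies
    loss = -F.logsigmoid(margins).mean()          # DPO loss
    return rewards_real, rewards_generated, margins, accuracies, loss
```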

Model description

More information needed

Intended uses & limitations

More information needed
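
The card does not document intended uses, but as a causal language model the checkpoint can be loaded for text generation with the standard transformers API. A minimal sketch, assuming the repository id of this model; adjust dtype and device handling to your setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AmberYifan/Llama-2-7b-dpo-10k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint is stored in BF16
    device_map="auto",           # requires accelerate
)

prompt = "The key idea behind direct preference optimization is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```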

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
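
For reference, a sketch of how these values map onto transformers.TrainingArguments; the actual training script is not published, and the DPO-specific settings (trainer, β) are not documented here, so the output_dir and bf16 flag below are assumptions:

```python
from transformers import TrainingArguments

# Per-device batch sizes; with 4 GPUs and gradient accumulation of 2 these
# give the effective sizes listed above (train 4*4*2 = 32, eval 4*4 = 16).
args = TrainingArguments(
    output_dir="llama-2-7b-dpo-10k",  # hypothetical
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
    bf16=True,  # assumed, consistent with the BF16 checkpoint
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the optimizer defaults.
)
```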

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/real | Rewards/generated | Rewards/accuracies | Rewards/margins | Logps/generated | Logps/real | Logits/generated | Logits/real |
|---------------|--------|------|-----------------|--------------|-------------------|--------------------|-----------------|-----------------|------------|------------------|-------------|
| 0.8559        | 0.1984 | 62   | 0.8605          | 0.4128       | 0.4099            | 0.4808             | 0.0029          | -158.2126       | -175.4314  | -0.8219          | -0.6123     |
| 0.7999        | 0.3968 | 124  | 0.8323          | 1.5863       | 1.5154            | 0.5192             | 0.0709          | -147.1573       | -163.6966  | -0.8057          | -0.6067     |
| 0.7846        | 0.5952 | 186  | 0.7979          | 2.4470       | 2.3135            | 0.5577             | 0.1335          | -139.1767       | -155.0893  | -0.8686          | -0.6862     |
| 0.7916        | 0.7936 | 248  | 0.7819          | 3.0117       | 2.8464            | 0.6346             | 0.1653          | -133.8475       | -149.4422  | -0.9049          | -0.7322     |
| 0.7714        | 0.992  | 310  | 0.7630          | 3.4214       | 3.1941            | 0.6346             | 0.2273          | -130.3704       | -145.3455  | -0.9511          | -0.7905     |
| 0.678         | 1.1904 | 372  | 0.7552          | 3.9523       | 3.6931            | 0.6538             | 0.2592          | -125.3802       | -140.0360  | -0.9800          | -0.8279     |
| 0.6337        | 1.3888 | 434  | 0.7464          | 4.4541       | 4.1602            | 0.6346             | 0.2939          | -120.7093       | -135.0177  | -1.0279          | -0.8860     |
| 0.6575        | 1.5872 | 496  | 0.7352          | 4.8501       | 4.4918            | 0.6538             | 0.3583          | -117.3935       | -131.0585  | -1.0562          | -0.9285     |
| 0.6606        | 1.7856 | 558  | 0.7270          | 5.1119       | 4.7485            | 0.6538             | 0.3634          | -114.8267       | -128.4403  | -1.0969          | -0.9780     |
| 0.6319        | 1.984  | 620  | 0.7260          | 5.2581       | 4.8563            | 0.6538             | 0.4018          | -113.7479       | -126.9782  | -1.0953          | -0.9815     |
| 0.552         | 2.1824 | 682  | 0.7295          | 5.3469       | 4.9377            | 0.6731             | 0.4092          | -112.9344       | -126.0898  | -1.1133          | -1.0072     |
| 0.5541        | 2.3808 | 744  | 0.7229          | 5.4093       | 4.9819            | 0.6923             | 0.4274          | -112.4924       | -125.4664  | -1.1322          | -1.0330     |
| 0.5342        | 2.5792 | 806  | 0.7246          | 5.3967       | 4.9520            | 0.6923             | 0.4447          | -112.7909       | -125.5919  | -1.1353          | -1.0397     |
| 0.5318        | 2.7776 | 868  | 0.7229          | 5.3656       | 4.9040            | 0.6731             | 0.4615          | -113.2710       | -125.9033  | -1.1367          | -1.0427     |
| 0.5396        | 2.976  | 930  | 0.7215          | 5.3782       | 4.9113            | 0.6923             | 0.4668          | -113.1980       | -125.7774  | -1.1385          | -1.0466     |

Framework versions

  • Transformers 4.43.3
  • PyTorch 2.2.2+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1

Model size: 6.74B params (Safetensors, tensor type BF16)
