
Llama-2-7b-gen-dpo-10k

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5523
  • Rewards/real: 4.9937
  • Rewards/generated: 2.9393
  • Rewards/accuracies: 0.8462
  • Rewards/margins: 2.0544
  • Logps/generated: -253.2762
  • Logps/real: -213.7525
  • Logits/generated: -0.8742
  • Logits/real: -0.8781
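
These reward metrics follow the usual DPO-style bookkeeping: Rewards/real and Rewards/generated are the implicit rewards assigned to the preferred ("real") and rejected ("generated") responses, Rewards/margins is their difference, and Rewards/accuracies is the fraction of pairs where the real response is rewarded more. A minimal sketch of those relationships using the final evaluation numbers above (the DPO reading is an inference from the metric names, not stated elsewhere on this card):

```python
# Sketch only: the DPO interpretation of these metrics is an assumption based on
# the metric names; the numbers are copied from the evaluation results above.
rewards_real = 4.9937        # implicit reward for the preferred ("real") responses
rewards_generated = 2.9393   # implicit reward for the rejected ("generated") responses

margin = rewards_real - rewards_generated
print(round(margin, 4))      # 2.0544 -> matches Rewards/margins

# Rewards/accuracies (0.8462) is the per-pair average of this indicator:
pair_correct = float(rewards_real > rewards_generated)
```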

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-07
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
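
For reference, a minimal sketch of how these hyperparameters could be expressed with transformers.TrainingArguments; the output_dir and bf16 flag are assumptions (placeholders), not taken from the original training script:

```python
from transformers import TrainingArguments

# Sketch under the hyperparameters listed above; output_dir and bf16 are assumptions.
training_args = TrainingArguments(
    output_dir="Llama-2-7b-gen-dpo-10k",  # placeholder
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,  # assumed from the BF16 checkpoint; the Adam betas/epsilon above are the library defaults
)
# Effective batch sizes: 4 GPUs x 4 per device x 2 accumulation = 32 (train); 4 GPUs x 4 = 16 (eval).
```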

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/real | Rewards/generated | Rewards/accuracies | Rewards/margins | Logps/generated | Logps/real | Logits/generated | Logits/real |
|:-------------:|:------:|:----:|:---------------:|:------------:|:-----------------:|:------------------:|:---------------:|:---------------:|:----------:|:----------------:|:-----------:|
| 0.9274        | 0.1984 | 62   | 0.9072          | 0.3392       | 0.2823            | 0.5769             | 0.0569          | -279.8465       | -260.2975  | -0.8569          | -0.8220     |
| 0.7991        | 0.3968 | 124  | 0.7609          | 1.1652       | 0.5108            | 0.75               | 0.6543          | -277.5608       | -252.0375  | -0.7345          | -0.6920     |
| 0.7105        | 0.5952 | 186  | 0.6948          | 2.1035       | 1.1496            | 0.75               | 0.9539          | -271.1730       | -242.6541  | -0.7212          | -0.6757     |
| 0.6956        | 0.7936 | 248  | 0.6513          | 2.5451       | 1.4131            | 0.7692             | 1.1320          | -268.5384       | -238.2380  | -0.7591          | -0.7111     |
| 0.6502        | 0.992  | 310  | 0.6210          | 2.9937       | 1.6865            | 0.8269             | 1.3073          | -265.8045       | -233.7518  | -0.8166          | -0.7765     |
| 0.5016        | 1.1904 | 372  | 0.5914          | 3.5957       | 2.0722            | 0.8269             | 1.5235          | -261.9469       | -227.7318  | -0.8192          | -0.7953     |
| 0.5296        | 1.3888 | 434  | 0.5809          | 4.1921       | 2.5569            | 0.8462             | 1.6352          | -257.1006       | -221.7682  | -0.8477          | -0.8234     |
| 0.4344        | 1.5872 | 496  | 0.5769          | 4.4690       | 2.6897            | 0.8462             | 1.7792          | -255.7717       | -218.9994  | -0.8474          | -0.8298     |
| 0.513         | 1.7856 | 558  | 0.5656          | 4.6486       | 2.8940            | 0.8462             | 1.7546          | -253.7296       | -217.2037  | -0.8719          | -0.8539     |
| 0.4632        | 1.984  | 620  | 0.5639          | 4.7129       | 2.9278            | 0.8462             | 1.7851          | -253.3908       | -216.5599  | -0.8339          | -0.8251     |
| 0.391         | 2.1824 | 682  | 0.5555          | 4.8380       | 2.9011            | 0.8462             | 1.9369          | -253.6578       | -215.3090  | -0.8728          | -0.8686     |
| 0.3823        | 2.3808 | 744  | 0.5525          | 4.9421       | 2.9736            | 0.8462             | 1.9685          | -252.9326       | -214.2682  | -0.8613          | -0.8617     |
| 0.3705        | 2.5792 | 806  | 0.5512          | 4.9861       | 2.9682            | 0.8654             | 2.0178          | -252.9866       | -213.8285  | -0.8641          | -0.8686     |
| 0.3718        | 2.7776 | 868  | 0.5555          | 5.0071       | 2.9724            | 0.8462             | 2.0347          | -252.9452       | -213.6185  | -0.8636          | -0.8680     |
| 0.4001        | 2.976  | 930  | 0.5523          | 4.9937       | 2.9393            | 0.8462             | 2.0544          | -253.2762       | -213.7525  | -0.8742          | -0.8781     |
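
As a back-of-the-envelope check on the run length (an inference from the table and the hyperparameters above, not stated on the card): roughly 310 optimizer steps per epoch at an effective batch size of 32 corresponds to about 9,900 preference pairs, consistent with the "10k" in the model name.

```python
# Rough sanity check; the "10k" interpretation is an assumption, not from the card.
total_steps, epochs = 930, 3
steps_per_epoch = total_steps / epochs    # 310
effective_train_batch = 4 * 4 * 2         # per-device batch * GPUs * grad accumulation
approx_examples = steps_per_epoch * effective_train_batch
print(approx_examples)                    # 9920.0 -> ~10k training pairs
```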

Framework versions

  • Transformers 4.43.3
  • PyTorch 2.2.2+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1
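
A minimal usage sketch, assuming the checkpoint loads through the standard Llama-2 causal-LM path in transformers; the prompt and generation settings are illustrative, not from the original evaluation:

```python
# Sketch only: loading and prompt are illustrative; bfloat16 is assumed from the
# BF16 checkpoint, and device_map="auto" requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AmberYifan/Llama-2-7b-gen-dpo-10k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain direct preference optimization in one sentence.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```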