tinymistral-248-DPO

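This model is a fine-tuned version of Locutusque/TinyMistral-248M; the training dataset is not specified. At the final logged evaluation (step 120, epoch 5.71) it reaches a validation loss of 0.0045, a reward accuracy of 1.0, and a reward margin of 9.5706. The full evaluation history is in the training results table below.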

Model description

More information needed

The hyperparameters used during training are not listed. Training results, logged every 10 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5815	0.48	10	0.3205	0.7722	-0.2727	1.0	1.0449	-286.5494	-398.5646	-2.3562	-1.8620
0.3287	0.95	20	0.0970	1.0191	-1.8694	1.0	2.8886	-302.5168	-396.0956	-2.0547	-1.5790
0.2126	1.43	30	0.0414	0.3685	-4.5314	1.0	4.8999	-329.1370	-402.6024	-1.8100	-1.4099
0.1844	1.9	40	0.0260	0.9879	-4.8275	1.0	5.8153	-332.0973	-396.4084	-1.8704	-1.4976
0.1546	2.38	50	0.0190	1.1813	-5.2560	1.0	6.4373	-336.3821	-394.4740	-1.9098	-1.5582
0.1532	2.86	60	0.0140	1.0583	-6.0198	1.0	7.0780	-344.0201	-395.7045	-1.8920	-1.5654
0.1402	3.33	70	0.0112	1.0134	-6.5382	1.0	7.5517	-349.2049	-396.1526	-1.8823	-1.5706
0.1544	3.81	80	0.0089	0.8836	-7.1726	1.0	8.0562	-355.5490	-397.4513	-1.8518	-1.5535
0.1357	4.29	90	0.0072	0.7532	-7.7663	1.0	8.5195	-361.4852	-398.7546	-1.8193	-1.5345
0.1418	4.76	100	0.0061	0.6041	-8.3133	1.0	8.9174	-366.9556	-400.2459	-1.7889	-1.5150
0.1482	5.24	110	0.0051	0.4867	-8.7961	1.0	9.2828	-371.7837	-401.4203	-1.7611	-1.4971
0.141	5.71	120	0.0045	0.4212	-9.1494	1.0	9.5706	-375.3166	-402.0751	-1.7409	-1.4842
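
As a reading aid for the table above: in every row, Rewards/margins equals Rewards/chosen minus Rewards/rejected, up to rounding in the fourth decimal. The sketch below checks that relationship on the logged values. The names `ROWS`, `margin_gap`, and `check_rows` are hypothetical helpers written for this illustration and are not part of any training library. The numbers are copied from the table.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins), copied from the table above.
ROWS = [
    (10, 0.7722, -0.2727, 1.0449),
    (20, 1.0191, -1.8694, 2.8886),
    (30, 0.3685, -4.5314, 4.8999),
    (40, 0.9879, -4.8275, 5.8153),
    (50, 1.1813, -5.2560, 6.4373),
    (60, 1.0583, -6.0198, 7.0780),
    (70, 1.0134, -6.5382, 7.5517),
    (80, 0.8836, -7.1726, 8.0562),
    (90, 0.7532, -7.7663, 8.5195),
    (100, 0.6041, -8.3133, 8.9174),
    (110, 0.4867, -8.7961, 9.2828),
    (120, 0.4212, -9.1494, 9.5706),
]


def margin_gap(chosen, rejected, margin):
    """Absolute difference between the logged margin and chosen - rejected."""
    return abs((chosen - rejected) - margin)


def check_rows(rows, tol=1e-3):
    """Return the steps whose logged margin differs from chosen - rejected by more than tol."""
    return [step for step, c, r, m in rows if margin_gap(c, r, m) > tol]


if __name__ == "__main__":
    # An empty list means every logged margin matches chosen - rejected within tol.
    print(check_rows(ROWS))
```

The same rows show that the margin grows mainly because the rejected reward falls (from -0.2727 to -9.1494), while the chosen reward stays between about 0.37 and 1.18 throughout.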