llama3.2-1B-dpo-v1

This model is a PEFT adapter for meta-llama/Llama-3.2-1B-Instruct, fine-tuned with DPO (Direct Preference Optimization) on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5037
  • Rewards/chosen: -1.5976
  • Rewards/rejected: -4.4612
  • Rewards/accuracies: 0.7913
  • Rewards/margins: 2.8637
  • Logps/rejected: -449.0548
  • Logps/chosen: -492.4157
  • Logits/rejected: -0.4987
  • Logits/chosen: -0.2456
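
Because this repository is distributed as a PEFT adapter rather than a full model, it needs to be applied on top of the base Instruct model at load time. The snippet below is a minimal sketch of one way to do this; the prompt content and generation settings are illustrative and not part of the original card.

```python
# Minimal loading sketch (not part of the original card): apply this repository
# as a PEFT adapter on top of the base Instruct model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "BBexist/llama3.2-1B-dpo-v1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)

# Illustrative prompt; use the base model's chat template for inference.
messages = [{"role": "user", "content": "Summarize what DPO fine-tuning does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```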

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training; a hedged sketch of an equivalent TRL setup follows the list:

  • learning_rate: 5e-05
  • train_batch_size: 6
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
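
The training script itself is not included in this card. The sketch below shows how the hyperparameters above might map onto a TRL `DPOConfig`/`DPOTrainer` run; the TRL version, the DPO `beta`, the exact PEFT adapter configuration, and the training dataset are not reported, so those parts are labeled as assumptions or stand-ins.

```python
# Hedged sketch only: the actual training code is not part of this card.
# The DPO beta, the PEFT/LoRA settings, and the dataset are assumptions;
# the toy preference pairs below stand in for the (unknown) training data.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # common workaround; not reported in the card

train_prefs = Dataset.from_dict({
    "prompt": ["What does DPO optimize?"],
    "chosen": ["It directly optimizes a policy on preference pairs."],
    "rejected": ["No idea."],
})

training_args = DPOConfig(
    output_dir="llama3.2-1B-dpo-v1",
    learning_rate=5e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=4,
    seed=42,
    optim="adamw_torch",          # Adam with betas=(0.9, 0.999), eps=1e-8, as listed above
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    # beta is left at the TRL default; the value used for this model is not reported.
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_prefs,
    eval_dataset=train_prefs,                       # placeholder; real eval split not reported
    tokenizer=tokenizer,                            # newer TRL releases use `processing_class=`
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # adapter type/config assumed, not reported
)
trainer.train()
```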

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5478        | 0.0763 | 700  | 0.5414          | -0.1038        | -1.2895          | 0.7302             | 1.1857          | -417.3377      | -477.4783    | 0.2274          | 0.3639        |
| 0.5243        | 0.1527 | 1400 | 0.6105          | -1.3917        | -3.5270          | 0.7313             | 2.1353          | -439.7127      | -490.3575    | 0.3525          | 0.5136        |
| 0.6483        | 0.2290 | 2100 | 0.6370          | -3.1503        | -5.7506          | 0.7482             | 2.6003          | -461.9483      | -507.9432    | 0.2785          | 0.4243        |
| 0.687         | 0.3053 | 2800 | 0.5835          | -0.2196        | -2.3802          | 0.7391             | 2.1606          | -428.2447      | -478.6364    | 0.3201          | 0.4561        |
| 0.5813        | 0.3816 | 3500 | 0.5808          | -0.6116        | -3.0983          | 0.7609             | 2.4868          | -435.4256      | -482.5557    | 0.0172          | 0.2140        |
| 0.7066        | 0.4580 | 4200 | 0.5681          | -1.1058        | -3.4796          | 0.7564             | 2.3738          | -439.2385      | -487.4986    | 0.0611          | 0.2653        |
| 0.6408        | 0.5343 | 4900 | 0.5910          | -0.7319        | -3.5281          | 0.7659             | 2.7962          | -439.7232      | -483.7594    | -0.0603         | 0.1582        |
| 0.4565        | 0.6106 | 5600 | 0.5367          | -1.0688        | -3.9321          | 0.7867             | 2.8633          | -443.7639      | -487.1283    | -0.1924         | 0.0583        |
| 0.5482        | 0.6869 | 6300 | 0.5267          | -1.4234        | -4.2466          | 0.7888             | 2.8232          | -446.9083      | -490.6742    | -0.4528         | -0.2006       |
| 0.5196        | 0.7633 | 7000 | 0.5322          | -2.2017        | -5.1279          | 0.7888             | 2.9261          | -455.7211      | -498.4576    | -0.6046         | -0.3654       |
| 0.4858        | 0.8396 | 7700 | 0.5116          | -2.0986        | -5.0640          | 0.7938             | 2.9653          | -455.0820      | -497.4264    | -0.5200         | -0.2768       |
| 0.4581        | 0.9159 | 8400 | 0.5051          | -1.6669        | -4.5613          | 0.7913             | 2.8944          | -450.0557      | -493.1090    | -0.5097         | -0.2589       |
| 0.3934        | 0.9923 | 9100 | 0.5037          | -1.5976        | -4.4612          | 0.7913             | 2.8637          | -449.0548      | -492.4157    | -0.4987         | -0.2456       |

Framework versions

  • PEFT 0.8.2
  • Transformers 4.45.1
  • Pytorch 2.1.2+cu121
  • Datasets 2.16.1
  • Tokenizers 0.20.0