distily_bench_obj_cross_v2

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

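It achieves the following results on the evaluation set at the final training step (12375):

enwikippl: 5868.0605
frwikippl: 32990.6758
loss: 4.9183
runtime: 13.0038
samples_per_second: 76.9
steps_per_second: 9.613
tinystoriesppl: 2296.9883
zhwikippl: 54815.0078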

Training procedure

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0004
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_restarts
num_epochs: 1.0
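
The distillation_objective listed above weights a KL loss on the logits by 1, an MSE loss on the hidden states by 2.0, and disables the attention term (weight 0). The sketch below shows how such a weighted sum could be computed for a single token position. It is not Distily's implementation: the function names, the KL direction (teacher relative to student), the mean reduction in the MSE, and the absence of any layer mapping or projection are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def mse(teacher_hs, student_hs):
    """Mean squared error between two hidden-state vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(teacher_hs, student_hs)) / len(teacher_hs)


def distillation_loss(teacher_logits, student_logits, teacher_hs, student_hs,
                      logits_weight=1.0, hs_weight=2.0):
    """Weighted sum of the logits and hidden-state terms.

    The default weights mirror the card (logits: 1, hs: 2.0). The attention
    term has weight 0 in this run, so it is omitted entirely.
    """
    loss = logits_weight * kl_divergence(teacher_logits, student_logits)
    loss += hs_weight * mse(teacher_hs, student_hs)
    return loss
```

When the student matches the teacher exactly, both terms vanish and the loss is 0. Any hidden-state mismatch counts twice as much as it would with a weight of 1.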

Peak GPU Memory: 8.1729 GB

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	34961.5352	67685.8906	6.4082	13.0484	76.638	9.58	22307.2852	64899.9219
1000	0.0808	5926.0762	32934.9414	4.9183	13.0602	76.568	9.571	2332.1941	55049.4961
2000	0.1616	5852.6240	32990.6758	4.9180	13.0475	76.643	9.58	2287.5139	54785.7930
3000	0.2424	5843.5669	32990.6758	4.9177	13.045	76.658	9.582	2281.4705	54785.7930
4000	0.3232	5878.9780	32990.6758	4.9180	13.0627	76.554	9.569	2303.0730	54815.0078
5000	0.4040	5868.0605	32990.6758	4.9180	13.0226	76.789	9.599	2295.0898	54815.0078
6000	0.4848	5867.1484	32990.6758	4.9180	13.0139	76.841	9.605	2291.6780	54785.7930
7000	0.5657	5869.8799	32990.6758	4.9177	13.0183	76.815	9.602	2297.7485	54815.0078
8000	0.6465	5868.0605	32990.6758	4.9180	13.084	76.429	9.554	2294.3315	54815.0078
9000	0.7273	5868.0605	32990.6758	4.9180	13.0935	76.374	9.547	2293.1941	54785.7930
10000	0.8081	5845.3784	32990.6758	4.9177	13.0045	76.896	9.612	2282.6021	54785.7930
11000	0.8889	5848.9976	32990.6758	4.9177	13.0015	76.914	9.614	2284.8682	54785.7930
12000	0.9697	5868.0605	32990.6758	4.9183	13.0386	76.696	9.587	2296.9883	54815.0078
12375	1.0	5868.0605	32990.6758	4.9183	13.0038	76.9	9.613	2296.9883	54815.0078
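
The bookkeeping columns in the table above can be cross-checked with simple arithmetic. The sketch below recomputes the epoch fraction from the step count, estimates the evaluation set size from runtime and throughput, and compares the student's final TinyStories perplexity with the teacher's. The helper names are hypothetical, runtime is assumed to be in seconds, and the training sample count assumes a single device with no gradient accumulation, neither of which the card states.

```python
TOTAL_STEPS = 12375      # final step in the table
TRAIN_BATCH_SIZE = 8     # train_batch_size from the hyperparameters


def epoch_fraction(step, total_steps=TOTAL_STEPS):
    """Fraction of the single training epoch completed at a given step."""
    return step / total_steps


def eval_samples(runtime_s, samples_per_second):
    """Approximate number of samples processed in one evaluation pass."""
    return runtime_s * samples_per_second


def ppl_ratio(student_ppl, teacher_ppl):
    """How many times higher the student's perplexity is than the teacher's."""
    return student_ppl / teacher_ppl


# Assumption: one device, no gradient accumulation.
train_samples = TOTAL_STEPS * TRAIN_BATCH_SIZE

print(round(epoch_fraction(1000), 4))         # 0.0808, matching the step-1000 row
print(round(eval_samples(13.0484, 76.638)))   # 1000
print(train_samples)                          # 99000
print(ppl_ratio(2296.9883, 3.9789))           # student vs. teacher on tinystoriesppl
```

Under these assumptions, each evaluation pass covers roughly 1000 samples and the run sees about 99000 training samples.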