
distily_TinyStories-33M

This student model was distilled from the teacher model roneneldan/TinyStories-33M; the distillation dataset is unspecified.

The Distily library was used for this distillation.
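
As a minimal usage sketch (assuming the Hub repository id distily/distily_TinyStories-33M and the standard Transformers causal-LM API):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled student from the Hub.
repo_id = "distily/distily_TinyStories-33M"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short TinyStories-style continuation.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```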

It achieves the following results on the evaluation set:

  • eval_enwikippl: 5885.9341
  • eval_frwikippl: 24294.9414
  • eval_zhwikippl: 264331.3438
  • eval_loss: 0.3987
  • eval_runtime: 51.5838
  • eval_samples_per_second: 48.465
  • eval_steps_per_second: 6.068
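
The eval_*ppl figures are perplexities on English, French, and Chinese Wikipedia text. As a generic sketch of what such a metric measures (not Distily's exact evaluation code, which may tokenize and normalize differently), perplexity is the exponential of the mean token-level cross-entropy:

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    # Perplexity = exp(mean cross-entropy over predicted tokens).
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the causal LM return the shifted
        # cross-entropy loss averaged over tokens.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```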

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(
      logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None),
      hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None),
      attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)
    ) (only the logits KL component is active; see the sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
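
The distillation objective above keeps only the logits KL component (weight 1); the hidden-state and attention components have weight 0 and are disabled. A minimal sketch of such a loss, as a hypothetical re-implementation rather than Distily's internal code:

```python
import torch.nn.functional as F

def logits_kl_loss(student_logits, teacher_logits):
    # KL(teacher || student) over the vocabulary dimension, averaged
    # over the batch. Only this component is active; the hs and attn
    # terms (weight=0) are omitted.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```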

Resource Usage

Peak GPU Memory: 8.1416 GB
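
For reference, a peak-allocation figure like this can be read from PyTorch's CUDA allocator; a generic measurement sketch (not necessarily how Distily reports it):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training or evaluation loop ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.4f} GB")
```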

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 20633.1680 | 131577.2812 | | | | | 7615.4468 |
| 0 | 0 | 55266.375 | 57180.4375 | 6.2843 | 26.4237 | 94.612 | 11.845 | 56806.5430 |
| 1000 | 0.0323 | 11414.3389 | 87921.1172 | 0.7142 | 26.3405 | 94.911 | 11.883 | 611931.1875 |
| 2000 | 0.0646 | 8814.8682 | 53295.2305 | 0.6287 | 51.0412 | 48.98 | 6.132 | 507315.5625 |
| 3000 | 0.0970 | 8020.6040 | 41652.3320 | 0.5662 | 29.4187 | 84.98 | 10.639 | 268242.625 |
| 4000 | 0.1293 | 7153.7090 | 33178.5977 | 0.5197 | 40.0478 | 62.425 | 7.816 | 315367.9062 |
| 5000 | 0.1616 | 6865.2617 | 31042.1875 | 0.4833 | 36.655 | 68.203 | 8.539 | 372857.25 |
| 6000 | 0.1939 | 6828.5781 | 30924.2324 | 0.4539 | 47.1811 | 52.987 | 6.634 | 379690.5 |
| 7000 | 0.2263 | 6329.1855 | 28375.3984 | 0.4331 | 51.6027 | 48.447 | 6.066 | 325812.875 |
| 8000 | 0.2586 | 6229.7119 | 28592.2773 | 0.4123 | 51.6184 | 48.432 | 6.064 | 318159.5 |
| 9000 | 0.2909 | 5885.9341 | 24294.9414 | 0.3987 | 51.5838 | 48.465 | 6.068 | 264331.3438 |
| 10000 | 0.3232 | 5634.5898 | 24401.3828 | 0.3856 | 51.6233 | 48.428 | 6.063 | 248118.4062 |
| 11000 | 0.3555 | 5849.9346 | 26113.8555 | 0.3761 | 51.5949 | 48.454 | 6.066 | 255583.9844 |
| 12000 | 0.3879 | 5588.8325 | 23138.0430 | 0.3666 | 51.5384 | 48.508 | 6.073 | 255106.6875 |
| 13000 | 0.4202 | 5498.4355 | 23102.1699 | 0.3618 | 51.6778 | 48.377 | 6.057 | 244239.3125 |
| 14000 | 0.4525 | 5495.8716 | 24775.8398 | 0.3530 | 51.4537 | 48.587 | 6.083 | 271776.25 |
| 15000 | 0.4848 | 5449.1309 | 23173.9512 | 0.3490 | 51.6347 | 48.417 | 6.062 | 235716.0625 |
| 16000 | 0.5172 | 5464.8057 | 25348.3184 | 0.3430 | 48.3546 | 51.701 | 6.473 | 305992.3125 |
| 17000 | 0.5495 | 5289.8618 | 23652.6602 | 0.3426 | 45.4673 | 54.985 | 6.884 | 290930.0625 |
| 18000 | 0.5818 | 5362.6548 | 23393.9375 | 0.3378 | 42.8681 | 58.318 | 7.301 | 237739.0938 |
| 19000 | 0.6141 | 5970.6357 | 32165.1016 | 0.3332 | 38.4757 | 64.976 | 8.135 | 492760.0312 |
| 20000 | 0.6465 | 5680.7217 | 30225.7988 | 0.3322 | 31.9943 | 78.139 | 9.783 | 391742.4062 |
| 21000 | 0.6788 | 5494.1685 | 27750.1914 | 0.3288 | 49.7191 | 50.283 | 6.295 | 288762.6875 |
| 22000 | 0.7111 | 5693.0815 | 24919.4883 | 0.3272 | 49.6244 | 50.378 | 6.307 | 263274.4375 |
| 23000 | 0.7434 | 5303.4346 | 25441.4375 | 0.3230 | 50.6137 | 49.394 | 6.184 | 261801.9844 |
| 24000 | 0.7757 | 5458.4463 | 26499.6543 | 0.3217 | 51.4227 | 48.617 | 6.087 | 229626.5781 |
| 25000 | 0.8081 | 5728.1162 | 28263.5859 | 0.3203 | 51.6717 | 48.382 | 6.057 | 258605.3594 |
| 26000 | 0.8404 | 5226.1689 | 23493.1152 | 0.3186 | 51.4811 | 48.562 | 6.08 | 180660.6719 |
| 27000 | 0.8727 | 5192.1890 | 22039.3262 | 0.3165 | 51.6376 | 48.414 | 6.061 | 194013.875 |
| 28000 | 0.9050 | 5418.7476 | 22450.2344 | 0.3169 | 51.6539 | 48.399 | 6.06 | 182503.5312 |
| 29000 | 0.9374 | 5170.8613 | 23860.3691 | 0.3141 | 51.4944 | 48.549 | 6.078 | 197516.9531 |
| 30000 | 0.9697 | 5569.3379 | 25081.6641 | 0.3130 | 51.3337 | 48.701 | 6.097 | 160202.3281 |
| 30938 | 1.0 | 5306.7280 | 25078.125 | 0.3130 | 51.5266 | 48.519 | 6.075 | 179410.5625 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0
Model size: 68.5M params (Safetensors, BF16)
