Summary

Distilled with the Distily library from the teacher model gpt2, using the wikimedia/wikipedia dataset.

Model Architecture:

Architecture: GPT2LMHeadModel
Total Parameters: 124,439,808
Data Type (dtype): torch.bfloat16
Model Size: 0.24 GB

Evaluation Metrics Comparison

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		36.25	77.0					11.75	21.375
0	0	10788957847552.0	93458488360960.0	23.9652	41.1128	60.808	7.613	3539992576.0	57174604644352.0
2500	0.0404	888.0	5536.0	3.2958	40.0823	62.372	7.809	492.0	4576.0
5000	0.0808	380.0	1448.0	2.4808	41.6839	59.975	7.509	255.0	400.0
7500	0.1212	250.0	748.0	2.1083	44.1725	56.596	7.086	197.0	233.0
10000	0.1616	189.0	616.0	1.8890	43.9453	56.889	7.122	156.0	216.0
12500	0.2020	140.0	488.0	1.6027	42.1657	59.29	7.423	119.0	178.0
15000	0.2424	113.5	434.0	1.4410	42.3062	59.093	7.398	94.0	183.0
17500	0.2828	92.5	340.0	1.3090	42.413	58.944	7.38	76.5	165.0
20000	0.3232	79.5	308.0	1.1661	40.1951	62.197	7.787	73.0	151.0
22500	0.3636	68.0	229.0	0.9997	41.1581	60.741	7.605	56.75	122.5
25000	0.4040	63.25	201.0	0.9359	40.9228	61.091	7.649	50.75	99.5
27500	0.4444	59.25	218.0	0.8936	40.1195	62.314	7.802	46.25	116.5
30000	0.4848	59.25	204.0	0.8841	42.297	59.106	7.4	49.75	87.0
32500	0.5253	57.5	184.0	0.8730	40.8597	61.185	7.66	44.25	101.5
35000	0.5657	56.0	177.0	0.8049	44.9443	55.624	6.964	39.75	62.25
37500	0.6061	55.0	163.0	0.7798	44.8966	55.684	6.972	43.5	93.5
40000	0.6465	52.0	166.0	0.7611	40.5252	61.69	7.724	37.25	73.5
42500	0.6869	51.5	159.0	0.7336	41.7519	59.878	7.497	38.5	70.0
45000	0.7273	46.25	143.0	0.6241	40.2456	62.119	7.777	32.25	54.5
47500	0.7677	45.75	136.0	0.5998	42.1189	59.356	7.431	31.5	43.75
50000	0.8081	45.25	135.0	0.5841	40.1272	62.302	7.8	31.0	43.75
52500	0.8485	44.25	128.0	0.5705	41.9206	59.637	7.466	31.25	43.25
55000	0.8889	43.5	125.5	0.5532	40.1106	62.328	7.803	29.875	38.25
57500	0.9293	43.5	125.5	0.5470	40.2997	62.035	7.767	29.875	38.0
60000	0.9697	43.5	126.0	0.5432	39.9729	62.542	7.83	29.625	37.5
61875	1.0	43.5	126.0	0.5426	41.9287	59.625	7.465	29.625	37.5
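
To make the final row easier to read against the teacher, the sketch below expresses each student perplexity as a multiple of the teacher's and checks that the epoch column is simply step divided by the final step count. The helper names `ppl_ratio` and `epoch_at` are illustrative and not part of Distily; the numbers are copied from the table above.

```python
# Teacher row and final student row (step 61875), copied from the table above.
teacher = {"enwikippl": 36.25, "frwikippl": 77.0,
           "tinystoriesppl": 11.75, "zhwikippl": 21.375}
student = {"enwikippl": 43.5, "frwikippl": 126.0,
           "tinystoriesppl": 29.625, "zhwikippl": 37.5}


def ppl_ratio(student_ppl, teacher_ppl):
    """Student perplexity as a multiple of teacher perplexity (1.0 means parity)."""
    return student_ppl / teacher_ppl


def epoch_at(step, total_steps=61875):
    """Fraction of the single training epoch completed at a given step."""
    return round(step / total_steps, 4)


ratios = {name: round(ppl_ratio(student[name], teacher[name]), 2)
          for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x teacher")
```

On these figures the student ends closest to the teacher on enwikippl, the same language and source as the training data, and furthest on tinystoriesppl.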

Resource Usage Comparison

VRAM Use: 7.7831 GB

Distillation (Teacher -> Student) Architecture Difference:

Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
Total Parameters: 124,439,808 -> 124,439,808
Data Type (dtype): torch.bfloat16 -> torch.bfloat16
Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

Num Samples: 247,500
Subset: 20231101.en
Split: train
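
The sample count above is consistent with the dataset_sample_size and dataset_test_size values listed under Hyperparameters. The short sketch below reproduces that arithmetic, assuming the held-out evaluation samples are taken out of the 250,000 sampled rows; the variable names are illustrative.

```python
# Values copied from this card.
dataset_sample_size = 250_000
dataset_test_size = 0.01
train_tokens = 145_744_973

# Assumption: the evaluation split is carved out of the sampled rows.
eval_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - eval_samples

# Average number of tokens per training sample.
avg_tokens_per_sample = train_tokens / train_samples

# Cross-check against the step-0 evaluation row: runtime * samples_per_second.
eval_samples_from_table = round(41.1128 * 60.808)

print(train_samples, eval_samples, round(avg_tokens_per_sample, 1))
```

Both routes give the same evaluation set size, and the training samples average a little under 590 tokens each.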

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
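
As a rough illustration of how an objective of this shape combines its two terms, the sketch below adds a KL divergence on logits (weight 1) to a mean squared error on attention values (weight 25). It is plain Python on small lists so it runs without PyTorch, and it is not Distily's implementation: the function names are hypothetical, the KL direction (teacher to student) and the mean reduction are assumptions, and the layer_mapper setting is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's vocabulary distribution (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Mean squared error between two equally sized lists."""
    n = len(teacher_values)
    return sum((t - s) ** 2 for t, s in zip(teacher_values, student_values)) / n


def distillation_loss(teacher_logits, student_logits, teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=25.0):
    """Weighted sum of the logits term and the attention term."""
    return (logits_weight * kl_divergence(teacher_logits, student_logits)
            + attn_weight * mse(teacher_attn, student_attn))


# Toy example: identical logits, slightly different attention rows.
loss = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1],
                         [0.5, 0.5], [0.7, 0.3])
print(loss)
```

With a weight of 25, even small attention mismatches contribute noticeably to the total, which is one reason the reported loss is not directly comparable to a cross-entropy value.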

Hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
train_embeddings: True
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: None
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: True
student_model_compile: False
dropout: None
teacher_model_name_or_path: gpt2
teacher_load_in_8bit: False
teacher_load_in_4bit: False
teacher_model_compile: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 250000
dataset_test_size: 0.01
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.5
warmup_steps: 0
gradient_checkpointing: True
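
The step count in the evaluation table and the shape of the learning-rate curve both follow from the values above. The sketch below assumes a single device, assumes that warmup_steps: 0 defers to warmup_ratio, and models the linear scheduler as a linear ramp up to learning_rate followed by a linear decay to zero. `linear_schedule_lr` is an illustrative helper, not a Distily or Transformers function.

```python
import math

# Values copied from this card.
train_samples = 247_500
train_batch_size = 4
gradient_accumulation_steps = 1
learning_rate = 1e-4
warmup_ratio = 0.5

# Assumption: single device, so one optimizer step consumes one batch.
total_steps = train_samples // (train_batch_size * gradient_accumulation_steps)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = total_steps - step
    return learning_rate * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 15_000, warmup_steps, 45_000, total_steps):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks roughly halfway through the run, so the first half of the evaluation table was logged while the rate was still rising.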

Framework Versions

Distily 0.2.0
Transformers 4.44.2
Pytorch 2.3.0
Datasets 2.21.0
