---
base_model: gpt2
library_name: distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_optim
  results: []
---

# distily_bench_gpt2_optim

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using an unspecified dataset. The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 854.1211
- eval_frwikippl: 4984.0498
- eval_zhwikippl: 8071.4624
- eval_loss: 7592.3198
- eval_runtime: 22.093
- eval_samples_per_second: 45.263
- eval_steps_per_second: 11.316

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: 'legacy'
- loss_fn: kl
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

(An illustrative sketch of the `kl` distillation objective is included at the end of this card.)

### Resource Usage

Peak GPU Memory: 4.6175 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 331776.0 | 21.5321 | 46.442 | 11.611 | 57080.2930 |
| 500 | 0.0808 | 2252.7444 | 10031.8955 | 12590.8477 | 21.6698 | 46.147 | 11.537 | 54057.1758 |
| 1000 | 0.1616 | 1772.2999 | 6312.1328 | 10713.8564 | 21.7782 | 45.917 | 11.479 | 20088.1348 |
| 1500 | 0.2424 | 1526.4294 | 5867.9150 | 9839.7441 | 21.9107 | 45.64 | 11.41 | 12735.1455 |
| 2000 | 0.3232 | 1386.3311 | 5905.2700 | 9335.0400 | 21.6766 | 46.133 | 11.533 | 12883.1006 |
| 2500 | 0.4040 | 1285.9930 | 5870.4004 | 9057.4082 | 21.7814 | 45.911 | 11.478 | 12968.5391 |
| 3000 | 0.4848 | 1184.0109 | 5485.2373 | 8730.4961 | 21.825 | 45.819 | 11.455 | 11484.6025 |
| 3500 | 0.5657 | 1126.2782 | 5563.9180 | 8546.9443 | 21.8317 | 45.805 | 11.451 | 12904.6191 |
| 4000 | 0.6465 | 1054.3176 | 5538.6753 | 8247.4883 | 21.7217 | 46.037 | 11.509 | 12877.9414 |
| 4500 | 0.7273 | 994.4172 | 5374.4102 | 8100.9922 | 21.713 | 46.055 | 11.514 | 11938.875 |
| 5000 | 0.8081 | 946.4249 | 5192.8228 | 7962.6880 | 21.8256 | 45.818 | 11.454 | 9305.625 |
| 5500 | 0.8889 | 910.4888 | 5210.4282 | 7757.3120 | 22.1477 | 45.152 | 11.288 | 9479.9629 |
| 6000 | 0.9697 | 871.6422 | 5126.9775 | 7617.0239 | 22.8323 | 43.798 | 10.949 | 8336.0049 |
| 6187 | 0.9999 | 854.1211 | 4984.0498 | 7592.3198 | 22.093 | 45.263 | 11.316 | 8071.4624 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0
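
### Illustrative note on the `kl` objective

The `loss_fn: kl` entry under the training hyperparameters refers to a KL-divergence objective between the teacher's and the student's next-token distributions. The sketch below is a generic PyTorch illustration of that idea, not the Distily implementation; the function name, the temperature parameter (defaulting to 1.0), and the reduction choice are assumptions made for the example.

```python
# Minimal sketch of a KL-divergence distillation loss (illustration only,
# not the Distily code path used to train this model).
import torch
import torch.nn.functional as F


def kl_distillation_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    # Flatten (batch, seq_len, vocab) -> (batch * seq_len, vocab) so the
    # "batchmean" reduction averages the divergence per token position.
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    # The temperature**2 factor is the standard scaling from soft-label
    # distillation; with temperature=1.0 it has no effect.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2


# Example with GPT-2-sized logits: (batch, seq_len, vocab_size=50257).
student_logits = torch.randn(2, 8, 50257)
teacher_logits = torch.randn(2, 8, 50257)
loss = kl_distillation_loss(student_logits, teacher_logits)
```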