End of training
README.md
CHANGED
@@ -16,13 +16,13 @@ This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2)
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_loss:
-- eval_runtime: 17.
-- eval_samples_per_second:
-- eval_steps_per_second: 7.
+- eval_enwikippl: 539.2350
+- eval_frwikippl: 3470.0164
+- eval_zhwikippl: 15822.4590
+- eval_loss: 3.7526
+- eval_runtime: 17.2807
+- eval_samples_per_second: 57.868
+- eval_steps_per_second: 7.234
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -45,7 +45,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=
+- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=reverse_kl, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 4e-05
 - train_batch_size: 8
@@ -62,20 +62,20 @@ Peak GPU Memory: 8.0903 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
-| 0 | 0 | 55429.6875 | 57698.8047 |
-| 1000 | 0.0808 |
-| 2000 | 0.1616 |
-| 3000 | 0.2424 |
-| 4000 | 0.3232 |
-| 5000 | 0.4040 |
-| 6000 | 0.4848 |
-| 7000 | 0.5657 |
-| 8000 | 0.6465 |
-| 9000 | 0.7273 |
-| 10000 | 0.8081 |
-| 11000 | 0.8889 |
-| 12000 | 0.9697 |
-| 12375 | 1.0 |
+| 0 | 0 | 55429.6875 | 57698.8047 | 678.6000 | 17.2753 | 57.886 | 7.236 | 56988.9141 |
+| 1000 | 0.0808 | 2093.6777 | 12120.9648 | 7.9200 | 17.2744 | 57.889 | 7.236 | 152438.5625 |
+| 2000 | 0.1616 | 1304.2460 | 8493.5225 | 7.5052 | 17.3176 | 57.745 | 7.218 | 58375.3203 |
+| 3000 | 0.2424 | 924.3269 | 6221.1147 | 7.2972 | 17.3721 | 57.564 | 7.195 | 42663.6406 |
+| 4000 | 0.3232 | 765.8523 | 4975.8003 | 7.1927 | 17.3013 | 57.799 | 7.225 | 31888.1699 |
+| 5000 | 0.4040 | 677.1083 | 4361.1675 | 7.1110 | 17.3208 | 57.734 | 7.217 | 29598.4395 |
+| 6000 | 0.4848 | 634.8929 | 3934.5681 | 3.8812 | 17.3181 | 57.743 | 7.218 | 20375.8535 |
+| 7000 | 0.5657 | 610.9518 | 3706.7395 | 3.8349 | 17.3127 | 57.761 | 7.22 | 22378.3477 |
+| 8000 | 0.6465 | 574.6434 | 3612.8188 | 3.7883 | 17.3011 | 57.8 | 7.225 | 20749.2754 |
+| 9000 | 0.7273 | 539.2350 | 3470.0164 | 3.7526 | 17.2807 | 57.868 | 7.234 | 15822.4590 |
+| 10000 | 0.8081 | 522.2968 | 3161.6411 | 3.7214 | 17.3805 | 57.536 | 7.192 | 8437.9199 |
+| 11000 | 0.8889 | 509.9135 | 3373.0425 | 3.6955 | 17.2891 | 57.84 | 7.23 | 6858.4756 |
+| 12000 | 0.9697 | 478.6796 | 3359.7512 | 3.6670 | 17.3301 | 57.703 | 7.213 | 7444.5142 |
+| 12375 | 1.0 | 482.2987 | 3227.4072 | 3.6571 | 17.2805 | 57.869 | 7.234 | 8891.3779 |
 
 ### Framework versions
 - Distily 0.2.0
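The `distillation_objective` entry above pairs a forward KL loss on the logits (weight 1) with a reverse KL loss on the hidden states (weight 2.0); the attention component has weight 0 and is inactive. As a rough illustration of how such a combined objective can be computed, here is a minimal PyTorch sketch. It is not Distily's actual implementation, and normalizing hidden states with a log-softmax over the feature dimension is an assumption made purely for the example:

```python
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    # KL(teacher || student): pushes the student to cover the whole
    # teacher distribution (the `kl` loss_fn used on logits).
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

def reverse_kl(student_hs, teacher_hs):
    # KL(student || teacher): the mode-seeking variant (the `reverse_kl`
    # loss_fn used on hidden states). Treating hidden states as
    # distributions via log-softmax is an assumption of this sketch.
    s = F.log_softmax(student_hs, dim=-1)
    t = F.log_softmax(teacher_hs, dim=-1)
    return F.kl_div(t, s, log_target=True, reduction="batchmean")

def distillation_loss(student_out, teacher_out):
    # Weights mirror the card: 1 on the logits term, 2.0 on the
    # hidden-state term, averaged over layers (layer_mapper=None).
    logits_loss = forward_kl(student_out.logits, teacher_out.logits)
    hs_loss = sum(
        reverse_kl(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ) / len(student_out.hidden_states)
    return 1.0 * logits_loss + 2.0 * hs_loss
```

Note the argument order in `F.kl_div`: the first argument is the log-distribution being optimized toward the second, so swapping the two inputs is what turns the forward KL into the reverse KL.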
logs/hs_loss_fn=reverse_kl, hs_weight=2.0/events.out.tfevents.1723683654.5f530b1cf724
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:850cf97a3158676f4474d5a21fadb2c5f4d6e1394fd4480e88c1011945f4a797
+size 249
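The added `events.out.tfevents.*` file is stored as a Git LFS pointer (hence the `version`/`oid`/`size` lines above), so the actual TensorBoard log only materializes after `git lfs pull`. Once downloaded, its scalar series can be read back with TensorBoard's event-processing API; a small sketch, assuming the `tensorboard` package is installed and the run directory matches the path above:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point the accumulator at the run directory containing the tfevents file.
log_dir = "logs/hs_loss_fn=reverse_kl, hs_weight=2.0"
acc = EventAccumulator(log_dir)
acc.Reload()  # parse the event file from disk

# Dump every logged scalar series: tag name, training step, value.
for tag in acc.Tags()["scalars"]:
    for event in acc.Scalars(tag):
        print(f"{tag}\tstep={event.step}\tvalue={event.value}")
```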