When Babies Teach Babies: Peer Knowledge Sharing Beats Teacher-Guided Distillation in Small-Data LMs
This model uses weighted mutual learning (WML) to find and train distilled versions of a teacher model using peer-to-peer learning. It builds on the approach described in "Weighted Mutual Learning with Diversity-Driven Model Compression" (Zhang et al., 2022), with some key differences.
Approach
Peer Model Initialization
Unlike the original paper, which uses differential pruning of the teacher model, we use Bayesian optimization to choose the architectures of smaller peer models:
- For example, if `num_peers = 4`, target parameter counts are N/2, N/3, N/4, N/5 (where N is the teacher model size)
- Optimize `num_layers`, `attention_heads`, and `hidden_size` to reach target parameter counts
- This ensures diversity while also reducing model size
The key difference is that pruning (as used in the original paper) only masks parameters, while our distillation approach actually reduces the model architecture size.
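The target-count rule and the architecture search described above can be sketched as follows. This is a minimal illustration, not the model's actual code: `estimate_params` is a hypothetical rough estimator for a GPT-style transformer (about 12·d² weights per block plus token embeddings), and a plain grid search stands in for the Bayesian optimizer so the example runs with the standard library only. Under this estimate `attention_heads` does not change the parameter count; it only has to divide `hidden_size`.

```python
from itertools import product


def target_param_counts(teacher_params, num_peers):
    """Peer k (1-based) targets N / (k + 1): N/2, N/3, ..."""
    return [teacher_params // (k + 1) for k in range(1, num_peers + 1)]


def estimate_params(num_layers, hidden_size, vocab_size):
    """Rough GPT-style estimate (assumption): ignores biases, layer norms
    and position embeddings."""
    return 12 * num_layers * hidden_size ** 2 + vocab_size * hidden_size


def closest_config(target, vocab_size, layer_choices, head_choices, hidden_choices):
    """Grid search standing in for Bayesian optimization: return the
    configuration whose estimated size is closest to the target."""
    best_gap, best_cfg = None, None
    for layers, heads, hidden in product(layer_choices, head_choices, hidden_choices):
        if hidden % heads != 0:  # heads must divide the hidden size
            continue
        gap = abs(estimate_params(layers, hidden, vocab_size) - target)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_cfg = {
                "num_layers": layers,
                "attention_heads": heads,
                "hidden_size": hidden,
            }
    if best_cfg is None:
        raise ValueError("no valid configuration in the search space")
    return best_cfg
```

A real run would replace the grid with a Bayesian optimizer that maximizes the negative gap; the `bayesian_init_points` and `bayesian_n_iter` hyperparameters listed below would then control its random and guided evaluations.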
Weighted Mutual Learning
We use the bi-level optimization method from the paper to minimize the WML loss and ensemble loss:
- Inner loop: Train peer models using weighted knowledge distillation loss (cross entropy + KL divergence)
- Outer loop: Update peer weights using mirror gradient descent to optimize ensemble performance (ensemble loss)
This allows the framework to dynamically adjust the importance of each peer during training.
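The two loops can be illustrated on a single example with plain Python. The sketch below rests on assumptions: the peer loss is written as `(1 - alpha) * CE + alpha * weighted KL` to the other peers, the ensemble prediction is taken to be the weighted average of peer probabilities, and the outer step uses the exponentiated-gradient form of mirror descent on the probability simplex. The exact loss in the paper and in this model may differ in detail.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def peer_loss(i, peer_probs, target, weights, alpha):
    """Inner-loop loss for peer i (assumed form): cross entropy on the label
    plus a weighted KL term pulling peer i toward the other peers."""
    ce = -math.log(peer_probs[i][target])
    kd = sum(
        weights[j] * kl_divergence(peer_probs[j], peer_probs[i])
        for j in range(len(peer_probs))
        if j != i
    )
    return (1 - alpha) * ce + alpha * kd


def ensemble_grads(peer_probs, target, weights):
    """Gradient of -log(sum_i w_i * p_i[target]) with respect to each w_i."""
    p_ens = sum(w * p[target] for w, p in zip(weights, peer_probs))
    return [-p[target] / p_ens for p in peer_probs]


def mirror_descent_step(weights, grads, step_size):
    """Exponentiated-gradient update: weights stay positive and sum to 1."""
    scaled = [w * math.exp(-step_size * g) for w, g in zip(weights, grads)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

With two peers where only the first favors the correct class, one outer step shifts weight toward the first peer, which is the dynamic re-weighting described above.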
Hyperparameters of the champion peer model
Hyperparameter | Value |
---|---|
weight_decay | 0.1 |
beta1 | 0.9 |
beta2 | 0.95 |
bayesian_init_points | 10 |
bayesian_n_iter | 100 |
grad_clip | 1.0 |
prune_importance | 'l1' |
layer_bound | 0.9 |
batch_size | 3 |
block_size | 512 |
num_epochs | 100 |
loss_alpha | 0.5 |
num_batches | 60 |
warmup_iters | 5 |
learning_rate | 0.05 |
lr_decay_iters | 200 |
min_lr | 0.005 |
enable_early_stopping | True |
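
The names `warmup_iters`, `learning_rate`, `lr_decay_iters`, and `min_lr` suggest a warmup followed by a decay toward a floor, but the table does not state the shape of the schedule. As an illustration only, the sketch below assumes linear warmup followed by cosine decay, using the values from the table; `get_lr` is a hypothetical helper name.

```python
import math

# Values taken from the hyperparameter table above.
LEARNING_RATE = 0.05
MIN_LR = 0.005
WARMUP_ITERS = 5
LR_DECAY_ITERS = 200


def get_lr(it):
    """Assumed schedule: linear warmup, cosine decay, then a constant floor."""
    if it < WARMUP_ITERS:
        return LEARNING_RATE * (it + 1) / (WARMUP_ITERS + 1)
    if it > LR_DECAY_ITERS:
        return MIN_LR
    ratio = (it - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)
```

Under this assumption the rate peaks at `learning_rate` when warmup ends and settles at `min_lr` from iteration `lr_decay_iters` onward.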
References
Zhang, M., Wang, L., Campos, D., Huang, W., Guo, C., & Yang, B. (2022). Weighted Mutual Learning with Diversity-Driven Model Compression. Advances in Neural Information Processing Systems, 35.