Trapezoidal scheduler with cooldown phase

#4
by maveriq - opened

Hi. Thanks for yet another insightful contribution. I am interested in extending this work with a couple of variations that I have in mind.

Can you say a bit more about the trapezoidal LR scheduling? In particular, how is it different from OneCycleLR? Secondly, is the cooldown phase the same as using the 'three_phase' option of OneCycleLR? And lastly, what is the warmup percentage/steps?

Would it be possible to open-source the training pipeline as well? Training from scratch at these sizes (135M/360M) is within the reach of many practitioners/researchers, and having access to the complete pipeline would help reduce confounding factors.

Thanks!

For anyone with the same questions, I found most of the answers in this paper, except for the warmup percentage/steps.

Here is a quick implementation of TrapezoidLRScheduler

Hugging Face TB Research org • edited 29 days ago

Hey! For the warmup we set it to 5000 steps. To be honest, we didn't do much ablation on it; I think it doesn't have that much impact for very long training runs (might be wrong). As for the training code, we will post it on GitHub this week! We also have an implementation of WSD (warmup-stable-decay) in nanotron LrSchedulerArgs.
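
For readers who want to see the shape of the schedule, here is a minimal sketch of a trapezoidal (warmup, constant plateau, cooldown) learning-rate multiplier. It is not the training code from this repo: the function name `trapezoid_lr_factor`, the linear shapes of both ramps, and the 20% cooldown fraction are assumptions made for illustration. Only the 5000 warmup steps come from this thread.

```python
def trapezoid_lr_factor(step, total_steps, warmup_steps=5000, cooldown_fraction=0.2):
    """Return a multiplier in [0, 1] to apply to the peak learning rate.

    Assumptions (illustrative, not confirmed by the authors):
      - linear warmup from 0 to 1 over `warmup_steps`
      - constant plateau at 1
      - linear cooldown from 1 to 0 over the last `cooldown_fraction` of training
    """
    cooldown_steps = int(total_steps * cooldown_fraction)
    cooldown_start = total_steps - cooldown_steps

    # Phase 1: linear warmup.
    if step < warmup_steps:
        return step / warmup_steps
    # Phase 2: constant plateau (also covers the no-cooldown case).
    if step < cooldown_start or cooldown_steps == 0:
        return 1.0
    # Phase 3: linear cooldown to zero, clamped for steps past the end.
    return max(0.0, (total_steps - step) / cooldown_steps)


if __name__ == "__main__":
    total = 100_000  # hypothetical run length
    for s in (0, 2_500, 5_000, 50_000, 90_000, 100_000):
        print(s, trapezoid_lr_factor(s, total))
```

If you are on PyTorch, a function like this can be wrapped with `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=...)`, which multiplies the optimizer's base learning rate by the returned factor at each step. Unlike OneCycleLR, the plateau keeps the learning rate at its peak for most of training, so the cooldown can be started from any plateau checkpoint.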

Just landed on this discussion as I had the same question regarding the LR schedule. I found the original implementation useful: https://github.com/epfml/schedules-and-scaling/blob/6e8b7f952420c928cc09a0e4bda9678e2bf42e5f/src/optim/utils.py#L55
