Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Abstract
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters. Recent studies propose using small proxy models and a small corpus to perform hyperparameter searches and transferring the optimal hyperparameters to large models and a large corpus. While zero-shot transferability has been theoretically and empirically proven for model-size-related hyperparameters, such as depth and width, the zero-shot transfer from a small corpus to a large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between these variables and demonstrated its transferability across model sizes. Based on this observation, we propose a new learning rate scheduler, the Power scheduler, that is agnostic to the number of training tokens and batch size. Our experiments show that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models. We open-source these pretrained models at https://ibm.biz/BdKhLa.
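The abstract does not spell out the scheduler's exact functional form, but a power-law decay in the number of tokens seen so far illustrates the token-number-agnostic idea: the rule needs no preset training horizon. The sketch below is only an illustration under that assumption; the constants `a`, `b`, and `lr_max` are hypothetical placeholders, not values from the paper.

```python
def power_lr(tokens_seen: int,
             a: float = 1.0,       # hypothetical coefficient, not a value from the paper
             b: float = 0.5,       # hypothetical power-law exponent
             lr_max: float = 3e-4  # hypothetical peak learning rate
             ) -> float:
    """Power-law decay in the number of tokens seen so far.

    Because the rule depends only on tokens_seen, and not on the total
    training budget, the same schedule can be kept when training is
    extended -- the token-number-agnostic property described above.
    """
    return min(lr_max, a * max(tokens_seen, 1) ** (-b))


# Example: learning rate after 10M, 1B, and 100B tokens
for n in (10_000_000, 1_000_000_000, 100_000_000_000):
    print(f"{n:>15,} tokens -> lr = {power_lr(n):.2e}")
```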
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts (2024)
- Scaling Law with Learning Rate Annealing (2024)
- Layerwise Recurrent Router for Mixture-of-Experts (2024)
- Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models (2024)
- Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment (2024)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Is this the right place for feedback even if the paper was submitted by @akhaliq instead of one of the authors?
In case it is: in Figure 2, if the first plot ranges from lr 0.0002 to 0.0256, the second plot might as well cover the same range for consistency instead of stopping at 0.0128. Using the same perplexity range in both plots would also be nice, instead of 40-100 in the first and 40-60 in the second.
The same goes for Figure 4, where panels (a), (b), and (c) range from ~1e-4 to ~5e-6 on the y-axis, while (d) ranges from 8e-5 to 6e-6; maybe add a grid for better comparability.
Figure 5 as well: if the first plot ranges from 43 to 55 and the second from 44 to 56, might as well make both 43 to 56 :)
So excited to see such interesting work! But I have a question: in Hypothesis 1 you mention that "we only keep the three best batch sizes to focus on the optimal scenario", and then directly use these "best three batch sizes" to fit a and b. In practice, however, we don't know what the optimal batch size is. Is the learning rate obtained in this case still the best?
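For readers wondering what fitting a and b involves here: it amounts to a linear regression in log-log space over (batch size, best learning rate) pairs. The snippet below uses made-up numbers purely to show the mechanics; neither the data nor the fitted values come from the paper.

```python
import numpy as np

# Hypothetical (batch size in tokens, empirically best learning rate) pairs,
# purely illustrative -- not measurements from the paper.
batch_sizes = np.array([0.5e6, 1e6, 2e6, 4e6])
best_lrs = np.array([6.0e-4, 8.5e-4, 1.2e-3, 1.7e-3])

# Fit lr ~= a * batch_size**b, i.e. log(lr) = log(a) + b * log(batch_size)
b, log_a = np.polyfit(np.log(batch_sizes), np.log(best_lrs), deg=1)
a = np.exp(log_a)
print(f"fitted a = {a:.3e}, b = {b:.3f}")

# Extrapolate to a batch size that was never searched, e.g. 8M tokens
print(f"predicted best lr at 8M-token batches: {a * (8e6) ** b:.2e}")
```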