The Universal Checkpointing paper is out! https://arxiv.org/abs/2406.18820
If you remember the BigScience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-DeepSpeed so that we could quickly scale the node topology up and down while continuing training.
Since then the DeepSpeed team has continued improving it, and it is now fully integrated into DeepSpeed.
The blog post is here: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/README.md