Distributed Training Papers
Papers related to distributed training
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Paper • 2304.11277 • Published • Note: Paper on FSDP, PyTorch's implementation of ZeRO-3. In addition, the following blog posts might be an easier introduction:
- PyTorch's blog post on FSDP: https://engineering.fb.com/2021/07/15/open-source/fsdp/
- DeepSpeed's tutorial on ZeRO: https://www.deepspeed.ai/tutorials/zero/
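For a quick feel of what the paper covers, here is a minimal sketch of wrapping a model with PyTorch's FullyShardedDataParallel. It assumes a multi-GPU launch via torchrun (which sets LOCAL_RANK) and uses default sharding settings, so it is illustrative rather than a tuned setup.

```python
# Minimal FSDP sketch (not from the paper); assumes `torchrun --nproc_per_node=N this_file.py`.
import os
import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    # FSDP shards parameters, gradients and optimizer state across ranks (ZeRO-3 style),
    # gathering full parameters only around each wrapped module's forward/backward.
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()

    destroy_process_group()

if __name__ == "__main__":
    main()
```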
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Paper • 1909.08053 • Published • Note: Initial paper on Tensor Parallelism.
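As a companion to the note above, a toy single-process sketch of the core tensor-parallel trick on an MLP block: the first weight is split column-wise and the second row-wise, and summing the partial outputs (the role played by an all-reduce across GPUs) recovers the unsharded result. The shapes and the tensor-parallel degree are arbitrary illustrative choices, not values from the paper.

```python
# Toy illustration of Megatron-style tensor parallelism, simulated in one process.
import torch

torch.manual_seed(0)
d_model, d_ff, tp = 8, 16, 2          # hidden size, FFN size, tensor-parallel degree
x  = torch.randn(4, d_model)          # (batch, d_model)
W1 = torch.randn(d_model, d_ff)       # first linear (split column-wise)
W2 = torch.randn(d_ff, d_model)       # second linear (split row-wise)

# Reference: the unsharded computation.
ref = torch.relu(x @ W1) @ W2

# "Sharded": rank i holds a column slice of W1 and the matching row slice of W2,
# so ReLU stays local and only the final partial outputs need to be combined.
W1_shards = W1.chunk(tp, dim=1)
W2_shards = W2.chunk(tp, dim=0)
partials = [torch.relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
out = sum(partials)                   # stands in for the all-reduce across TP ranks

print(torch.allclose(ref, out, atol=1e-4))  # True
```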
Reducing Activation Recomputation in Large Transformer Models
Paper • 2205.05198 • Published • Note: To read after the Megatron-LM paper. It introduces an improvement over vanilla Tensor Parallelism called "Sequence Parallelism", which shards the activations along the sequence axis outside of the Tensor Parallel regions, mostly to save memory.
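A toy single-process sketch of the sequence-parallel idea from the note, using LayerNorm as the per-token operation that sits outside the Tensor Parallel region. The chunk/cat calls stand in for the reduce-scatter/all-gather of a real multi-GPU run, and the sizes are arbitrary.

```python
# Toy illustration (not the paper's code): activations outside the TP regions are sharded
# along the sequence axis, so each rank materializes only seq_len / tp of them.
import torch

torch.manual_seed(0)
tp, seq, d = 2, 8, 4
x = torch.randn(seq, d)                       # (sequence, hidden) activations
ln = torch.nn.LayerNorm(d)

# Reference: full-sequence LayerNorm.
ref = ln(x)

# Sequence-parallel: each "rank" normalizes only its slice of the sequence.
# LayerNorm is per-token, so sharding the sequence axis is numerically exact.
shards = x.chunk(tp, dim=0)                   # stands in for the reduce-scatter
out = torch.cat([ln(s) for s in shards], dim=0)  # stands in for the all-gather

print(torch.allclose(ref, out, atol=1e-6))    # True
```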
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Paper • 1811.06965 • Published • Note: Initial paper on Pipeline Parallelism.
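To make the idea concrete, a toy single-process sketch of GPipe-style micro-batching: the model is cut into sequential stages (one per device in a real setup) and each mini-batch is split into micro-batches that flow through the stages to shrink the pipeline bubble. Stage boundaries and sizes are arbitrary illustrative choices, not GPipe itself.

```python
# Toy pipeline-parallel sketch, simulated in one process.
import torch
import torch.nn as nn

torch.manual_seed(0)
stages = nn.ModuleList([                      # in practice each stage lives on its own device
    nn.Sequential(nn.Linear(16, 32), nn.ReLU()),
    nn.Sequential(nn.Linear(32, 32), nn.ReLU()),
    nn.Sequential(nn.Linear(32, 4)),
])

def pipeline_forward(x, num_microbatches=4):
    outputs = []
    for mb in x.chunk(num_microbatches, dim=0):   # GPipe-style micro-batching
        for stage in stages:                       # each hop would be a device-to-device send
            mb = stage(mb)
        outputs.append(mb)
    return torch.cat(outputs, dim=0)

# Micro-batched execution matches the full-batch forward.
x = torch.randn(8, 16)
full = x
for stage in stages:
    full = stage(full)
print(torch.allclose(full, pipeline_forward(x), atol=1e-6))   # True
```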
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Paper • 2104.04473 • Published • Note: Great paper on how to combine the different parallelism methods (DP, TP, and PP). It provides an overview of each method, a good analysis of their pros and cons, and best practices for combining them.
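As a rough illustration of how the three dimensions fit together, a small sketch that factorizes a flat set of ranks into data-, tensor-, and pipeline-parallel groups. Keeping TP innermost (consecutive ranks) so it maps onto fast intra-node links is a common convention, but the exact layout below is an illustrative choice, not code from the paper.

```python
# Sketch of assigning ranks to DP / TP / PP process groups for world_size = dp * tp * pp.
from itertools import product

def parallel_groups(world_size, tp, pp):
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)
    # Illustrative rank layout: rank = dp_idx * (pp * tp) + pp_idx * tp + tp_idx,
    # i.e. tensor-parallel ranks are consecutive.
    grid = {(d, p, t): d * pp * tp + p * tp + t
            for d, p, t in product(range(dp), range(pp), range(tp))}
    tp_groups = [[grid[d, p, t] for t in range(tp)] for d in range(dp) for p in range(pp)]
    pp_groups = [[grid[d, p, t] for p in range(pp)] for d in range(dp) for t in range(tp)]
    dp_groups = [[grid[d, p, t] for d in range(dp)] for p in range(pp) for t in range(tp)]
    return tp_groups, pp_groups, dp_groups

tp_g, pp_g, dp_g = parallel_groups(world_size=16, tp=2, pp=4)   # => dp = 2
print("TP groups:", tp_g)
print("PP groups:", pp_g)
print("DP groups:", dp_g)
```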