Distributed Training Papers
Papers related to distributed training
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Paper • 2304.11277 • Published • Note: Paper on FSDP, PyTorch's implementation of ZeRO-3. In addition, the following blog posts might be an easier introduction:
- PyTorch's blog post on FSDP: https://engineering.fb.com/2021/07/15/open-source/fsdp/
- DeepSpeed's tutorial on ZeRO: https://www.deepspeed.ai/tutorials/zero/
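For a quick feel of what the paper covers, here is a minimal sketch of wrapping a model with PyTorch's FullyShardedDataParallel. It assumes a multi-GPU launch via torchrun (which sets LOCAL_RANK) and uses default sharding settings, so it is illustrative rather than a tuned setup.

```python
# Minimal FSDP sketch (not from the paper); assumes `torchrun --nproc_per_node=N this_file.py`.
import os
import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    # FSDP shards parameters, gradients and optimizer state across ranks (ZeRO-3 style),
    # gathering full parameters only around each wrapped module's forward/backward.
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()

    destroy_process_group()

if __name__ == "__main__":
    main()
```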
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Paper • 1909.08053 • Published • Note: Initial paper on Tensor Parallelism.
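As a companion to the note above, a toy single-process sketch of the core tensor-parallel trick on an MLP block: the first weight is split column-wise and the second row-wise, and summing the partial outputs (the role played by an all-reduce across GPUs) recovers the unsharded result. The shapes and the tensor-parallel degree are arbitrary illustrative choices, not values from the paper.

```python
# Toy illustration of Megatron-style tensor parallelism, simulated in one process.
import torch

torch.manual_seed(0)
d_model, d_ff, tp = 8, 16, 2          # hidden size, FFN size, tensor-parallel degree
x  = torch.randn(4, d_model)          # (batch, d_model)
W1 = torch.randn(d_model, d_ff)       # first linear (split column-wise)
W2 = torch.randn(d_ff, d_model)       # second linear (split row-wise)

# Reference: the unsharded computation.
ref = torch.relu(x @ W1) @ W2

# "Sharded": rank i holds a column slice of W1 and the matching row slice of W2,
# so ReLU stays local and only the final partial outputs need to be combined.
W1_shards = W1.chunk(tp, dim=1)
W2_shards = W2.chunk(tp, dim=0)
partials = [torch.relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
out = sum(partials)                   # stands in for the all-reduce across TP ranks

print(torch.allclose(ref, out, atol=1e-4))  # True
```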
Reducing Activation Recomputation in Large Transformer Models
Paper • 2205.05198 • Published • Note: To read after the Megatron-LM paper. It introduces an improvement over vanilla Tensor Parallelism called "Sequence Parallelism", which shards the activations along the sequence axis outside of the Tensor Parallel regions, mostly to save memory.
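A toy single-process sketch of the sequence-parallel idea from the note, using LayerNorm as the per-token operation that sits outside the Tensor Parallel region. The chunk/cat calls stand in for the reduce-scatter/all-gather of a real multi-GPU run, and the sizes are arbitrary.

```python
# Toy illustration (not the paper's code): activations outside the TP regions are sharded
# along the sequence axis, so each rank materializes only seq_len / tp of them.
import torch

torch.manual_seed(0)
tp, seq, d = 2, 8, 4
x = torch.randn(seq, d)                       # (sequence, hidden) activations
ln = torch.nn.LayerNorm(d)

# Reference: full-sequence LayerNorm.
ref = ln(x)

# Sequence-parallel: each "rank" normalizes only its slice of the sequence.
# LayerNorm is per-token, so sharding the sequence axis is numerically exact.
shards = x.chunk(tp, dim=0)                   # stands in for the reduce-scatter
out = torch.cat([ln(s) for s in shards], dim=0)  # stands in for the all-gather

print(torch.allclose(ref, out, atol=1e-6))    # True
```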
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Paper • 1811.06965 • Published • Note: Initial paper on Pipeline Parallelism.
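To make the idea concrete, a toy single-process sketch of GPipe-style micro-batching: the model is cut into sequential stages (one per device in a real setup) and each mini-batch is split into micro-batches that flow through the stages to shrink the pipeline bubble. Stage boundaries and sizes are arbitrary illustrative choices, not GPipe itself.

```python
# Toy pipeline-parallel sketch, simulated in one process.
import torch
import torch.nn as nn

torch.manual_seed(0)
stages = nn.ModuleList([                      # in practice each stage lives on its own device
    nn.Sequential(nn.Linear(16, 32), nn.ReLU()),
    nn.Sequential(nn.Linear(32, 32), nn.ReLU()),
    nn.Sequential(nn.Linear(32, 4)),
])

def pipeline_forward(x, num_microbatches=4):
    outputs = []
    for mb in x.chunk(num_microbatches, dim=0):   # GPipe-style micro-batching
        for stage in stages:                       # each hop would be a device-to-device send
            mb = stage(mb)
        outputs.append(mb)
    return torch.cat(outputs, dim=0)

# Micro-batched execution matches the full-batch forward.
x = torch.randn(8, 16)
full = x
for stage in stages:
    full = stage(full)
print(torch.allclose(full, pipeline_forward(x), atol=1e-6))   # True
```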
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Paper • 2104.04473 • Published • Note: Great paper on how to combine the different parallelism methods (DP, TP, and PP). It provides an overview of each method, a good analysis of their pros and cons, and best practices for combining them.
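As a rough illustration of how the three dimensions fit together, a small sketch that factorizes a flat set of ranks into data-, tensor-, and pipeline-parallel groups. Keeping TP innermost (consecutive ranks) so it maps onto fast intra-node links is a common convention, but the exact layout below is an illustrative choice, not code from the paper.

```python
# Sketch of assigning ranks to DP / TP / PP process groups for world_size = dp * tp * pp.
from itertools import product

def parallel_groups(world_size, tp, pp):
    assert world_size % (tp * pp) == 0
    dp = world_size // (tp * pp)
    # Illustrative rank layout: rank = dp_idx * (pp * tp) + pp_idx * tp + tp_idx,
    # i.e. tensor-parallel ranks are consecutive.
    grid = {(d, p, t): d * pp * tp + p * tp + t
            for d, p, t in product(range(dp), range(pp), range(tp))}
    tp_groups = [[grid[d, p, t] for t in range(tp)] for d in range(dp) for p in range(pp)]
    pp_groups = [[grid[d, p, t] for p in range(pp)] for d in range(dp) for t in range(tp)]
    dp_groups = [[grid[d, p, t] for d in range(dp)] for p in range(pp) for t in range(tp)]
    return tp_groups, pp_groups, dp_groups

tp_g, pp_g, dp_g = parallel_groups(world_size=16, tp=2, pp=4)   # => dp = 2
print("TP groups:", tp_g)
print("PP groups:", pp_g)
print("DP groups:", dp_g)
```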