@stas on Hugging Face: "A combined effort from the IBM + PyTorch teams achieved an incredible training…"

A combined effort from the IBM + PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, despite having just an 800Gbps inter-node connection.

This is because they achieved almost full overlap between comms and compute, and introduced a novel selective activation recomputation method that recomputes only activations that are large but inexpensive to recompute.
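To make the "large but inexpensive" criterion concrete, here is a toy sketch of how such a selection might look. It is not the method from the linked post: the activation names, sizes, timings, the `budget_ms` parameter and the greedy ranking are all hypothetical, chosen only to illustrate the idea of freeing the most memory per unit of recompute time.

```python
from dataclasses import dataclass


@dataclass
class Activation:
    name: str
    size_mb: float       # memory held until the backward pass if saved
    recompute_ms: float  # extra time to recompute it during the backward pass


def pick_for_recompute(acts, budget_ms):
    """Greedily pick activations that free the most memory per ms of recompute."""
    # Rank by "MB freed per ms spent": large and cheap activations come first.
    ranked = sorted(acts, key=lambda a: a.size_mb / a.recompute_ms, reverse=True)
    chosen, spent = [], 0.0
    for a in ranked:
        # Only drop an activation if recomputing it still fits the time budget.
        if spent + a.recompute_ms <= budget_ms:
            chosen.append(a.name)
            spent += a.recompute_ms
    return chosen


# Hypothetical per-layer numbers, for illustration only.
acts = [
    Activation("attention_scores", 512.0, 0.4),
    Activation("mlp_hidden", 256.0, 2.0),
    Activation("layernorm_out", 64.0, 0.1),
    Activation("attention_output", 64.0, 3.0),
]

chosen = pick_for_recompute(acts, budget_ms=1.0)
print(chosen)
```

With these made-up numbers the large, cheap activations are dropped and recomputed, while the expensive ones stay in memory. In real PyTorch code the recomputation itself would typically go through something like `torch.utils.checkpoint`, applied only to the selected parts of the model.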

Check out their post here: https://pytorch.org/blog/maximizing-training/