A combined effort from the IBM and PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, despite having just an 800Gbps inter-node connection.
They got there by achieving almost full overlap between comms and compute, and by introducing a novel selective activation recomputation method that recomputes only the activations that are large but inexpensive to recalculate.
Check out their post here: https://pytorch.org/blog/maximizing-training/
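
To make the "large but inexpensive" idea concrete, here is a toy sketch of how one might pick which activations to recompute. This is not the authors' method: the `Activation` record, the `pick_recompute` helper, the greedy ranking, and all the numbers are made up for illustration, and it assumes every recompute cost is positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activation:
    name: str
    size_mb: float       # memory held if the activation is saved for backward
    recompute_ms: float  # extra forward time if it is recomputed instead


def pick_recompute(acts, max_extra_ms):
    """Greedy choice: free the most memory per millisecond of recompute."""
    ranked = sorted(acts, key=lambda a: a.size_mb / a.recompute_ms, reverse=True)
    chosen, spent = [], 0.0
    for act in ranked:
        # Take an activation only if it still fits in the recompute time budget.
        if spent + act.recompute_ms <= max_extra_ms:
            chosen.append(act)
            spent += act.recompute_ms
    return chosen


# Hypothetical numbers for one transformer block (not measurements).
acts = [
    Activation("layernorm_out", 512.0, 0.25),
    Activation("gelu_out", 2048.0, 0.5),
    Activation("attention_out", 512.0, 6.0),
    Activation("mlp_matmul_out", 2048.0, 8.0),
]

chosen = pick_recompute(acts, max_extra_ms=1.0)
saved_mb = sum(a.size_mb for a in chosen)
print([a.name for a in chosen], saved_mb)
```

With these made-up numbers the cheap elementwise outputs get recomputed while the expensive matmul and attention outputs stay saved, which is the trade-off the post describes.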