Eight GPUs sit in a server. Each one is holding the same model, weight for weight. Each one has just finished a backward pass on a different slice of the batch. They now hold eight different gradient tensors, all the same shape, all disagreeing about what the next step should be.
Before the optimizer can move, those eight tensors have to become one. Specifically, they have to become their element-wise mean, with every GPU ending up holding the same averaged tensor. That sounds like a small problem. It is, in fact, the problem distributed training spends most of its time solving, and the algorithm that solves it was published close to a decade before deep learning picked it up.
This post is part of a seven-blog series on distributed training. The earlier post in the series covered why we need data parallelism at all. This one stays at the small scale, two to a few dozen GPUs, where a model still fits on one card and the bottleneck is throughput. The later posts cover what happens when the model itself outgrows the card.
Where the algorithm came from
The setup looks recent because the framework is. PyTorch's DistributedDataParallel, or DDP, dates from around 2018, and the paper describing it landed at VLDB 2020 with Shen Li and ten co-authors from the PyTorch team. The math underneath is older.
The collective operation at the heart of DDP is called AllReduce. Every process contributes a tensor, every process receives the element-wise reduction of all of them (sum or mean), and the whole exchange is a single collective call: no process has its result until every process has contributed. AllReduce has lived inside MPI since the 1990s. What changed in 2017 was how to do it on a row of GPUs without choking on the slowest link in the cluster.
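To pin down the contract before any ring shows up, here is a minimal reference sketch in plain Python. It does no networking at all; the function name allreduce and the list-of-lists representation are illustrative choices, not any library's API. It only shows what every process should end up holding.

```python
from typing import List

Vector = List[float]


def allreduce(per_rank: List[Vector], op: str = "sum") -> List[Vector]:
    """Reference semantics only: no networking, no ring, just the contract."""
    if op not in ("sum", "mean"):
        raise ValueError(f"unsupported op: {op}")
    world_size = len(per_rank)
    # Element-wise sum across all ranks.
    reduced = [sum(values) for values in zip(*per_rank)]
    if op == "mean":
        reduced = [v / world_size for v in reduced]
    # Every rank receives its own copy of the same reduced vector.
    return [list(reduced) for _ in range(world_size)]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce(grads, "sum"))
print(allreduce(grads, "mean"))
```

Three inputs go in, three identical outputs come out. Everything else in this post is about producing that result without any one link carrying too much of the traffic.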
The breakthrough came out of Baidu's Silicon Valley AI Lab. Andrew Gibiansky, then on that team, wrote a blog post in early 2017 working through a ring-based AllReduce for deep learning. The algorithm itself was not new. Pitch Patarasuk and Xin Yuan had published a paper in the Journal of Parallel and Distributed Computing in February 2009 titled "Bandwidth optimal all-reduce algorithms for clusters of workstations" that laid out the same construction in the cluster-computing world. Baidu's contribution was to bring it across to deep learning training and to publish a working integration with TensorFlow. NVIDIA's NCCL library implements ring collectives for GPUs, and PyTorch built DDP on top of NCCL. The chain of citations is short. The idea moved fast because it was the right shape for the hardware.
What every GPU is holding
To understand the algorithm you first have to be precise about the state. In DDP, every worker process owns one GPU. Every GPU holds:
- A full copy of the model. Same parameters, byte for byte, on every card.
- A different shard of the batch. If the global batch is 256 and you have eight GPUs, each card sees 32 examples.
- A full optimizer state for the full model.
The forward and backward passes on each GPU are local. No communication. Each card computes its own loss on its own 32 examples and runs its own backward, producing a gradient tensor for every parameter. At the end of the backward pass, every GPU is holding a complete gradient for the full model that disagrees, in detail, with every other GPU's gradient because each was computed on different data.
The optimizer cannot step yet. If GPU 0 stepped on its own gradient and GPU 1 stepped on its own, their weights would diverge after one iteration and the whole construction would fall apart. So before the optimizer touches anything, the gradients have to be averaged across all GPUs. That averaging is the AllReduce.
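A toy version makes the divergence concrete. The sketch below assumes a one-parameter model y = w * x with a squared-error loss and two equal-sized shards; the data, the learning rate, and the function names are all made up for illustration.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: List[Example]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def step_without_sync(w: float, shards: List[List[Example]], lr: float) -> List[float]:
    # Each replica steps on its own gradient: the weights drift apart.
    return [w - lr * local_gradient(w, shard) for shard in shards]


def step_with_allreduce(w: float, shards: List[List[Example]], lr: float) -> List[float]:
    # Average the local gradients first, then every replica takes the same step.
    grads = [local_gradient(w, shard) for shard in shards]
    mean_grad = sum(grads) / len(grads)
    return [w - lr * mean_grad for _ in shards]


shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 5.0), (4.0, 9.0)]]
print("no sync:  ", step_without_sync(0.0, shards, lr=0.1))
print("allreduce:", step_with_allreduce(0.0, shards, lr=0.1))
```

With equal-sized shards, the mean of the per-shard gradients is also exactly the gradient you would have computed on the whole batch on one card, which is why data parallelism leaves the training math unchanged.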
The naive way, and why it fails
The obvious algorithm is to pick one GPU as a parameter server. Every other GPU sends its gradient to that one. The server sums them, divides by the worker count, and broadcasts the result back.
The bandwidth math is brutal. With N workers, the server receives N-1 tensors over its inbound link and sends out N-1 tensors over its outbound link, so its traffic scales linearly with the worker count. Grow from two GPUs to nine and the central node moves eight times the data it did at two, while every other card's traffic stays flat. Past a handful of cards the parameter server becomes the bottleneck and adding GPUs stops helping.
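The linear growth is easy to tabulate. This sketch assumes, as the text does, that the server is itself one of the N workers; the 1 GiB gradient size and the function name are hypothetical.

```python
from typing import Dict


def parameter_server_traffic(num_workers: int, tensor_bytes: int) -> Dict[str, int]:
    """Bytes moved in one averaging round with one worker acting as server."""
    if num_workers < 1:
        raise ValueError("need at least one worker")
    peers = num_workers - 1
    return {
        "server_in": peers * tensor_bytes,    # one gradient from every other worker
        "server_out": peers * tensor_bytes,   # one averaged copy back to each
        "worker_out": tensor_bytes if peers else 0,  # each non-server card sends once
    }


GIB = 1024 ** 3
for n in (2, 4, 9):
    traffic = parameter_server_traffic(n, tensor_bytes=GIB)
    print(n, "workers: server receives", traffic["server_in"] // GIB, "GiB,",
          "each other worker sends", traffic["worker_out"] // GIB, "GiB")
```

The server's column grows with every card you add; the other workers' column never moves. That asymmetry is the whole problem.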
You can do better with a tree topology, which is what NCCL falls back to at very large scale, but at the modest scales most teams operate at, the ring beats both.
The ring, in plain language
Arrange your N GPUs in a logical ring. Each one has exactly one neighbour to its left and one to its right. Now chop the gradient tensor on each GPU into N equal-sized chunks, labelled 0 through N-1.
The algorithm runs in two phases. Both phases take exactly N-1 steps.
Phase one is called reduce-scatter. On step one, GPU k sends chunk k to its right neighbour and receives chunk k-1 from its left neighbour (indices wrap around the ring), which it adds to its own copy of that chunk. On step two, each GPU sends the chunk it just updated to its right neighbour and again adds the incoming chunk to its own copy. After N-1 steps, each GPU is holding one chunk that contains the full sum across all GPUs; on GPU k that is chunk k+1. The other chunks on that card are partial. The crucial property: every GPU is responsible for finishing exactly one chunk of the sum, and each finishes a different chunk.
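The bookkeeping is easier to trust once it is written down. The sketch below only tracks chunk indices, not data, and follows from the convention that GPU k starts by sending chunk k to the right; the function names are illustrative.

```python
from typing import List, Tuple


def reduce_scatter_schedule(n: int) -> List[List[Tuple[int, int, int]]]:
    """For each step, list (gpu, chunk_it_sends, chunk_it_receives)."""
    schedule = []
    for step in range(n - 1):
        row = []
        for gpu in range(n):
            sends = (gpu - step) % n         # step 0: GPU k sends chunk k
            receives = (gpu - 1 - step) % n  # whatever the left neighbour sends
            row.append((gpu, sends, receives))
        schedule.append(row)
    return schedule


def finished_chunk(gpu: int, n: int) -> int:
    """Index of the fully summed chunk a GPU holds after reduce-scatter."""
    return (gpu + 1) % n


for step, row in enumerate(reduce_scatter_schedule(4), start=1):
    moves = ", ".join(f"GPU{g}: send {s} / recv {r}" for g, s, r in row)
    print(f"step {step}: {moves}")
print("finished chunks:", [finished_chunk(g, 4) for g in range(4)])
```

Two things are worth checking in the output: the chunk a GPU sends at each step is the one it received the step before, and the finished chunks are all different.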
Phase two is called all-gather. The shape is the same as phase one, but with no addition. Each GPU now passes its one fully-summed chunk around the ring. After another N-1 steps, every GPU has every fully-summed chunk. Divide by N and you have the mean. That is the AllReduce.
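Both phases fit in a few dozen lines of plain Python. This is a single-process simulation, a sketch rather than anything NCCL actually runs: "GPUs" are lists, a "send" is a copy, and it assumes the gradient length divides evenly by the number of GPUs. The name ring_allreduce_mean is made up for the example.

```python
from typing import List


def ring_allreduce_mean(grads: List[List[float]]) -> List[List[float]]:
    """Simulate ring AllReduce (mean) over one gradient vector per GPU."""
    n = len(grads)
    length = len(grads[0])
    if length % n != 0:
        raise ValueError("this sketch assumes the length divides evenly by N")
    size = length // n
    # buf[g][c] is GPU g's local copy of chunk c.
    buf = [[list(g[c * size:(c + 1) * size]) for c in range(n)] for g in grads]

    # Phase one: reduce-scatter. Incoming chunks are added to the local copy.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so every send in a step is simultaneous.
        outgoing = [((g - step) % n, list(buf[g][(g - step) % n])) for g in range(n)]
        for g in range(n):
            chunk, payload = outgoing[(g - 1) % n]  # message from the left neighbour
            buf[g][chunk] = [a + b for a, b in zip(buf[g][chunk], payload)]

    # Phase two: all-gather. Incoming chunks overwrite the local copy.
    for step in range(n - 1):
        outgoing = [((g + 1 - step) % n, list(buf[g][(g + 1 - step) % n])) for g in range(n)]
        for g in range(n):
            chunk, payload = outgoing[(g - 1) % n]
            buf[g][chunk] = payload

    # Flatten the chunks back into one vector per GPU and divide by N for the mean.
    return [[v / n for chunk in buf[g] for v in chunk] for g in range(n)]


four_gpus = [[float(10 * g + i) for i in range(8)] for g in range(4)]
for g, averaged in enumerate(ring_allreduce_mean(four_gpus)):
    print(f"GPU{g}: {averaged}")
```

Every GPU prints the same vector, and in each step each GPU sent exactly one chunk and received exactly one chunk.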
Sketch it on paper with four GPUs and you can watch it work. Each GPU sends and receives roughly the same amount of data on every step. No node is hotter than any other.
Why this scales
The number that matters is total data moved per GPU. In a ring AllReduce on a gradient tensor of K bytes, each GPU sends 2(N-1)/N times K bytes across the whole operation and receives the same amount, split evenly between the two phases. As N grows, 2(N-1)/N approaches 2. The per-GPU traffic stays bounded by 2K however many workers you add.
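The factor falls straight out of the step count. Taking K as the size of the gradient tensor in bytes, each GPU sends one chunk of K/N bytes per step, and each phase has N-1 steps:

```latex
\[
\underbrace{(N-1)\cdot\frac{K}{N}}_{\text{reduce-scatter}}
\;+\;
\underbrace{(N-1)\cdot\frac{K}{N}}_{\text{all-gather}}
\;=\;
\frac{2(N-1)}{N}\,K,
\qquad
\lim_{N\to\infty}\frac{2(N-1)}{N}\,K \;=\; 2K .
\]
```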
That is the surprising bit, and it is the entire point. Add a fifth GPU to a ring of four and each GPU's outgoing data goes from 1.5K to 1.6K. Add a hundredth GPU and it sits just under 2K. Patarasuk and Yuan proved this construction is bandwidth-optimal: no AllReduce algorithm can move less data per GPU. The cost that does grow is the step count, 2(N-1) sequential hops, so latency rises with the length of the ring. NVIDIA's NCCL implements rings inside a single node where peer-to-peer links exist, and switches to a double binary tree at very large scale where latency dominates over raw bandwidth. NCCL 2.4 added the tree variant, and NVIDIA reported up to a 180-fold improvement in AllReduce latency over the ring at 24,576 GPUs on the Summit supercomputer. For most teams running tens of GPUs, the ring is what runs.
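The numbers in this section take a few lines to reproduce. The traffic factor and the ring's step count come directly from the algorithm; the tree depth is only a rough stand-in for how a tree's latency scales, an assumption for illustration and not a model of NCCL's double binary tree.

```python
import math


def ring_traffic_factor(n: int) -> float:
    """Bytes each GPU sends, as a multiple of the tensor size K."""
    return 2 * (n - 1) / n


def ring_steps(n: int) -> int:
    """Sequential communication steps in a ring AllReduce."""
    return 2 * (n - 1)


def rough_tree_depth(n: int) -> int:
    """Assumed proxy for tree latency: levels in a binary tree over n nodes."""
    return math.ceil(math.log2(n))


for n in (4, 5, 100, 24576):
    print(f"N={n:>6}: sends {ring_traffic_factor(n):.4f} K per GPU, "
          f"{ring_steps(n)} ring steps, tree depth about {rough_tree_depth(n)}")
```

Bandwidth per GPU flattens out almost immediately. The step count does not, and at tens of thousands of GPUs that is the term a tree attacks.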
How PyTorch hides the algorithm from you
Reading the source you would never see a ring being assembled. You import DistributedDataParallel, wrap your model, call backward, and the gradients come back synchronised. The hiding is the engineering.
Under the wrapper, three mechanisms work together.
First, gradient bucketing. PyTorch could launch one AllReduce per parameter, but a small AllReduce has launch overhead that swamps the actual transfer. So PyTorch packs parameters into buckets and launches one AllReduce per bucket. The default bucket cap is 25 MiB, set by the constant _DEFAULT_BUCKET_CAP_MB at the top of torch/nn/parallel/distributed.py. You can override it with the bucket_cap_mb argument when you wrap the model.
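As a rough picture of what bucketing means, here is a toy greedy packer. It is not PyTorch's reducer logic, whose details differ; the parameter names and sizes are invented, and the only things borrowed from the real system are the 25 MiB cap and the idea of walking parameters in reverse order, since the last layers' gradients are ready first.

```python
from typing import List, Tuple

MIB = 1024 * 1024


def assign_buckets(params: List[Tuple[str, int]], cap_bytes: int = 25 * MIB) -> List[List[str]]:
    """Greedily pack (name, size_in_bytes) pairs into buckets, last parameter first."""
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in reversed(params):
        # Close the current bucket if this parameter would push it over the cap.
        if current and current_bytes + nbytes > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Hypothetical model, listed in forward order.
model_params = [
    ("embed", 40 * MIB),
    ("layer1", 10 * MIB),
    ("layer2", 10 * MIB),
    ("layer3", 10 * MIB),
    ("head", 4 * MIB),
]
print(assign_buckets(model_params))
```

In this toy, a parameter bigger than the cap simply gets a bucket to itself. Five parameters become three AllReduce launches instead of five.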
Second, autograd hooks. When you construct the DDP wrapper, it walks the model's parameters and registers one backward hook on each. As the backward pass produces gradients in reverse layer order, the hook on each parameter marks that gradient as ready and checks whether every gradient in its bucket has now arrived. When a bucket is complete, the wrapper launches an AllReduce on that bucket without waiting for the rest of the backward pass. Crucially, the buckets are arranged in roughly reverse parameter order on purpose, because the gradients of the last layer finish first during backprop.
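The hook logic boils down to a countdown per bucket. The class below is a toy stand-in, not the PyTorch reducer: ToyReducer, on_grad_ready, and the bucket layout are all hypothetical, and "launching" an AllReduce is just appending to a list. The real reducer also deals with cases this ignores, such as unused parameters.

```python
from typing import Dict, List


class ToyReducer:
    """Counts down pending gradients per bucket and records launch order."""

    def __init__(self, buckets: List[List[str]]) -> None:
        self.bucket_of: Dict[str, int] = {
            name: index for index, bucket in enumerate(buckets) for name in bucket
        }
        self.pending: List[int] = [len(bucket) for bucket in buckets]
        self.launched: List[int] = []

    def on_grad_ready(self, name: str) -> None:
        """What a per-parameter hook would call once that gradient exists."""
        index = self.bucket_of[name]
        self.pending[index] -= 1
        if self.pending[index] == 0:
            # Stand-in for kicking off an asynchronous AllReduce on this bucket.
            self.launched.append(index)


reducer = ToyReducer([["head", "layer3", "layer2"], ["layer1"], ["embed"]])
# Backward produces gradients from the last layer to the first.
for param in ["head", "layer3", "layer2", "layer1", "embed"]:
    reducer.on_grad_ready(param)
    print(f"{param:>7} ready -> launched buckets so far: {reducer.launched}")
```

Because the buckets were laid out in reverse order, the first bucket completes while most of the backward pass is still ahead.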
Third, overlap. AllReduce on NCCL runs on its own CUDA stream. While bucket k is being averaged across the ring, the GPU is still computing gradients for earlier layers and filling bucket k+1. By the time the backward pass finishes the first layer, most of the AllReduces for the deeper layers are already done. The communication is hidden behind the computation that was going to happen anyway.
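A back-of-the-envelope timeline shows how much the overlap buys. This is a deliberately simple model under stated assumptions: one compute stream, one communication stream, buckets processed in order, and made-up timings in milliseconds.

```python
from typing import List


def backward_time(compute: List[float], comm: List[float], overlap: bool) -> float:
    """compute[i]: time to produce bucket i's gradients; comm[i]: time to AllReduce it."""
    if not overlap:
        # Finish the whole backward pass, then communicate everything.
        return sum(compute) + sum(comm)
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c
        # A bucket's AllReduce starts once its gradients exist and the comm stream is free.
        comm_done = max(comm_done, compute_done) + m
    return comm_done


compute_ms = [10.0, 10.0, 10.0, 10.0]
comm_ms = [6.0, 6.0, 6.0, 6.0]
print("sequential:", backward_time(compute_ms, comm_ms, overlap=False), "ms")
print("overlapped:", backward_time(compute_ms, comm_ms, overlap=True), "ms")
```

In this model only the last bucket's AllReduce is exposed; the rest hides behind compute. If communication per bucket takes longer than compute per bucket, the hiding stops working and the interconnect sets the pace.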
Put together, the three pieces give you what the DDP paper calls near-linear scaling, which Li and colleagues measured at up to 256 GPUs in their VLDB 2020 experiments. In practice, a two-GPU DDP run on consumer cards over PCIe gets you something around 1.9 times the single-GPU throughput, not the perfect 2. On NVLink the gap nearly closes.
The wall DDP runs into
DDP has one cost that is non-negotiable: every GPU holds the full model, the full gradients, and the full optimizer state. If the three together do not fit on a single card, DDP cannot help. For a 7B-parameter model in mixed-precision training, that is roughly 14 GB for weights, 14 GB for gradients, and another 56 GB for an Adam optimizer's two state tensors held in fp32: about 84 GB before activations. That already overflows a single 80 GB H100, and a 24 GB consumer card is nowhere close.
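The arithmetic is worth having as a function you can rerun for your own model. The byte widths below are the assumptions behind the numbers in the text: 2 bytes per weight and per gradient (fp16 or bf16), and two 4-byte Adam state tensors per parameter. The function name is made up, and activations are left out entirely.

```python
from typing import Dict

GB = 1e9


def ddp_state_gb(
    num_params: int,
    weight_bytes: int = 2,      # assumed fp16/bf16 weights
    grad_bytes: int = 2,        # assumed fp16/bf16 gradients
    adam_state_bytes: int = 4,  # assumed fp32 Adam moments
    adam_states: int = 2,       # first and second moment
) -> Dict[str, float]:
    """Per-GPU memory for replicated training state, excluding activations."""
    weights = num_params * weight_bytes / GB
    gradients = num_params * grad_bytes / GB
    optimizer = num_params * adam_state_bytes * adam_states / GB
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


for part, gigabytes in ddp_state_gb(7_000_000_000).items():
    print(f"{part:>9}: {gigabytes:.0f} GB")
```

Compare the total against the memory on one card, not against the sum across the node: under DDP every card pays the full bill.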
When the model itself outgrows the card, the right answer is not DDP. It is something that shards the model across GPUs instead of replicating it: Fully Sharded Data Parallel, ZeRO-3, tensor parallelism, or pipeline parallelism. Those are the next blogs in this series. DDP remains the right tool whenever the model fits and you want more throughput.
A common landmine: when you load a model with HuggingFace's device_map="auto" and then wrap it in DDP, the auto-mapper can spread the model's layers across every visible GPU, which contradicts DDP's assumption that every card holds the full model. The training either hangs silently or produces garbage gradients. The fix is to pass a device_map that pins the whole model onto your local rank's GPU before wrapping, for example device_map={"": local_rank}.
The other common pitfall is gradient accumulation. By default, every backward call inside an accumulation loop fires AllReduce. If you accumulate for four steps, you have done three AllReduces you did not need. DDP exposes a model.no_sync() context manager that suppresses the AllReduce on intermediate backward calls; you only let it fire on the final accumulation step. Skip this and you can lose a meaningful fraction of your throughput to redundant communication.
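The control flow is the part people get wrong, so here it is in isolation. ToyDDP below is a stand-in that only counts AllReduce launches; it is not the PyTorch class, and its backward method is a simplification. In real code the forward pass and loss.backward() both go inside the with block, and the model is an actual DistributedDataParallel instance.

```python
from contextlib import contextmanager, nullcontext


class ToyDDP:
    """Stand-in that only counts AllReduce launches; not the PyTorch class."""

    def __init__(self) -> None:
        self.allreduce_calls = 0
        self._sync = True

    @contextmanager
    def no_sync(self):
        previous = self._sync
        self._sync = False
        try:
            yield
        finally:
            self._sync = previous

    def backward(self) -> None:
        # Gradients would accumulate locally either way; only the sync is optional.
        if self._sync:
            self.allreduce_calls += 1


def run_accumulation(model: ToyDDP, micro_batches: int, use_no_sync: bool) -> int:
    for i in range(micro_batches):
        is_last = i == micro_batches - 1
        # Suppress the AllReduce on every micro-batch except the final one.
        context = model.no_sync() if (use_no_sync and not is_last) else nullcontext()
        with context:
            model.backward()
    return model.allreduce_calls


print("without no_sync:", run_accumulation(ToyDDP(), 4, use_no_sync=False))
print("with no_sync:   ", run_accumulation(ToyDDP(), 4, use_no_sync=True))
```

The final micro-batch runs outside no_sync on purpose: that is the one backward call whose AllReduce carries the accumulated gradients.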
What this looks like in 2026
Multi-GPU training has crossed the line where most teams who need it have access to it. AWS, GCP, Azure, Lambda Labs, Modal, and CoreWeave all rent A100, H100, and H200 nodes by the hour. PyTorch 2 ships DDP as its standard data-parallel wrapper for any model that fits, and the surrounding ecosystem (HuggingFace Accelerate, Lightning, and torchtune) wraps it with a few lines.
The interesting frontier is no longer the AllReduce itself. The ring is bandwidth-optimal, the tree handles latency at scale, and NCCL chooses between them automatically. The frontier is everything around it: communication compression for slow interconnects, async variants that relax the AllReduce barrier, and AllReduce specialised for the cases where the model and the gradients fit different layouts. These belong to later posts in the series.
The takeaway is concrete. The picture in your head should be a ring of GPUs passing chunks of a gradient around in two phases, with PyTorch's reducer quietly stitching the algorithm into the backward pass through bucketed hooks. Once that picture is steady, every other distributed training topic in this series sits on top of it.
Council summary
DDP exists because eight local gradients have to become one averaged gradient before any optimizer can move, and the AllReduce is what does that work. The ring construction by Patarasuk and Yuan, brought to deep learning by Baidu and shipped inside NCCL, is bandwidth-optimal: per-GPU traffic stays close to 2K regardless of cluster size. PyTorch then hides the algorithm behind 25 MiB gradient buckets, autograd hooks fired in reverse layer order, and a separate CUDA stream that overlaps communication with the backward pass. Hold that picture and the rest of the series, where the model itself stops fitting on one card, follows naturally.