From One GPU to Many: A Distributed Training Playbook

A single GPU is enough until it isn't. The real question is which constraint broke first, and what the cheapest fix is before you reach for tensor parallelism.

Most teams reach for distributed training too early. Someone benchmarks a 7B model, sees an out-of-memory error, and the next message in Slack is a request for an 8x H100 node. The run that follows costs ten times what it should, and the loss curve does not look any better. The team has solved a problem they did not have, using a tool they did not need.

The reverse mistake is just as common. A team forces a model onto one card with aggressive offloading, the step time creeps past two seconds, and a week of training becomes a month. They are paying with wall-clock time for hardware they could have rented for the afternoon.

Both failures come from skipping a question. When a single GPU is no longer enough, which constraint actually broke, and what is the cheapest fix that respects that constraint? This post is the overview piece for a seven-part series. The other six go deep on DDP, FSDP, tensor parallelism, pipeline parallelism, 3D parallelism, and expert parallelism. This one keeps you out of the wrong rabbit hole in the first place.

Origin: the memory bill nobody warned you about

Modern training has four memory line items, and only one is the part everyone talks about.

Parameters are the obvious one. A 7B model in BF16 is 14 GB of weights, because every parameter takes 2 bytes. A 13B model is 26 GB. A 70B model is 140 GB, which is already more than any single 80 GB card holds, let alone a consumer card.

Gradients are the same shape as the parameters, so the same arithmetic applies. They live alongside the weights in the same precision during training, so a 13B BF16 model gives you a second 26 GB block.

Optimizer states are the line item that surprises people, and the original ZeRO paper from Microsoft (Rajbhandari et al., 2019) is the clearest account of why. Adam keeps a running mean and a running squared mean per parameter, and mixed-precision training also keeps an FP32 master copy of the weights. All three live at 32-bit, which works out to 12 bytes per parameter on top of the 4 bytes of weights and gradients. The DeepSpeed team summarises the rule plainly in their ZeRO post: with Adam in mixed precision, a model uses about 16 bytes per parameter all-in, and 12 of those 16 sit in the optimizer.
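The 16-bytes-per-parameter rule is easy to turn into a quick estimate. The sketch below assumes Adam in mixed precision exactly as described above (2-byte weights, 2-byte gradients, 12 bytes of FP32 optimizer state) and ignores activations and framework overhead. The name `model_state_gb` is illustrative, not a library function.

```python
# Bytes per parameter for mixed-precision Adam, following the ZeRO accounting.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "optimizer_fp32": 12,  # FP32 master weights + Adam mean + Adam squared mean
}

def model_state_gb(params_billion: float) -> dict:
    """Return a per-line-item memory estimate in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB reduces to a product.
    breakdown = {name: params_billion * b for name, b in BYTES_PER_PARAM.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

for size in (7, 13, 70):
    est = model_state_gb(size)
    print(f"{size}B: weights {est['weights_bf16']:.0f} GB, "
          f"optimizer {est['optimizer_fp32']:.0f} GB, total {est['total']:.0f} GB")
```

For a 13B model this comes to 208 GB of model state, of which 156 GB is optimizer state.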

Activations are the fourth line item and the most volatile. They are the intermediate tensors produced during the forward pass that the backward pass needs to compute gradients. The shape that matters is well-documented in the Megatron team's 2022 activation recomputation paper (Korthikanti et al.): per-layer activation memory grows linearly with hidden size and batch size, and it picks up a term that is quadratic in sequence length, because a standard attention implementation stores a sequence-by-sequence score matrix per head. At long context lengths, activations often outweigh the weights.
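To see the sequence-length term concretely, here is a small calculator based on the per-layer estimate in Korthikanti et al., roughly s * b * h * (34 + 5 * a * s / h) bytes for sequence length s, micro-batch b, hidden size h and a attention heads. It assumes 16-bit activations, no tensor or sequence parallelism, no recomputation, and an attention implementation that materialises the full score matrix. The function name and the example shape are illustrative.

```python
def activation_bytes_per_layer(seq_len: int, micro_batch: int,
                               hidden: int, heads: int) -> float:
    """Approximate activation bytes stored by one transformer layer."""
    linear_part = 34.0                               # terms that scale with s*b*h
    attention_part = 5.0 * heads * seq_len / hidden  # score-matrix terms: extra factor of s
    return seq_len * micro_batch * hidden * (linear_part + attention_part)

# Illustrative shape only: hidden size 12288 with 96 heads, micro-batch of 1.
for s in (2048, 8192, 32768):
    gb = activation_bytes_per_layer(s, 1, 12288, 96) / 1e9
    print(f"seq_len={s:>6}: {gb:8.1f} GB per layer")
```

Quadrupling the sequence length multiplies the estimate by well over four, which is the quadratic term taking over from the linear one.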

Mixed precision shifts these numbers, and not symmetrically. Forward and backward run in BF16 or FP16, which halves the parameter and gradient bill. The optimizer states stay at FP32, which is why ZeRO-1 attacks them first. NVIDIA's documentation on the Hopper architecture puts a single H100 SXM at 80 GB of HBM3 with 3.35 TB/s of memory bandwidth and 900 GB/s of NVLink. That is what one card gives you. That is what the rest of this framework has to fit inside.

Present: the three failure modes that push you off one GPU

Once you can do the memory math, the decision becomes mechanical. There are exactly three ways a training run breaks a single GPU.

The model is too big. Parameters plus gradients plus optimizer states blow past 80 GB. A 13B Adam training run in mixed precision is the canonical example: 16 bytes per parameter, so roughly 208 GB before activations. A single H100 is gone, and even a 192 GB B200 falls short before activations get in the door.

The batch is too small. The model fits, but the only batch that fits is so small that gradient updates are noisy and convergence drags. You want an effective batch of 1 million tokens and the card holds 32k. Throughput, not capacity, is the constraint.

The wall-clock is too long. The model fits, the batch fits, but one epoch over your training set takes a week. You will not iterate at that pace, and the team will stop running experiments.

These three failure modes do not share a solution. That is the single most useful idea in distributed training, and the Conscious Engines overview on the topic is good at hammering it home. Fixing memory does not fix throughput, and fixing throughput does not fix memory. Choose the technique that matches the constraint that actually broke.

The cheap fixes first

Before any multi-GPU setup, run through three single-GPU techniques. Each one buys you time.

Mixed precision. BF16 forward and backward halves the per-parameter cost of weights and activations relative to FP32. BF16 has the same exponent range as FP32, so it does not need the loss scaling that FP16 requires. PyTorch ships native AMP, and it is essentially free to turn on. If you are not running BF16 on an Ampere-or-newer card, you are paying for tensor-core hardware and leaving it idle.
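The range argument can be checked from the bit layouts alone. The sketch below computes the largest finite value of each format from its exponent and mantissa widths (FP32: 8 and 23 bits, BF16: 8 and 7, FP16: 5 and 10). `max_finite` is a helper written for this illustration.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

FORMATS = {"FP32": (8, 23), "BF16": (8, 7), "FP16": (5, 10)}
for name, (e, m) in FORMATS.items():
    print(f"{name}: max finite ~ {max_finite(e, m):.3e}, relative step 2^-{m}")
```

BF16 reaches roughly the same magnitude as FP32 while FP16 tops out at 65504, which is why FP16 gradients need loss scaling. The price BF16 pays is precision: 7 mantissa bits against FP16's 10.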

Gradient checkpointing. Also called activation recomputation. The trick was first described in Chen et al. (2016), which showed an O(sqrt(n)) memory algorithm for an n-layer network at the cost of one extra forward pass per minibatch. The 2022 Megatron paper sharpens this with selective recomputation: applying it only to the core attention operations (the score matrix, softmax and dropout, where activation memory is worst and recomputation is cheap) cuts activation memory by around 70 percent at roughly 2.7 percent FLOP overhead, against the 30 to 40 percent overhead of full recomputation.
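The square-root result falls out of a simple cost model. Assume, as a simplification, that every layer's activation costs one unit, that you keep one checkpoint per segment of k layers, and that backward rematerialises one segment at a time. Peak memory is then about n/k + k units, which is smallest near k = sqrt(n). The helper names are illustrative.

```python
import math

def peak_units(num_layers: int, segment: int) -> int:
    """Stored checkpoints plus the activations of the one segment being recomputed."""
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment

def best_segment(num_layers: int) -> int:
    """Segment length that minimises peak memory under this toy model."""
    return min(range(1, num_layers + 1), key=lambda k: peak_units(num_layers, k))

n = 64
k = best_segment(n)
print(f"{n} layers: no checkpointing stores {n} units; "
      f"segments of {k} store {peak_units(n, k)} units")
```

Real layers are not uniform in size, so treat this as intuition for the scaling rather than a tuning recipe.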

Gradient accumulation. Run forward and backward over several micro-batches before stepping the optimizer, summing gradients in place. Effective batch size goes up, peak memory does not. It buys no speed, since every sample still passes through the same card, but it removes the gradient noise of small batches.
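A toy check makes the equivalence explicit. The sketch below uses a one-parameter model with mean squared error, in plain Python rather than any framework, and assumes equal-sized micro-batches with each micro-batch gradient scaled by one over the number of accumulation steps. All names are illustrative.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w: float, xs: list, ys: list, micro_batch: int) -> float:
    """Sum per-micro-batch gradients, each scaled by 1 / number of micro-batches."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-sized micro-batches"
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps  # scale, then accumulate
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)
print(f"full-batch gradient {full:.6f}, accumulated gradient {accum:.6f}")
```

Forgetting the scaling step is the usual bug: the accumulated gradient then comes out larger by a factor equal to the number of micro-batches. For the earlier example of a 1-million-token target on a card that holds 32k tokens, that is roughly 32 micro-batches per optimizer step.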

Many runs that look like they need eight GPUs actually need BF16 plus checkpointing plus a slightly bigger effective batch.

When you do need more than one GPU: the sequence

Once the single-GPU tricks are exhausted, the order in which you reach for distributed techniques matters, and it is not the order most teams guess.

1. Distributed Data Parallel (DDP)

If the model still fits on one card and the only problem is wall-clock, this is the right answer. DDP replicates the model on every GPU, splits the batch across them, and averages gradients with bucketed all-reduce calls (commonly a ring all-reduce) that start while the backward pass is still running. Because that communication overlaps with backward compute, much of the sync cost is hidden, which is why DDP scales well over modest interconnects. PyTorch's introduction to DistributedDataParallel is the canonical reference. The thing to keep in mind: DDP solves throughput, never memory. Every rank still holds a full copy of the weights, gradients and optimizer states.
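For intuition about what the gradient sync does, here is a toy simulation of a ring all-reduce in plain Python. It is a sketch of the algorithm, not of DDP's internals: in practice the communication backend picks the algorithm and works on buckets of tensors. Each simulated rank holds one number per chunk, and the function name is illustrative.

```python
def ring_all_reduce(per_rank_chunks):
    """Simulate a ring all-reduce (sum). Each rank holds one value per chunk and
    there are as many chunks as ranks. Returns (result_per_rank, chunks_sent)."""
    n = len(per_rank_chunks)
    data = [list(chunks) for chunks in per_rank_chunks]
    assert all(len(chunks) == n for chunks in data), "need one chunk per rank"
    sent = 0

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so every rank sends "simultaneously".
        messages = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in messages:
            data[(r + 1) % n][chunk] += value
            sent += 1

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        messages = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in messages:
            data[(r + 1) % n][chunk] = value
            sent += 1

    return data, sent

grads = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0],
         [100.0, 200.0, 300.0, 400.0],
         [1000.0, 2000.0, 3000.0, 4000.0]]
summed, sent = ring_all_reduce(grads)
averaged = [[v / len(grads) for v in rank] for rank in summed]  # DDP averages gradients
print(summed[0], "chunks sent per rank:", sent // len(grads))
```

Each rank sends 2 * (n - 1) chunks of size 1/n, so per-rank traffic stays just under twice the gradient size however many ranks join the ring.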

2. ZeRO-1 and ZeRO-2

If the model fits but optimizer state does not, ZeRO Stage 1 shards only the optimizer states across ranks and delivers up to a 4 times reduction in model-state memory at communication volume comparable to plain DDP, per the Microsoft team's analysis. ZeRO-2 also shards the gradients for up to an 8 times reduction. Both figures are limits approached as the number of ranks grows; on a handful of GPUs the saving is smaller. You keep most of the simplicity of data parallelism and remove the most expensive line item from each card.

3. Fully Sharded Data Parallel (FSDP)

This is the right answer when the model itself does not fit. FSDP, the PyTorch native implementation of ZeRO-3, shards parameters, gradients and optimizer states across ranks. Each layer is gathered just in time for its forward pass, used, and resharded. Memory scales close to linearly with the number of GPUs. The cost is communication: where DDP does a single gradient all-reduce per step, FSDP communicates at every layer boundary, in both directions. On NVLink that is fine. On PCIe it can dominate. The PyTorch FSDP tutorial and the Hugging Face FSDP integration guide are the practical references.
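The per-GPU effect of each sharding stage follows directly from the 16-bytes-per-parameter accounting. The sketch below assumes mixed-precision Adam (2 + 2 + 12 bytes per parameter), treats plain DDP as stage 0 and full sharding as stage 3, and ignores activations, buffers and fragmentation. The function name is illustrative.

```python
def per_gpu_state_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Model-state memory per GPU in GB for ZeRO stage 0 (plain DDP) through 3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter, mixed-precision Adam
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 / full sharding: also shard parameters
    return params_billion * (weights + grads + optim)

for stage in range(4):
    print(f"13B on 8 GPUs, stage {stage}: "
          f"{per_gpu_state_gb(13, 8, stage):6.2f} GB per GPU")
```

On 8 GPUs a 13B model drops from 208 GB per card to 71.5 GB with stage 1, 48.75 GB with stage 2 and 26 GB with stage 3. Stages 1 and 2 only approach their 4 times and 8 times headline savings as the GPU count grows, because the unsharded 4 or 2 bytes per parameter set a floor.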

4. Tensor parallelism

For models large enough that an individual layer is too big for one GPU, or for setups where the interconnect is too slow for FSDP, tensor parallelism from the Megatron-LM paper (Shoeybi et al., 2019) is the next step. The weight matrices inside each layer are split across GPUs, and the GPUs synchronise within each layer through all-reduce. Tensor parallelism moves activations rather than parameters, which is often less data per step, but it does so several times per layer: in the Megatron-LM scheme, two all-reduces in the forward pass and two in the backward pass of every transformer layer. It is fast inside a node where NVLink is available, painful across nodes where it is not. The Megatron team reported training an 8.3B model on 512 GPUs at 76 percent scaling efficiency relative to a strong single-GPU baseline.
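The split is easiest to see on the MLP block. In the Megatron-LM scheme the first weight matrix is cut by columns and the second by rows, so each GPU runs its own slice end to end and a single all-reduce (a sum) recombines the outputs. The sketch below checks that identity with plain Python lists standing in for tensors and a loop standing in for the GPUs. All names and the tiny matrices are illustrative.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def mlp(x, a, b):
    """Unsharded reference: Z = GeLU(X A) B."""
    return matmul(gelu(matmul(x, a)), b)

def tensor_parallel_mlp(x, a, b, world_size):
    """Each simulated rank holds a column slice of A and the matching row slice of B."""
    hidden = len(b)  # rows of B == columns of A
    assert hidden % world_size == 0, "sketch assumes an even split"
    shard = hidden // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        a_cols = [row[lo:hi] for row in a]  # column slice of A
        b_rows = b[lo:hi]                   # row slice of B
        partials.append(matmul(gelu(matmul(x, a_cols)), b_rows))
    # The all-reduce: an element-wise sum of the per-rank partial outputs.
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]

x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8], [-0.9, 1.0, 1.1, -1.2]]
b = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6], [0.7, 0.8]]
reference = mlp(x, a, b)
sharded = tensor_parallel_mlp(x, a, b, world_size=2)
print(reference, sharded)
```

The column split is what lets GeLU run locally: the nonlinearity is element-wise, so each rank can apply it to its own columns without first seeing anyone else's.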

5. Pipeline parallelism

For models that span multiple nodes, pipeline parallelism from the GPipe paper (Huang et al., 2018) groups consecutive layers into stages and assigns each stage to a different GPU. Activations flow forward and gradients flow back, in micro-batches, like an assembly line. The trick is that communication only happens at stage boundaries, so a slow inter-node link is tolerable. The cost is the pipeline bubble: the warm-up and cool-down phases when not every stage is busy. Splitting the batch into more micro-batches shrinks the bubble, but every micro-batch in flight keeps its activations alive, so memory pressure rises.
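The size of the bubble has a simple closed form under idealised assumptions: a GPipe-style schedule, equal compute time per stage, and negligible communication time. With p stages and m micro-batches, the idle fraction of each step is (p - 1) / (m + p - 1). The function name is illustrative.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of a step that a pipeline stage spends idle in warm-up and cool-down."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)

for m in (4, 8, 32, 128):
    print(f"4 stages, {m:>3} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With 4 stages, going from 4 to 32 micro-batches cuts the idle share from about 43 percent to under 9 percent, which is why pipeline schedules lean on many micro-batches per step.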

6. Combinations: 3D parallelism

The frontier scale runs do not pick one. Microsoft and NVIDIA's Megatron-Turing NLG 530B work used 8-way tensor parallelism inside a node, 35-way pipeline parallelism across nodes, and data parallelism across pipeline replicas. The Megatron-LM scaling paper reported a 1T-parameter run at 502 petaFLOP/s on 3072 GPUs, 52 percent of theoretical peak per GPU. The implementation is harder, the topology assumptions are stricter, and you are firmly in the world of multi-week engineering, not a checkbox in a config.
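The headline numbers above are easy to sanity-check. The sketch below multiplies out the Megatron-Turing layout and converts the 1T-parameter result into per-GPU throughput. The 312 teraFLOP/s peak is an assumption added here (the dense 16-bit peak of an A100, the GPU used in that work), and the function names are illustrative.

```python
def gpus_per_model_replica(tensor_parallel: int, pipeline_parallel: int) -> int:
    """One full copy of the model occupies TP x PP GPUs; data parallelism repeats it."""
    return tensor_parallel * pipeline_parallel

def per_gpu_tflops(aggregate_pflops: float, num_gpus: int) -> float:
    return aggregate_pflops * 1000.0 / num_gpus

ASSUMED_PEAK_TFLOPS = 312.0  # assumption: A100 dense FP16/BF16 peak

replica = gpus_per_model_replica(8, 35)
achieved = per_gpu_tflops(502.0, 3072)
print(f"MT-NLG layout: {replica} GPUs per model replica")
print(f"1T run: {achieved:.0f} TFLOP/s per GPU, "
      f"{achieved / ASSUMED_PEAK_TFLOPS:.0%} of assumed peak")
```

The result is 280 GPUs for a single Megatron-Turing replica and about 163 teraFLOP/s per GPU for the 1T run, which lines up with the reported 52 percent of peak.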

A decision sequence you can actually use

Start at the top. Stop at the first technique that fits.

  1. Can you finetune instead of training from scratch? If so, look at LoRA and QLoRA before any of this.
  2. Does the model fit, or nearly fit, on one card? Turn on BF16 and gradient checkpointing. Try a bigger effective batch with gradient accumulation.
  3. Does the model fit but wall-clock is the problem? Use DDP.
  4. Does the model fit but optimizer state is the problem? Use ZeRO-1 or ZeRO-2.
  5. Does the model not fit? Use FSDP if your interconnect is fast (NVLink, NVSwitch). Use tensor parallelism inside a node if it is not.
  6. Is a single layer too big? Use tensor parallelism.
  7. Does the model span nodes? Add pipeline parallelism across the slow link.
  8. Are you in trillion-parameter territory? You are doing 3D parallelism, and you have a dedicated systems team.
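The sequence above can be sketched as a small function. This is one possible reading of the list, not a definitive encoding: the `RunProfile` fields and the returned strings are hypothetical names invented for this illustration, and a real decision would start from measured memory and step times rather than booleans.

```python
from dataclasses import dataclass

@dataclass
class RunProfile:
    can_finetune: bool = False
    full_state_fits_one_gpu: bool = False        # weights + grads + optimizer, after BF16 and checkpointing
    weights_and_grads_fit_one_gpu: bool = False  # only the optimizer state overflows
    wall_clock_ok: bool = True
    fast_interconnect: bool = True               # NVLink / NVSwitch
    layer_fits_one_gpu: bool = True
    spans_nodes: bool = False
    trillion_scale: bool = False

def choose_technique(run: RunProfile) -> str:
    """Walk the list top to bottom and stop at the first technique that fits."""
    if run.can_finetune:
        return "LoRA / QLoRA"
    if run.full_state_fits_one_gpu:
        if run.wall_clock_ok:
            return "single GPU: BF16 + gradient checkpointing + gradient accumulation"
        return "DDP"
    if run.weights_and_grads_fit_one_gpu:
        return "ZeRO-1 or ZeRO-2"
    if run.trillion_scale:
        return "3D parallelism"
    if not run.layer_fits_one_gpu or not run.fast_interconnect:
        plan = "tensor parallelism inside a node"
    else:
        plan = "FSDP"
    if run.spans_nodes:
        plan += " + pipeline parallelism across nodes"
    return plan

print(choose_technique(RunProfile(full_state_fits_one_gpu=True, wall_clock_ok=False)))
print(choose_technique(RunProfile(spans_nodes=True)))
```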

The order is not arbitrary. Each step costs more in engineering and in interconnect requirements than the one before it, and skipping ahead almost always produces a slower, more expensive training run than the one you should have built.

Future and impact

The shape of distributed training is changing in two directions at once.

At the high end, memory keeps moving onto fewer, fatter cards. The B200 brings 192 GB per GPU on HBM3e at 8 TB/s, enough to put a 70B model in FP16 inference on one card and to bring a 13B Adam finetune close to single-GPU range again: its roughly 208 GB of model state still overshoots 192 GB, but only narrowly. The new generation will not eliminate distributed training, but it collapses a whole tier of decisions. Some runs that needed 8 GPUs in 2023 will need far fewer in 2026. Whether that shrinks the market for distributed-training engineers or just pushes the frontier up by an order of magnitude is the honest open question; in practice it does both, unevenly.

At the low end, the interconnect story matters more than the parameter count. FSDP wins on H100 SXM and loses on consumer PCIe boxes because of the gap between NVLink's 900 GB/s (a bidirectional total, so about 450 GB/s each way) and PCIe Gen5 x16 at roughly 64 GB/s per direction. Tensor parallelism gains ground on slow links because it moves activations rather than parameters. Expert parallelism, covered in the seventh post of this series, changes the shape again by activating only a fraction of parameters per token.
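A back-of-envelope estimate shows how large the gap is. The sketch below assumes fully sharded training moves about three parameter-sized collectives per step (an all-gather in forward, another in backward, and a gradient reduce-scatter), that each collective sends (N - 1) / N of the tensor per rank, and that nothing overlaps with compute. The link speeds are assumptions: 450 GB/s per direction for NVLink (half the 900 GB/s bidirectional figure) and 64 GB/s per direction for PCIe Gen5 x16. Treat the output as a rough lower bound on exposed communication, not a prediction.

```python
def sharded_comm_seconds(params_billion: float, num_gpus: int,
                         link_gb_per_s: float, bytes_per_param: int = 2) -> float:
    """Rough per-step communication time for fully sharded data parallelism."""
    payload_gb = params_billion * bytes_per_param           # BF16 parameters or gradients
    per_collective_gb = payload_gb * (num_gpus - 1) / num_gpus
    return 3 * per_collective_gb / link_gb_per_s             # 2 all-gathers + 1 reduce-scatter

NVLINK_GB_S = 450.0  # assumption: per-direction share of 900 GB/s
PCIE5_GB_S = 64.0    # assumption: PCIe Gen5 x16, per direction

for name, bw in (("NVLink", NVLINK_GB_S), ("PCIe Gen5", PCIE5_GB_S)):
    print(f"13B on 8 GPUs over {name}: {sharded_comm_seconds(13, 8, bw):.2f} s per step")
```

Under these assumptions the same step spends about 0.15 s communicating over NVLink and a little over 1 s over PCIe, roughly a sevenfold difference before any overlap with compute.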

Two risks are worth naming. Distributed training is a place where small misconfigurations cost a lot: a rank desync, a wrong process group, a stale NCCL version, and a 32-GPU run produces a worse loss than a 1-GPU run did. And the cost of a failed multi-day training job is paid in real money, whether the model converged or not. Make the topology and the launcher script boring before you scale them.

If your team is shipping AI features rather than training base models, the practical answer is usually further down the list than you expected. Finetune. Cache. Reach for distributed training only when the math says you must.

Council summary

This post argues one point: when one GPU runs out, identify which of memory, batch size, or wall-clock broke, then pick the cheapest fix for that constraint. The review verified every primary-source claim against the original papers (the ZeRO 12 of 16 bytes per parameter for mixed-precision Adam, Megatron 8.3B at 76 percent scaling on 512 GPUs, the 502 petaFLOP/s 1T run on 3072 GPUs, H100 SXM at 80 GB with 900 GB/s NVLink, B200 at 192 GB) and corrected an internal arithmetic error: a 13B model in mixed precision needs roughly 208 GB end-to-end, not 156 GB. Selective recomputation numbers were re-sourced from Korthikanti et al. (2022) at around 70 percent activation savings for 2.7 percent FLOP overhead. The reader walks away with a numbered sequence (LoRA, BF16 plus checkpointing, DDP, ZeRO, FSDP, tensor, pipeline, 3D) and a rule for when to stop climbing.
