3D Parallelism: Trillion-Parameter Models on GPU Meshes

No single parallelism trick trains a frontier model. The fix is a 3D GPU mesh where each axis gets the network it deserves.

A trillion-parameter language model needs roughly 16 terabytes of memory just to hold weights, gradients, and optimizer state in mixed precision. An H100 has 80 gigabytes. The math is simple and humbling. No single GPU on Earth can train one of these models, and no single parallelism trick scales to the cluster sizes you need.
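The 16-terabyte figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the usual mixed-precision Adam accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master copy, momentum, and variance). The function names are illustrative, and activations are ignored entirely.

```python
# Back-of-envelope memory for mixed-precision Adam training.
# Assumed bytes per parameter (a common accounting, not stated in the post):
#   2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_bytes(num_params: int) -> int:
    """Bytes needed for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM


def min_gpus_for_state(num_params: int, gpu_bytes: int) -> int:
    """Smallest GPU count whose combined memory holds the training state.

    Uses ceiling division and ignores activations and fragmentation.
    """
    return -(-training_state_bytes(num_params) // gpu_bytes)


if __name__ == "__main__":
    params = 10**12               # one trillion parameters
    h100_bytes = 80 * 10**9       # 80 GB, decimal gigabytes
    print(training_state_bytes(params) / 1e12, "TB of training state")
    print(min_gpus_for_state(params, h100_bytes), "GPUs just to hold it")
```

Under these assumptions the floor is 200 GPUs before a single activation is stored, which is why sharding the model is a requirement rather than an optimisation.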

The trick that actually ships frontier models is to use three at once. Tensor parallelism splits each layer's matrices across a handful of GPUs that share a fast bus. Pipeline parallelism cuts the model into stages, one per node, and streams microbatches through them. Data parallelism replicates the whole pipeline across the cluster and averages gradients between replicas. Stacked together they form a 3D mesh, with each axis aimed at the network tier whose cost it can survive. This post is about that mesh, why it has exactly that shape, and the configurations the labs actually use.

This is the fourth post in our distributed training series. Earlier posts cover tensor parallelism inside a node, the pipeline bubble problem, and ZeRO and gradient sharding. This one is where the three ideas combine.

Origin: three answers to three different problems

3D parallelism was not designed. It accreted. Each axis grew out of a different bottleneck a team hit while trying to train something larger than the previous record.

Data parallelism is the old one. Run the same model on every worker, give each one a different slice of the batch, average the gradients, repeat. PyTorch's DistributedDataParallel made it routine years ago. Throughput scales close to linearly until you hit two walls: every GPU has to hold the entire model, and the global batch size you need to keep every replica busy starts to hurt convergence. For GPT-3 sized models, both walls hit before you get past a few thousand GPUs.

Tensor parallelism showed up in 2019 with the first Megatron-LM paper from NVIDIA. Mohammad Shoeybi and colleagues observed that a transformer's two heaviest operations, the attention projection and the MLP, are both giant matrix multiplications, and a matrix multiplication is trivially splittable along its rows or columns. Cut the weights, scatter them across GPUs, do the partial work, sum the results with an AllReduce. Each layer now needs two AllReduces in the forward pass and two more in the backward pass. That sounds cheap, but it happens for every layer of every microbatch, so the AllReduces have to be fast. On a DGX A100, where eight GPUs sit on a single NVSwitch fabric at 600 GB/s, it is fast enough. Across an InfiniBand link at 200 Gb/s, it collapses. In practice, tensor parallelism only works inside a node.
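The split is easier to believe once you see the shards add back up. Below is a toy, single-process sketch of the MLP forward pass: the first weight matrix is cut by columns, the second by rows, and the per-shard outputs are summed. It uses plain Python lists, ReLU as a stand-in for the real activation, and a hypothetical all_reduce_sum helper in place of an actual collective. None of this is Megatron's code.

```python
# Toy tensor-parallel MLP forward pass on plain Python lists.
# Each "shard" stands in for one GPU; all_reduce_sum stands in for the AllReduce.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU (stand-in for the model's real activation)."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Cut a matrix into equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Cut a matrix into equal-height row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Elementwise sum of same-shaped matrices, one per shard."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, w1, w2):
    """Unsharded MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, tp):
    """Column-split w1, row-split w2, one sum at the end."""
    w1_shards = split_columns(w1, tp)
    w2_shards = split_rows(w2, tp)
    # Each shard works independently; the activation needs no communication
    # because it is elementwise and each shard owns whole columns.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)


if __name__ == "__main__":
    x = [[1, -2], [3, 4]]
    w1 = [[1, 0, -1, 2], [2, 1, 0, -3]]
    w2 = [[1, 2], [0, 1], [-1, 0], [3, 1]]
    print(mlp_tensor_parallel(x, w1, w2, tp=2) == mlp_reference(x, w1, w2))
```

The column-then-row ordering is the point of the design: it lets the nonlinearity run locally and defers all communication to a single sum at the end of the block.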

Pipeline parallelism arrived from two directions at once. Google's GPipe paper in 2018 cut a model into sequential stages and reformulated training as a software pipeline: stage 1 feeds stage 2 feeds stage 3, with microbatches flowing through. The catch is the bubble. Later stages sit idle while the early ones warm up, and early stages sit idle while they wait for gradients to come back from the last one. PipeDream and the second Megatron paper sharpened the schedule, first with 1F1B (one forward, one backward, alternating) and then with an interleaved variant that assigns each device several non-adjacent virtual stages. Narayanan and colleagues showed the interleaved schedule shrinks the bubble fraction by a factor equal to the number of virtual chunks per device, typically down to a single-digit percentage in the configurations that ship, at the cost of more cross-stage communication.
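To put numbers on the bubble, here is a small sketch using the form of the analysis usually quoted from the Megatron work: with p stages and m microbatches, the bubble is (p - 1) / m of the ideal compute time, and interleaving with v virtual chunks per device divides that by v. It assumes every stage takes the same time per microbatch, and the function name is illustrative.

```python
def bubble_fraction(pipeline_stages: int, microbatches: int, virtual_chunks: int = 1) -> float:
    """Pipeline bubble time as a fraction of ideal compute time.

    Assumes equal-cost stages. virtual_chunks=1 is plain 1F1B;
    larger values model the interleaved schedule.
    """
    if min(pipeline_stages, microbatches, virtual_chunks) < 1:
        raise ValueError("all arguments must be positive integers")
    return (pipeline_stages - 1) / (virtual_chunks * microbatches)


if __name__ == "__main__":
    # Hypothetical example: 8 stages, 32 microbatches.
    print(f"1F1B:        {bubble_fraction(8, 32):.1%}")
    print(f"interleaved: {bubble_fraction(8, 32, virtual_chunks=4):.1%}")
```

Note that the fraction is measured against ideal compute time, not total step time, so it can exceed 100 percent when there are fewer microbatches than stages.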

By 2021 all three existed as separate tools. The Megatron-LM paper from Narayanan and colleagues at NVIDIA, Stanford, and Microsoft was the one that put them in a mesh and showed the numbers. They trained a 1 trillion parameter dense model on 3,072 A100 GPUs at 502 petaFLOP per second, hitting 52 percent of peak. That was the moment 3D parallelism became the default.

Present: the mesh, drawn carefully

The mesh is the part that takes a minute to internalise, so it is worth drawing it on paper at least once.

Suppose you have 64 GPUs, in 8 nodes of 8. You pick three numbers: a tensor-parallel size (TP), a pipeline-parallel size (PP), and a data-parallel size (DP), with TP times PP times DP equal to the total. A common choice for this cluster is TP=8, PP=4, DP=2.

  • TP=8 says every layer is split across all 8 GPUs in a single node. Each GPU holds a sliver of every weight matrix for the layers assigned to its node.
  • PP=4 says the model is cut into 4 sequential stages, each stage lives on a different node, and stages send activations forward and gradients backward.
  • DP=2 says the entire 4-node pipeline is replicated, and the two replicas average gradients at the end of each step.

Now think about the communication. Tensor parallelism issues AllReduces at every layer of every microbatch, hundreds of times per step, and they need to land in microseconds. Those AllReduces ride NVLink and NVSwitch inside the node. Pipeline parallelism passes a single activation tensor between stages once per microbatch, then a gradient tensor back. Those messages are larger but rarer, and they ride InfiniBand or RoCE between nodes. Data parallelism only reduces gradients once per step, after every microbatch is done; the volume is the size of the model but the frequency is low, so it can ride the slowest tier of the network that crosses replicas. The cost of each form of communication maps onto a network tier that can pay it.

This is why TP=8 is so common. NVIDIA's DGX nodes have 8 GPUs and a single NVSwitch domain. Push tensor parallelism to TP=16 and you cross a node boundary, the AllReduces fall off NVLink, and per-GPU throughput collapses. The rule of thumb every production stack now follows is: keep TP at or below the number of GPUs sharing fast intra-node interconnect, and let PP and DP carry the cross-node work.
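The rule of thumb is easy to turn into a pre-flight check. This is a sketch of one, with a hypothetical check_mesh helper that flags a mesh whose axes do not multiply out to the cluster size or whose tensor-parallel groups would spill across a node boundary. It assumes ranks are packed onto nodes in order, so that a TP group of consecutive ranks stays on one node only when TP divides the GPUs per node.

```python
def check_mesh(world_size: int, gpus_per_node: int, tp: int, pp: int, dp: int) -> list:
    """Return a list of human-readable problems with a (TP, PP, DP) choice.

    An empty list means the mesh passes these two basic checks only.
    """
    problems = []
    if tp * pp * dp != world_size:
        problems.append(
            f"TP*PP*DP = {tp * pp * dp}, but the cluster has {world_size} GPUs"
        )
    if tp > gpus_per_node:
        problems.append(
            f"TP={tp} exceeds {gpus_per_node} GPUs per node, so TP traffic leaves the node"
        )
    elif gpus_per_node % tp != 0:
        problems.append(
            f"TP={tp} does not divide {gpus_per_node} GPUs per node, so some TP groups straddle nodes"
        )
    return problems


if __name__ == "__main__":
    print(check_mesh(64, 8, tp=8, pp=4, dp=2))    # the example mesh from the text
    print(check_mesh(64, 8, tp=16, pp=2, dp=2))   # crosses a node boundary
```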

Each axis defines its own process group. PyTorch's DeviceMesh and the older torch.distributed.new_group build these for you. With TP=2, PP=2, DP=2 across 8 ranks numbered 0 to 7, the groups Megatron sets up look like this: tensor-parallel groups {0,1}, {2,3}, {4,5}, {6,7}; data-parallel groups {0,2}, {1,3}, {4,6}, {5,7}; pipeline-parallel groups {0,4}, {1,5}, {2,6}, {3,7}. Each rank belongs to exactly one group on each axis, and the groups along any one axis are disjoint. The AllReduce for tensor parallelism happens inside its TP group only, never touching the others.

The other detail worth getting right is which dimension varies fastest in the rank numbering. In Megatron's convention the rank layout is (pipeline, data, tensor), with tensor as the fastest-varying index, so consecutive ranks share a TP group, then a DP group, then a PP stage. That layout exists for a reason. Consecutive ranks land on the same node, which is exactly where you want your tensor-parallel partners. Get the ordering wrong and you scatter TP groups across InfiniBand, and the run dies.
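The layout can be enumerated directly. The sketch below assumes the (pipeline, data, tensor) ordering with tensor varying fastest, and builds the rank lists for each axis in plain Python. The helper names rank_of and build_groups are illustrative, not Megatron's or PyTorch's API, and nothing here creates a real process group.

```python
from itertools import product


def rank_of(pp_idx: int, dp_idx: int, tp_idx: int, dp: int, tp: int) -> int:
    """Global rank for mesh coordinates, with tensor as the fastest-varying index."""
    return (pp_idx * dp + dp_idx) * tp + tp_idx


def build_groups(tp: int, pp: int, dp: int) -> dict:
    """Enumerate the rank lists along each axis of a (PP, DP, TP) mesh."""
    tp_groups = [
        [rank_of(p, d, t, dp, tp) for t in range(tp)]
        for p, d in product(range(pp), range(dp))
    ]
    dp_groups = [
        [rank_of(p, d, t, dp, tp) for d in range(dp)]
        for p, t in product(range(pp), range(tp))
    ]
    pp_groups = [
        [rank_of(p, d, t, dp, tp) for p in range(pp)]
        for d, t in product(range(dp), range(tp))
    ]
    return {"tp": tp_groups, "dp": dp_groups, "pp": pp_groups}


if __name__ == "__main__":
    groups = build_groups(tp=2, pp=2, dp=2)
    for axis, ranks in groups.items():
        print(axis, ranks)
    # tp [[0, 1], [2, 3], [4, 5], [6, 7]]
    # dp [[0, 2], [1, 3], [4, 6], [5, 7]]
    # pp [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The strides tell the story: TP partners are 1 rank apart, DP partners are TP ranks apart, and PP partners are TP times DP ranks apart, so the chattiest axis gets the closest neighbours.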

Configurations the labs actually used

The mesh becomes concrete the moment you look at what was shipped.

Megatron-LM 1 trillion parameter run. From the NVIDIA Megatron-LM paper, Narayanan et al. 2021. TP=8 inside each DGX A100, PP=64 across 64 nodes, DP=6 across replicas of the full pipeline. 3,072 A100 80GB GPUs in total. 502 petaFLOP per second sustained, 163 teraFLOP per GPU, 52 percent of peak. The model had 128 layers and a hidden size of 25,600. The interleaved schedule kept the pipeline bubble small enough that the run held its efficiency at scale. None of those numbers would land without all three axes doing their job.
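The headline numbers are mutually consistent, which is worth checking once by hand. This sketch recomputes them from the mesh shape and the aggregate throughput. The 312 teraFLOP per second peak for an A100 (dense FP16/BF16 on tensor cores) is an assumption brought in from NVIDIA's published specification, not a figure from the text above.

```python
# Assumed peak: A100 dense FP16/BF16 tensor-core throughput, in teraFLOP per second.
A100_PEAK_TFLOPS = 312.0


def mesh_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs in a 3D mesh."""
    return tp * pp * dp


def per_gpu_tflops(aggregate_pflops: float, gpus: int) -> float:
    """Convert aggregate petaFLOP/s into teraFLOP/s per GPU."""
    return aggregate_pflops * 1000.0 / gpus


def fraction_of_peak(tflops: float, peak: float = A100_PEAK_TFLOPS) -> float:
    """Achieved throughput as a fraction of the assumed hardware peak."""
    return tflops / peak


if __name__ == "__main__":
    gpus = mesh_size(tp=8, pp=64, dp=6)
    achieved = per_gpu_tflops(502, gpus)
    print(gpus, "GPUs")
    print(f"{achieved:.0f} teraFLOP/s per GPU")
    print(f"{fraction_of_peak(achieved):.0%} of peak")
```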

Megatron-Turing NLG 530B. A joint Microsoft and NVIDIA training run reported in Smith et al. 2022. TP=8 (one DGX A100 node), PP=35 (across 35 nodes), with data parallelism on top. Each model replica spanned 280 GPUs on NVIDIA's Selene supercomputer, which has 560 DGX A100 nodes (4,480 A100 80GB GPUs) on HDR InfiniBand in a full fat-tree. The fat-tree is what lets 35 pipeline stages sit on different nodes without inter-stage communication swallowing throughput. The paper reports scaling experiments at 280, 350, and 420 DGX A100 servers (DP=8, 10, 12), which reached 126, 121, and 113 teraFLOP per GPU respectively, a clean demonstration of how per-GPU efficiency degrades as the DP axis widens.
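The DP sizes follow from the server counts, and the per-GPU figures give a rough sense of the scaling loss. A short sketch, using only the numbers quoted above and assuming 8 GPUs per DGX A100 server; the function names are illustrative.

```python
GPUS_PER_SERVER = 8            # one DGX A100
GPUS_PER_REPLICA = 8 * 35      # TP=8 times PP=35

# Per-GPU throughput (teraFLOP/s) at each server count, as quoted in the text.
REPORTED_TFLOPS = {280: 126, 350: 121, 420: 113}


def data_parallel_size(servers: int) -> int:
    """Number of full model replicas that fit on the given number of servers."""
    gpus = servers * GPUS_PER_SERVER
    if gpus % GPUS_PER_REPLICA:
        raise ValueError(f"{gpus} GPUs is not a whole number of 280-GPU replicas")
    return gpus // GPUS_PER_REPLICA


def relative_efficiency(servers: int, baseline: int = 280) -> float:
    """Per-GPU throughput relative to the smallest reported run."""
    return REPORTED_TFLOPS[servers] / REPORTED_TFLOPS[baseline]


if __name__ == "__main__":
    for servers in sorted(REPORTED_TFLOPS):
        print(
            f"{servers} servers: DP={data_parallel_size(servers)}, "
            f"{relative_efficiency(servers):.0%} of the 280-server per-GPU rate"
        )
```

Going from DP=8 to DP=12 costs roughly a tenth of per-GPU throughput in these figures, while total throughput still rises because there are half again as many GPUs.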

PaLM 540B. Google's Chowdhery et al. 2022 ran on TPUs, not GPUs, and used the Pathways system to coordinate two full TPU v4 Pods, 6,144 chips in total (3,072 per pod). The shape is different because TPU pods give you a 3D torus interconnect rather than a node-and-fabric hierarchy, but the principle is identical. Inside each pod the model is sharded with 12-way model parallelism and 256-way fully sharded data parallelism over the torus. Between pods, two-way data parallelism rides the slow data-centre network. Each pod holds a full copy of the parameters and the client splits the batch in half. Google reports 46.2 percent model FLOPS utilisation and 57.8 percent hardware FLOPS utilisation. Same hierarchy, different hardware vocabulary.

Llama 3.1 405B. Meta's training stack uses TP=8, PP=16, and CP (context parallelism) on top, with FSDP for the data-parallel dimension across roughly 16,000 H100s. The configuration is documented in Meta's Llama 3 release notes and the accompanying paper. TP=8 still lives inside a node. PP=16 still crosses InfiniBand. Context parallelism is the new axis, added when sequence lengths push past 100,000 tokens. The underlying logic of "match each axis to a network tier" did not move. Meta reports about 400 teraFLOP per GPU at 8K sequence length, dropping to 380 at 128K.

The pattern repeats because the hardware does. NVLink is fast and short. InfiniBand is slower and longer. The hierarchy of interconnects forces a hierarchy of parallelism, and the only thing that varies between labs is how aggressive they get on each axis.

How DeepSpeed and Megatron divide the labour

Two stacks dominate production. Both end up doing 3D parallelism, but they come at it from different sides.

Megatron-LM, maintained by NVIDIA, implements TP and PP natively at the model code level. The transformer blocks are explicitly split, communication groups are wired in, and the trainer expects you to pass --tensor-model-parallel-size, --pipeline-model-parallel-size, and a data-parallel size derived from the GPU count. It is the canonical reference implementation. Most other libraries are compared against it.

DeepSpeed, maintained by Microsoft, was built around ZeRO, a different way of attacking the data-parallel axis. ZeRO shards optimizer states, gradients, and optionally parameters across the DP group, so each replica no longer has to hold the full model. ZeRO stages 1, 2, and 3 trade off memory savings against communication. The Megatron-DeepSpeed integration plugs Megatron's TP and PP into the model and lets DeepSpeed's ZeRO handle the DP slice. The result is the stack that trained MT-NLG 530B, and it remains the standard recipe when you want all three axes plus optimizer sharding.
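The memory trade-off between ZeRO stages can be sketched with the same 16-bytes-per-parameter accounting used for mixed-precision Adam. The model below is a simplification under that assumption: it counts only weights, gradients, and optimizer state, and ignores activations, buffers, and fragmentation. The function name is illustrative, not a DeepSpeed API.

```python
def zero_bytes_per_gpu(num_params: float, dp: int, stage: int) -> float:
    """Approximate per-GPU bytes of model state under ZeRO.

    stage 0: plain data parallelism, nothing sharded
    stage 1: optimizer state sharded across the DP group
    stage 2: gradients sharded as well
    stage 3: parameters sharded as well
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params      # fp16 weights
    grads = 2.0 * num_params        # fp16 gradients
    optim = 12.0 * num_params       # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= dp
    if stage >= 2:
        grads /= dp
    if stage >= 3:
        weights /= dp
    return weights + grads + optim


if __name__ == "__main__":
    # Hypothetical example: a 7.5B-parameter model with a 64-way DP group.
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, dp=64, stage=stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The savings are not free: each extra stage shards something that other ranks will need, so the later stages pay for the memory with additional collective traffic on the DP axis.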

PyTorch's native path is newer and cleaner. DeviceMesh, parallelize_module, and PipelineStage let you compose the three axes from primitives. TorchTitan, Meta's reference trainer, is built this way. Colossal-AI sits on top with a HybridParallelPlugin that wraps the whole thing for arbitrary Hugging Face models.

The choice between them is mostly about how much control you want and how exotic your model is. Pick Megatron-DeepSpeed if you are training a standard decoder-only transformer at scale and want known-good throughput. Pick the PyTorch native stack if you want to compose your own thing.

Future and impact: the mesh is getting another axis

Three dimensions stopped being enough about a year ago. Context parallelism, which splits the sequence dimension across GPUs, is now standard for long-context training. Expert parallelism, which scatters the experts of a mixture-of-experts model across GPUs, is required to train anything DeepSeek-V3 sized. Some recent papers refer to "5D parallelism" or simply hybrid parallelism, and the diagram on paper genuinely has five orthogonal axes.

The underlying logic has not changed. Each axis has a characteristic communication pattern. Each pattern has a frequency and a volume. Each frequency-and-volume pair maps onto a level of the network hierarchy where it is cheapest. The hierarchy now includes NVLink inside a node, NVLink Switch across an NVL72 rack on Blackwell, InfiniBand across the building, and ethernet between sites. Adding axes is mostly adding rungs to the ladder.

Two trends are worth watching. The first is that the rack is becoming the new node. NVIDIA's GB200 NVL72 puts 72 GPUs on a shared high-bandwidth fabric, which is exactly the regime where TP can usefully grow past 8. Expect TP=16 or TP=32 to become normal in 2026, with the corresponding shrink in PP. The second is that the compiler is taking over. JAX's GSPMD, PyTorch's DTensor and automatic-parallelism work, and Google's Pathways are all converging on the idea that you should describe what you want sharded and the compiler should pick the mesh. The Megatron-style explicit configuration is not gone, but it is getting an autopilot.

The reason any of this matters for a smart practitioner is unromantic. Frontier models cost on the order of a hundred million dollars to train. The difference between 30 percent and 55 percent of peak throughput is the difference between $100M and $55M for the same run. 3D parallelism is the largest single lever for that efficiency. It is the reason a trillion-parameter model is a budget item rather than a fantasy.
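The cost claim is a one-line proportion. This sketch assumes the run needs a fixed number of floating-point operations and that GPU-hours are billed at a fixed rate, so cost scales inversely with utilisation. Both are simplifications, and the function name is illustrative.

```python
def cost_at_utilisation(baseline_cost: float, baseline_util: float, new_util: float) -> float:
    """Cost of the same run at a different fraction of peak throughput.

    Assumes fixed total work and a fixed price per GPU-hour,
    so cost is inversely proportional to utilisation.
    """
    if not (0 < baseline_util <= 1 and 0 < new_util <= 1):
        raise ValueError("utilisation must be in (0, 1]")
    return baseline_cost * baseline_util / new_util


if __name__ == "__main__":
    improved = cost_at_utilisation(100e6, baseline_util=0.30, new_util=0.55)
    print(f"${improved / 1e6:.1f}M at 55 percent of peak")
```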

Council summary

3D parallelism is not a clever optimisation, it is what the hardware forces. Tensor parallel goes on NVLink, pipeline parallel goes on InfiniBand, data parallel goes on whatever crosses the cluster, and the mesh exists because each axis has a communication frequency and volume that only one tier of the network can pay. The shipped configurations make the rule concrete: TP=8, PP=64, DP=6 on 3,072 A100s for the Megatron 1T paper; TP=8, PP=35, DP up to 12 in the reported scaling runs for MT-NLG 530B on Selene; two TPU v4 pods with 12-way model parallelism and 256-way FSDP inside each for PaLM 540B; TP=8, PP=16, with FSDP and context parallelism for Llama 3.1 405B. The pattern is fixed even as the rack replaces the node and the compiler starts picking meshes for you. Understand the mesh and you understand why the trillion-parameter cost line is a number a CFO can sign rather than a fantasy.
