A 70B parameter language model has a feed-forward layer that, on its own, will not fit on a single GPU. Not the activations. Not the gradients. The weight matrix itself, just the parameters, will not fit. At that point, sharding the batch across GPUs no longer helps you. Sharding the model state across GPUs (ZeRO, FSDP) no longer helps you either, because the layer needs to be reconstructed before you can multiply by it, and the reconstruction does not fit.
You have to split the layer.
Tensor parallelism is the technique for doing exactly that. It cuts a single matrix multiplication across several GPUs, runs the halves in parallel, and stitches the result back with a collective. It is the only one of the three classical parallelism dimensions that lives strictly inside a single linear algebra operation, and it is the most expensive of the three to communicate. This post is about how it works, where the all-reduces sit, why you almost never run it across a network cable, and what sequence parallelism adds on top.
This is the third post in our distributed training series. Earlier pieces cover moving from one GPU to many, data parallelism and ring all-reduce, and ZeRO and FSDP sharding. The next post stitches all three dimensions together into a 3D parallel mesh.
Origin: a matrix multiply you can cut two ways
The mathematics behind tensor parallelism is the part that takes about five minutes to see and the rest of your career to use well. A matrix multiply Y equals XA, with X of shape (batch by sequence, hidden_in) and A of shape (hidden_in, hidden_out), can be partitioned in either of two orientations.
Cut A along its output columns. The first GPU holds the left half of A, the second holds the right half. Each GPU multiplies the full input X by its own column slab and produces half of Y, side by side. No communication is needed before the multiply, because both GPUs already have X. The outputs concatenate to form Y. This is column-parallel partitioning.
Cut A along its input rows instead. The first GPU holds the top rows of A, the second holds the bottom rows. The input X must be cut to match, with the first GPU getting the left columns of X and the second getting the right columns. Each GPU multiplies its slice of X by its slice of A and produces a partial sum. To recover Y you add the partial sums together with an all-reduce. This is row-parallel partitioning.
That is the whole game. Column-parallel needs no communication before the multiply but produces output that is sharded along the feature dimension. Row-parallel needs input that is already sharded along the feature dimension and produces a result that needs to be summed across GPUs. The trick that makes the technique cheap, when it is cheap, is to chain a column-parallel layer directly into a row-parallel layer. The sharded output of the first becomes the sharded input of the second, and the two communication steps you would have needed (one to concatenate, one to gather inputs) collapse into a single all-reduce after the row-parallel multiply.
This is the insight Mohammad Shoeybi and colleagues at NVIDIA wrote up in 2019, in the paper that introduced Megatron-LM. They trained an 8.3 billion parameter GPT-2 variant across 512 GPUs and showed 76% scaling efficiency, which at the time was startling for a model of that size. The implementation, they noted, fit inside native PyTorch with the insertion of a handful of communication primitives.
How a transformer layer actually splits
A standard transformer block has two sublayers that swallow most of the parameters: the multi-head attention and the feed-forward MLP. Megatron-LM partitions both with the same trick.
Inside attention, the three projections that compute queries, keys, and values from the hidden state are column-parallel. With tensor parallel degree N, each GPU computes the Q, K, and V slices for a subset of the attention heads. A 32-head model on a degree-4 tensor parallel group puts 8 heads on each GPU. The dot-product attention itself (softmax over scores, weighting by V) runs entirely locally on those heads, because each head was always an independent computation. The only thing the heads share is the output projection that mixes them back together, and that projection is row-parallel. Its input is the head outputs (already sharded across GPUs by the column-parallel QKV), so each GPU multiplies its share of the heads by its row slab of the output weights, and a single all-reduce sums the partials. One all-reduce per attention block.
The MLP follows the same pattern. The first linear projects the hidden dimension up to (typically) four times its size and feeds it to a nonlinearity (GeLU in the original Megatron, SwiGLU in modern models). That first matrix is column-parallel, so each GPU computes a slab of the expanded representation. The nonlinearity is elementwise and runs locally. The second linear projects the expanded representation back to the hidden dimension, and it is row-parallel, taking the column-sharded activations as input and summing the partial outputs with an all-reduce. One all-reduce per MLP.
Two all-reduces per transformer layer in the forward pass. Two more in the backward pass, in mirror image, where the all-reduces sit on the column-parallel boundaries instead of the row-parallel ones. This is what people mean when they say the f and g operators in the Megatron diagrams: f is an identity in the forward and an all-reduce in the backward, g is the reverse. Each operator marks the place where the GPUs in the tensor parallel group have to talk to each other.
For a 36-layer model that is 72 all-reduces in forward and 72 more in backward per step. Every one of them blocks. The forward pass cannot continue until the partial sums have been combined. The backward pass cannot continue until the gradients have been combined. There is no overlap with compute the way there is in data parallelism, because the compute being overlapped does not exist yet. It is waiting on the collective.
Why tensor parallelism lives inside the node
That blocking is why tensor parallelism is the most bandwidth-hungry of the three classical parallelism dimensions, and why it almost never crosses a node boundary in practice. The all-reduce at the end of each row-parallel linear moves a tensor whose size is the activation, not the parameter, so the absolute bytes per collective are smaller than what FSDP sends. But the collective happens twice per layer rather than once per step, and it sits in the critical path of every forward and backward pass.
The throughput you get is therefore close to a direct function of how fast the GPUs in the tensor parallel group can talk to each other. On an NVIDIA NVLink interconnect, eight H100s in an SXM tray have roughly 900 GB/s of bidirectional bandwidth per GPU. That is more than ten times what PCIe Gen5 gives you on the same machine, and more than 30 times what 400 Gb/s InfiniBand gives you between nodes. The all-reduces that take a millisecond on NVLink take ten on PCIe and a hundred on the network. The model stalls.
The Conscious Engines tensor parallel writeup (tp-deepdive.md) measures this directly on a desktop-class box with PCIe-linked 4090s and finds that the all-reduce time dominates the step. Their FSDP baseline runs about 43 seconds per step on a 4B model. A tensor parallel run with sequence parallelism layered on top runs the same model in around 7 seconds. The reason is not that tensor parallelism is faster; it is that on PCIe the FSDP weight gathers are even slower than the tensor parallel activation reductions, and the smaller of two bad numbers wins.
On NVLink the comparison reverses for small clusters, because both techniques fit comfortably under the interconnect's budget and FSDP gets to overlap its communication with compute. The practical rule, baked into the way nearly every published frontier-model configuration is built, is that tensor parallelism stays inside one node. NVIDIA's own scaling analysis in the Megatron tensor and pipeline parallel paper is explicit about the limit: above eight-way TP, the cross-node all-reduces start to dominate even on InfiniBand.
Sequence parallelism: the other half of the activations
The Megatron design from 2019 leaves one piece of the layer unsharded. The hidden activations between the attention and the MLP, and the layer norms and dropout in between, are replicated on every GPU in the tensor parallel group. For modest context lengths this is fine. For long-context training the replicated activations get expensive fast. At 128k tokens with an 8192-dim hidden state in bfloat16, that buffer is about 2 GB, and with a degree-8 tensor parallel group it sits on all eight GPUs.
Vijay Korthikanti and colleagues fixed this in 2022 with sequence parallelism, which shards the activations along the sequence dimension during the parts of the layer (layer norms, dropout, residual adds) that are not themselves tensor parallel. The choreography is tight. Inside a tensor parallel region the activations live in their column-parallel form. Just before the region ends, instead of doing a single all-reduce to gather them, Megatron replaces that all-reduce with a reduce-scatter that hands each GPU one slice of the sequence. The layer norm runs on the slice. When the next tensor parallel region begins, an all-gather reconstructs the full sequence on each GPU.
The communication accounting is the part that surprised people. An all-reduce, decomposed, is exactly one reduce-scatter plus one all-gather. So replacing the all-reduce with a reduce-scatter and the next region's input replication with an all-gather costs zero extra bytes. You get sequence-sharded activations everywhere outside the tensor parallel attention and MLP for free, and the activation memory drops by the tensor parallel degree.
The paper reports a roughly 5x reduction in activation memory and a more than 90% reduction in recomputation overhead, with which they trained a 530B parameter GPT variant on 2240 A100s at 54.2% model FLOPs utilization. PyTorch ships the same idea under SequenceParallel, defined in the torch.distributed.tensor.parallel package, where it composes with ColwiseParallel and RowwiseParallel to mirror the Megatron-LM partitioning (PyTorch docs).
Present: where it ships
NVIDIA's Megatron-LM repository and its Megatron-Core library remain the reference implementation. Inside Megatron-Core the partitioning shows up as ColumnParallelLinear and RowParallelLinear, the f and g operators are explicit modules, and sequence parallelism is a flag you flip on the transformer config. The same pattern reappears in NeMo, DeepSpeed, and in HuggingFace's text generation inference stack, which uses tensor parallelism specifically to serve large models faster by partitioning the weights across the GPUs in one server.
PyTorch's native tensor parallel API exposes the building blocks at a finer grain. parallelize_module takes a model and a dictionary that maps submodule names to parallelization styles (ColwiseParallel, RowwiseParallel, SequenceParallel, and a small set of input-output preparation helpers). The DTensor abstraction underneath tracks which dimension of each tensor is sharded and inserts the right collective when the dimension has to change. This is what 3D parallelism configurations are built on at the layer level.
Most production frontier models run tensor parallel degrees of 2, 4, or 8. PaLM trained with TP 12 inside a 12-chip TPU node. Megatron-Turing NLG 530B ran TP 8 across the eight A100s in each DGX. Llama 3 405B's published configuration uses TP 8 inside the H100 server. The number is set by what fits on an NVLink island, not by what would be ideal for the model.
Future and impact: when TP earns its keep, and when it does not
Tensor parallelism is worth its cost in three situations. The first is when a single layer's weight matrix will not fit on one GPU and you cannot reduce the model dimension. Mixture-of-experts can avoid this by routing through smaller experts, but dense transformers cannot. The second is when the attention head count or head dimension is so high that even with FSDP the per-GPU shard of one layer plus its activations does not fit. The third is at inference time, where there is no batch to shard and tensor parallelism is essentially the only way to serve a model larger than one GPU with low latency.
Where tensor parallelism is losing ground, slowly, is the middle of the training cluster. Pipeline parallelism is getting better at hiding the bubble (the zero-bubble schedules in recent work close most of the gap), FSDP gets better year over year at overlapping weight gathers with compute, and context parallelism handles sequence sharding more elegantly than the bolt-on sequence parallel layer. For models in the 7B to 70B range, a clean FSDP plus pipeline parallel configuration without tensor parallelism is often competitive on H100 clusters where NVLink is plentiful.
For models above 100B parameters, or any inference deployment that needs sub-second latency on a model bigger than 80 GB, tensor parallelism remains the only technique that works. It is the most expensive collective per layer, and the one that forces you to buy a particular shape of hardware (a node with a fast intra-server interconnect), but inside that constraint it does something nothing else does. It lets one matrix multiply be eight GPUs wide.
Council summary
Tensor parallelism is the parallelism dimension that lives inside the matrix multiply. Megatron-LM's chain of a column-parallel projection into a row-parallel projection collapses the math to one all-reduce per attention block and one per MLP, four collectives per layer per step counting the backward pass, all of them blocking. That makes activation bandwidth, not parameter bandwidth, the binding constraint, which is why production setups keep the tensor parallel group inside a single NVLink island and almost never above degree eight. Sequence parallelism plugs the last leak, sharding the layer-norm and dropout activations along the sequence dimension at zero extra communication. For dense models above 100B parameters and for low-latency inference on any model larger than one GPU, tensor parallelism is still the only technique that works.
Comments