
Pipeline Parallelism and the Microbatch Bubble

When sharded weights still will not fit on one GPU, you slice the model into stages and pass activations down a line. The price is a triangle of idle time.

Imagine a single training step on a model so deep that no parameter-sharding trick makes it fit on one GPU. Forward sweeps through every layer. Backward sweeps back through every layer. The arithmetic does not care which device the layers sit on, but the wallclock does, and so does the cost. Pipeline parallelism is the answer when the model is too tall to live inside one accelerator's memory, even after a careful FSDP shard. You cut the network into horizontal slabs, give each slab to a different GPU, and stream activations down the chain. The idea is older than transformers. The pain is newer, and it has a name: the bubble. This post explains the trade-off, the math behind it, and why the schedule you choose still matters in 2026.

This is part of a seven post run on distributed training. Earlier entries covered data parallel and fully sharded approaches and how tensor parallelism cuts across a single layer. Pipeline parallelism is the third axis, the one you reach for when depth is itself the problem.

Origin: Google needed a way to train a 6 billion parameter translator

Pipeline parallelism the way most engineers now use it traces to a 2018 paper out of Google Brain called GPipe. The authors, Huang and colleagues, were pushing a multilingual translation model past anything that fit on the accelerators of the day. Their answer was a library that took any model expressible as a sequence of layers, partitioned it across a row of devices, and shuttled activations between them. With it they trained a 557 million parameter AmoebaNet on ImageNet and a 128 layer, 6 billion parameter Transformer that translated more than 100 languages at once. Both were oversized for a single accelerator at the time.

The mechanism is simple to picture if you stop thinking about training and just watch the forward pass. Layers 1 to 32 live on GPU 0. Layers 33 to 64 live on GPU 1. The mini batch starts on GPU 0, which produces activations and posts them to GPU 1, which produces its own activations and so on. On the way back, gradients flow in reverse. Between devices the only traffic is a point to point send and a matching receive, not the all reduce collective that data parallel training spends most of its life waiting on.
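As a minimal sketch of that partitioning, the snippet below splits a list of layers into contiguous slabs and runs a forward pass stage by stage. The names partition_layers, run_stage, and pipeline_forward are hypothetical, the layers are plain Python callables standing in for real modules, and the hop between stages stands in for the point to point send and receive between GPUs.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def partition_layers(layers: Sequence[Layer], num_stages: int) -> List[List[Layer]]:
    """Split layers into contiguous slabs, as evenly as possible."""
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(list(layers[start:start + size]))
        start += size
    return stages


def run_stage(stage: List[Layer], x: float) -> float:
    """Run every layer that lives on one device."""
    for layer in stage:
        x = layer(x)
    return x


def pipeline_forward(stages: List[List[Layer]], x: float) -> float:
    """Forward pass: each loop iteration is one device, each hop one send/recv."""
    for stage in stages:
        x = run_stage(stage, x)
    return x


# Toy model: 64 "layers" that each add a constant (illustrative only).
layers = [lambda v, k=k: v + k for k in range(64)]
stages = partition_layers(layers, num_stages=2)
output = pipeline_forward(stages, 0.0)
```

With 64 layers and 2 stages, each slab holds 32 layers, matching the split described above. Nothing here overlaps work between devices yet, which is exactly the problem the next paragraphs describe.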

The catch is what every introduction to GPipe shows next: without further tricks, each device sits idle for most of the step. Stage 0 finishes its slice and then waits while stages 1 through p - 1 propagate the activations forward. The last stage does its work, computes the loss, and starts the backward pass. The whole chain must drain, then refill in reverse. With two stages, each GPU spends half the wallclock staring at the wall. With eight, each spends seven eighths of it idle. The authors called this idle space the bubble, and pipeline parallelism since has been a slow conversation about how to shrink the bubble without breaking the math.

Origin: microbatching turned the bubble from a tax into a knob

GPipe's central idea is that the input batch does not have to flow through the pipeline as one unit. Split it into m smaller microbatches. As soon as stage 0 finishes microbatch 1, it can start microbatch 2 while stage 1 starts on microbatch 1. The pipeline fills incrementally, then runs at full utilization through the middle of the step, then drains in reverse. The bubble is still there at the edges, but it shrinks as a fraction of the total time as m grows.

The numbers behind this are the formula every paper in the area shares. For p pipeline stages and m microbatches per optimizer step, the share of step time spent in the bubble is:

bubble fraction = (p - 1) / (m + p - 1)

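The Megatron-LM 1 trillion parameter paper states the same quantity as a ratio of bubble time to ideal compute time, (p - 1) / m. Dividing by total step time instead gives the form above, and both follow from the construction GPipe used. Plug in some realistic settings to feel it.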

  • 2 stages, 8 microbatches: 1 / 9, about 11 percent idle.
  • 4 stages, 16 microbatches: 3 / 19, about 16 percent.
  • 8 stages, 32 microbatches: 7 / 39, about 18 percent.
  • 16 stages, 64 microbatches: 15 / 79, about 19 percent.

Two things drop out of staring at this. The first is that increasing p hurts you: every setting above keeps m at four times p, yet the idle share still creeps up toward 20 percent. The second is that increasing m helps you, but with diminishing returns. Once m is much larger than p, the bubble approaches zero. When m only equals p, about half the step is bubble no matter how large both get. Real training runs usually choose m well above p for exactly this reason.
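The arithmetic above is easy to check with exact fractions. The helper below is a small sketch (the name bubble_fraction is hypothetical) that evaluates the formula for the four settings listed and for the no-microbatching case m = 1.

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int) -> Fraction:
    """Share of step time spent idle: (p - 1) / (m + p - 1)."""
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive")
    return Fraction(p - 1, m + p - 1)


settings = [(2, 8), (4, 16), (8, 32), (16, 64)]
table = {(p, m): bubble_fraction(p, m) for p, m in settings}

for (p, m), frac in table.items():
    print(f"p={p:2d} m={m:2d} bubble={frac} ({float(frac):.1%})")

# With m = 1 the formula collapses to (p - 1) / p: no microbatching at all.
no_microbatching = bubble_fraction(2, 1)
```

Holding the ratio m / p fixed at 4, the fraction tends to 1/5 as p grows, which is why the listed percentages climb toward 20 without ever reaching it.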

GPipe pays for its m with memory. To compute the backward of microbatch 1, the forward activations of microbatch 1 must still be in memory when its backward starts, and that does not happen until every other microbatch has marched through the rest of the pipeline. The peak activation footprint grows roughly linearly in m. In practice this caps how aggressively you can shrink the bubble before you run out of GPU memory, especially on long contexts where activations are huge.

Origin: PipeDream and the asynchronous heresy

In June 2018, a Microsoft Research and Carnegie Mellon team led by Harlap and Narayanan published PipeDream. They were not the only group thinking about pipelining a deep model, but their paper introduced two ideas that the field never let go of. The first was the schedule now known as 1F1B, for one forward, one backward. Instead of letting all forwards run before any backward starts, PipeDream interleaved them: once a stage finished a forward pass for some microbatch, it switched to the backward pass for an older microbatch whose activations had returned from the deeper stages. The pipeline still had a fill and drain phase, but in steady state every device alternated forward and backward microbatches.

The second idea was harder to sell. To keep 1F1B running without idle time, PipeDream allowed different in-flight microbatches to see different weight versions inside the same optimizer step. Weight stashing kept each microbatch self-consistent (the backward used the same weights as the forward), but the global step no longer matched a single synchronous update. PipeDream's authors argued the staleness rarely hurt accuracy, and reported up to five times faster time to target accuracy than data parallel training on the same hardware, plus a 95 percent cut in communication for large models.
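Weight stashing is easier to see on a toy stage than in prose. The sketch below is an illustration only, not PipeDream's implementation: the class StashingStage is hypothetical, the "model" is a single scalar weight computing y = w * x, and the update is plain SGD applied immediately after each backward.

```python
class StashingStage:
    """One pipeline stage with a scalar weight and per-microbatch stashing."""

    def __init__(self, weight: float, lr: float) -> None:
        self.weight = weight
        self.lr = lr
        self.stash = {}  # microbatch id -> (weight used in forward, input)

    def forward(self, mb_id: int, x: float) -> float:
        # Remember which weight version this microbatch saw.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id: int, grad_out: float) -> float:
        # Use the stashed weight, not the current one, so forward and
        # backward of the same microbatch stay consistent.
        w_used, x = self.stash.pop(mb_id)
        grad_w = grad_out * x
        grad_in = grad_out * w_used
        self.weight -= self.lr * grad_w  # asynchronous update, no flush
        return grad_in


stage = StashingStage(weight=2.0, lr=0.5)
stage.forward(0, 1.0)                    # microbatch 0 sees w = 2.0
stage.forward(1, 1.0)                    # microbatch 1 also sees w = 2.0
stage.backward(0, 1.0)                   # update: w becomes 1.5
out_mb2 = stage.forward(2, 1.0)          # microbatch 2 sees w = 1.5
weight_before_mb1_backward = stage.weight
grad_in_mb1 = stage.backward(1, 1.0)     # still uses the stashed w = 2.0
```

Microbatch 1's backward uses the weight from its own forward even though the live weight has moved on, which is the self-consistency described above. Microbatches 1 and 2 saw different weight versions in flight at the same time, which is the staleness.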

The trade-off was contentious. Mainstream training infrastructure prefers exact data parallel semantics, where the gradient is the gradient you would have gotten with one giant batch on one device. The same group returned at ICML 2021 with PipeDream-2BW, where 2BW stands for double-buffered weight updates. By keeping just two weight versions per stage, they cut the memory cost of stashing while still running a 1F1B schedule, with gradients computed on weights that lag by a constant one update. The same paper describes PipeDream-Flush, a variant that flushes the pipeline periodically and so recovers fully synchronous semantics. The paper reports accelerating GPT and BERT pretraining by up to 20 times against tuned baselines.

Present: the schedule almost everyone uses

The version of pipeline parallelism that dominates production training in 2026 is synchronous 1F1B, sometimes with an interleaved variant on top. The synchronous bit means every microbatch in a step sees the same weights. Plain 1F1B does not shrink the bubble (the formula above still applies), but it holds activation memory only as long as it has to.

The activation argument is the practical reason 1F1B replaced GPipe-style fill-drain almost everywhere. With GPipe, every microbatch's activations sit in memory until the global backward begins. With 1F1B, a microbatch's backward starts as soon as the deepest stage has finished its forward, freeing those activations early. Steady-state memory is roughly proportional to the number of stages, not the number of microbatches. That detail is what lets you push m up to 64, 128, or higher to shrink the bubble.
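The memory claim can be checked with a small slot-based simulation. This is a sketch under simplifying assumptions: every forward and every backward takes exactly one time slot (real backward passes are usually slower than forwards), communication is free, and the function names are hypothetical. It counts how many microbatches each stage holds activations for at its peak.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, int]  # ("F" or "B", microbatch index)


def one_f_one_b_order(stage: int, p: int, m: int) -> List[Op]:
    """Per-stage order: warmup forwards, alternating F/B, cooldown backwards."""
    warmup = min(p - 1 - stage, m)
    ops: List[Op] = [("F", i) for i in range(warmup)]
    next_f, next_b = warmup, 0
    while next_f < m:
        ops.append(("F", next_f))
        next_f += 1
        ops.append(("B", next_b))
        next_b += 1
    while next_b < m:
        ops.append(("B", next_b))
        next_b += 1
    return ops


def simulate_1f1b(p: int, m: int) -> Tuple[int, List[int]]:
    """Return (total time slots, peak in-flight microbatches per stage)."""
    orders = [one_f_one_b_order(s, p, m) for s in range(p)]
    cursor = [0] * p
    finished: Dict[Tuple[str, int, int], int] = {}
    in_flight, peak = [0] * p, [0] * p
    t = 0
    while any(cursor[s] < len(orders[s]) for s in range(p)):
        for s in range(p):
            if cursor[s] == len(orders[s]):
                continue
            kind, mb = orders[s][cursor[s]]
            if kind == "F":
                dep = ("F", mb, s - 1) if s > 0 else None
            else:
                dep = ("B", mb, s + 1) if s < p - 1 else ("F", mb, s)
            if dep is not None and finished.get(dep, t + 1) > t:
                continue  # input not here yet: this slot is bubble
            finished[(kind, mb, s)] = t + 1
            cursor[s] += 1
            in_flight[s] += 1 if kind == "F" else -1
            peak[s] = max(peak[s], in_flight[s])
        t += 1
    return t, peak


total_slots, peak_in_flight = simulate_1f1b(p=3, m=4)
```

For 3 stages and 4 microbatches, stage 0 never holds more than 3 microbatches of activations, where a fill-drain schedule would hold all 4. The total of 12 slots equals 2 (m + p - 1), so under these assumptions the bubble is the same as GPipe's and only the memory profile changes.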

NVIDIA's Megatron-LM, Microsoft's DeepSpeed pipeline module, HuggingFace's nanotron, Colossal-AI, NVIDIA NeMo, and PyTorch's own torch.distributed.pipelining all implement 1F1B as a standard schedule. PyTorch's API is the cleanest way to read the abstraction. You wrap your per rank slice of the model in a PipelineStage, hand the stage to a Schedule1F1B (or a list of stages to a ScheduleInterleaved1F1B), and call step() on the input batch. The framework slices the batch into microbatches and runs the sends and receives between adjacent ranks. The same module exposes ScheduleGPipe and a Zero Bubble variant for comparison.

Present: the interleaved trick that shrinks the bubble again

The Megatron-LM 1T paper's main contribution is not 1F1B itself but the interleaved schedule, sometimes called virtual pipeline parallelism. The idea is that each physical rank can hold more than one non-contiguous slice of the model. Take an 80 layer model: if you have 8 GPUs and virtual size 2, then rank 0 holds layers 1 to 5 and layers 41 to 45, rank 1 holds 6 to 10 and 46 to 50, and so on. From the schedule's perspective there are now 16 virtual stages, not 8.

The bubble formula becomes:

bubble fraction = (p - 1) / (v m + p - 1)

where v is the virtual pipeline size, the number of model slices each rank holds. For 8 ranks, v of 2, and 32 microbatches, the bubble drops from 7/39 (around 18 percent) to 7/71 (around 10 percent). The cost is that each microbatch now crosses between devices v times as often, since a rank's slices are separated by other ranks' work. On the right interconnect, this is a clear win. Megatron-LM's authors reported 502 petaFLOP/s on 3072 GPUs training a 1 trillion parameter model at 52 percent of theoretical peak, with the interleaved schedule contributing more than 10 percent to throughput over plain 1F1B.
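Both the layer layout and the revised formula can be sketched in a few lines. The function names are hypothetical, and the 80 layer model is an assumption chosen so that 16 virtual stages of 5 layers each reproduce the layer ranges used in the example. Virtual stages are dealt to ranks round-robin.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def interleaved_layer_assignment(
    num_layers: int, num_ranks: int, v: int
) -> Dict[int, List[Tuple[int, int]]]:
    """Map each rank to its (first_layer, last_layer) slices, 1-based."""
    num_virtual = num_ranks * v
    if num_layers % num_virtual != 0:
        raise ValueError("layers must divide evenly across virtual stages")
    per_slice = num_layers // num_virtual
    assignment: Dict[int, List[Tuple[int, int]]] = {r: [] for r in range(num_ranks)}
    for virtual_stage in range(num_virtual):
        rank = virtual_stage % num_ranks  # round-robin over physical ranks
        first = virtual_stage * per_slice + 1
        assignment[rank].append((first, first + per_slice - 1))
    return assignment


def interleaved_bubble_fraction(p: int, m: int, v: int) -> Fraction:
    """Share of step time spent idle: (p - 1) / (v * m + p - 1)."""
    return Fraction(p - 1, v * m + p - 1)


layout = interleaved_layer_assignment(num_layers=80, num_ranks=8, v=2)
plain = interleaved_bubble_fraction(p=8, m=32, v=1)
interleaved = interleaved_bubble_fraction(p=8, m=32, v=2)
```

Setting v to 1 recovers the plain formula, so the two fractions can be compared directly for any candidate configuration before committing cluster time to it.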

Present: a late 2023 paper that claims zero bubble

In late 2023, Penghui Qi and colleagues at Sea AI Lab published Zero Bubble Pipeline Parallelism, later presented at ICLR 2024. By the authors' account it is the first synchronous schedule to drive the bubble all the way to zero. Their move is to split the backward pass into two pieces. The B pass computes the gradient of the loss with respect to the inputs of a layer, which is what the previous stage needs to keep going backward. The W pass computes the gradient of the loss with respect to the weights of that layer, which only the local optimizer step needs. By delaying W and letting B race down the pipeline, the schedule can fill what would otherwise be idle slots with W work. The authors report up to 23 percent more throughput under matched memory and 31 percent with relaxed memory, all without changing the gradient semantics. Their open-source implementation is built on Megatron-LM, and zero bubble variants are starting to show up in mainstream frameworks.
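The B and W split is concrete for a single linear layer y = W x. The sketch below uses plain Python lists and hypothetical function names; it is not the paper's scheduler, only the two halves of the backward pass with the W half parked in a queue until a slot is free.

```python
from collections import deque
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def backward_input(weight: Matrix, grad_out: Vector) -> Vector:
    """B pass: dL/dx = W^T @ dL/dy. The previous stage is waiting on this."""
    n_out, n_in = len(weight), len(weight[0])
    return [sum(weight[o][i] * grad_out[o] for o in range(n_out)) for i in range(n_in)]


def backward_weight(x: Vector, grad_out: Vector) -> Matrix:
    """W pass: dL/dW = outer(dL/dy, x). Only the local optimizer needs this."""
    return [[g * xi for xi in x] for g in grad_out]


weight = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
grad_out = [1.0, 0.5]

deferred_w = deque()
grad_in = backward_input(weight, grad_out)   # B: computed and sent upstream now
deferred_w.append((x, grad_out))             # W: parked for an idle slot

# ... later, when this stage would otherwise sit idle ...
grad_w = backward_weight(*deferred_w.popleft())
```

The two functions share no intermediate results beyond x and grad_out, which is what makes W free to move. The price is that x and grad_out must stay in memory until the deferred W task runs.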

Drawing the schedule on paper

The clearest way to internalize pipeline parallelism is to draw it on a grid. Each row is a GPU, each column is a unit of time. With p stages and m microbatches, you mark a forward of microbatch i on stage j at the column where it runs. For GPipe, all forwards form a parallelogram leaning right, then all backwards form a parallelogram leaning the other way. With stage 0 on the top row, the empty cells are the bubble: a triangle at the bottom left while the pipeline fills, a wedge at the top middle while early stages wait for gradients to come back, and a triangle at the bottom right while the backward pass drains. For 1F1B, the parallelogram is replaced by a tighter band of interleaved forwards and backwards in steady state. For interleaved 1F1B, the band is denser because each rank handles several virtual stages. For zero bubble, the W tasks slide into the corners and fill the triangles. Draw the four schedules and you understand pipeline parallelism better than most infrastructure teams who deploy it.
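If paper is not handy, the GPipe grid is short enough to generate. This sketch assumes forwards and backwards each take one column and ignores communication. The function name gpipe_grid is hypothetical. F marks a forward, B a backward, and a dot an idle slot.

```python
from typing import List


def gpipe_grid(p: int, m: int) -> List[str]:
    """One string per stage (stage 0 first), one character per time slot."""
    forward_span = m + p - 1          # slots until the last forward finishes
    total = 2 * forward_span
    rows = []
    for stage in range(p):
        row = ["."] * total
        for mb in range(m):
            row[stage + mb] = "F"
            # Backward starts on the last stage and walks back toward stage 0.
            row[forward_span + (p - 1 - stage) + mb] = "B"
        rows.append("".join(row))
    return rows


grid = gpipe_grid(p=3, m=4)
for stage, row in enumerate(grid):
    print(f"stage {stage}: {row}")

idle = sum(row.count(".") for row in grid)
cells = sum(len(row) for row in grid)
```

For 3 stages and 4 microbatches, 12 of the 36 cells are idle, which matches (p - 1) / (m + p - 1) = 2/6. Extending the sketch to 1F1B or zero bubble means changing only the rule that places each F and B in a column.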

Future and impact: when pipeline parallelism is actually the right tool

The honest answer for most teams is: not as often as you would think. For models under roughly 30 billion parameters, plain data parallel plus FSDP is usually enough, and the operational simplicity is worth more than the last 10 percent of throughput. Between 30 billion and around 70 billion, FSDP plus tensor parallelism on a single node, or two nodes with NVLink between them, gets the job done without paying the bubble tax. Pipeline parallelism becomes structurally unavoidable at around 100 billion parameters and above, or when context lengths are extreme enough that activation memory blows past what FSDP alone can absorb. At that scale, the model simply will not fit any other way.

The composition with other parallelism axes is what actually ships at the frontier. The accepted layout for very large training runs places tensor parallelism inside a node, where its many per-layer all reduces can ride the high-bandwidth NVLink fabric, pipeline parallelism across nodes, where a single send and receive per microbatch tolerates inter-node latency, and data parallel or fully sharded data parallel across replicas of the TP-PP slice. The mapping principle is the one bandwidth engineers always come back to: put the most latency-sensitive collective on the fastest link.
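One way to picture that layout is as a mapping from a flat rank to three coordinates. The ordering below, with tensor parallel as the fastest-varying index, is an assumption for illustration; real frameworks choose their own rank orderings, and the names Coords, rank_to_coords, and node_of are hypothetical.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    dp: int  # data parallel replica
    pp: int  # pipeline stage
    tp: int  # tensor parallel shard


def rank_to_coords(rank: int, tp_size: int, pp_size: int, dp_size: int) -> Coords:
    """Decompose a flat rank with tp varying fastest, then pp, then dp."""
    world = tp_size * pp_size * dp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world of size {world}")
    return Coords(
        dp=rank // (tp_size * pp_size),
        pp=(rank // tp_size) % pp_size,
        tp=rank % tp_size,
    )


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Physical node, assuming ranks are packed onto nodes in order."""
    return rank // gpus_per_node


# 8-way TP inside a node, 4 pipeline stages, 2 data parallel replicas.
tp_size, pp_size, dp_size = 8, 4, 2
tp_group_nodes = {node_of(r) for r in range(tp_size)}  # ranks 0..7
pipeline_neighbours = (node_of(0), node_of(tp_size))   # stage 0 -> stage 1
example = rank_to_coords(9, tp_size, pp_size, dp_size)
```

With tp_size equal to the number of GPUs per node, every tensor parallel group lands on one node and adjacent pipeline stages land on different nodes, which is the placement described above.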

The operational complexity is real. Tuning microbatch count, choosing virtual pipeline size, balancing layers so the slowest stage does not strangle the schedule, picking the composition with TP and FSDP, debugging point to point hangs in production, and rerunning all of it when the model shape changes: this is a job, not a checkbox. Megatron-LM, DeepSpeed, and the hosted training stacks absorb a lot of it, but the gap between the formula and the running cluster is wider than most papers admit.

The frontier after zero bubble is busy. Groups are pursuing asynchronous schedules with stronger consistency than PipeDream, schedules that adapt to heterogeneous accelerators where one stage is on a faster card than another, and schedules tuned for inference rather than training, where deep mixture of experts models serve across a pipeline at low latency.

Perform Digital's view, from helping enterprises stand up agents on top of large open models, is that the schedule is rarely the bottleneck for a team running an agent stack. The bottleneck is choosing whether to train your own model at all, and if so, picking a base small enough that FSDP suffices and large enough to matter. Pipeline parallelism belongs in the toolbox, but it is the last tool to reach for, not the first.

Council summary

Pipeline parallelism exists because depth itself can outgrow a GPU, and the bubble fraction (p - 1) / (m + p - 1) sets the price that microbatching alone still pays. GPipe proved the idea, PipeDream contributed 1F1B and forced the field to confront synchronous versus asynchronous semantics, and Megatron's interleaved schedule trimmed the bubble further by giving each rank several virtual stages. The Sea AI Lab zero bubble work, posted in late 2023, closes the loop on synchronous schedules by splitting the backward into B and W phases. For most teams under 100 billion parameters, FSDP plus tensor parallelism remains the simpler choice. The lesson is that the schedule, the math, and the interconnect have to be read together.
