A dense 70 billion parameter model and a 671 billion parameter MoE can be served at comparable per-token compute cost. The first does it by being small. The second does it by leaving most of itself idle for any given token: only a few experts run per token, and the rest of the parameters wait until some other token needs them. The catch: the few experts the MoE runs for each token might live on different GPUs from where the token sits. So the token gets shipped across the network, processed on whatever chip its experts live on, and shipped back, at every MoE layer, for every token in every training step. Expert parallelism (EP) is the name for that arrangement, and it is the only parallelism strategy whose communication pattern is decided fresh by the model itself, token by token, while training is running.

This is the seventh and closing post in a series on distributed training. Earlier posts covered data parallelism, tensor parallelism, pipeline parallelism, ZeRO sharding, communication collectives, and the 3D blends frontier labs use. EP is the strangest of them and, as of 2026, the most important new one. The piece builds on an open-source explainer from Conscious Engines, which makes the point that EP differs because routing is data-dependent. Our framing and sources are our own.

Origin: a sparse idea that needed a network

If you already know what an MoE is, skip this section. A short refresher is on the mixture-of-experts post. The two facts you need: an MoE layer replaces the feed-forward block of a transformer with many small expert networks plus a router, and the router picks a top-k subset of experts per token, for example 2 of 8 in Mixtral or 8 of 256 in DeepSeek-V3.

The trouble starts when you ask where those experts physically live. A Mixtral expert is small: the eight in a layer fit on one GPU, and you can run the whole layer locally. A DeepSeek-V3 layer has 256 routed experts plus one shared expert, each a feed-forward block in its own right. They do not fit. So you split them across GPUs, one slice of the expert pool per device. That is EP in one line: experts get partitioned across the GPUs in a group, and a token routed to expert 42 has to reach the GPU that holds expert 42.
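To make "the GPU that holds expert 42" concrete, here is a minimal sketch of one possible placement rule. It assumes a contiguous block layout (experts 0 to 3 on rank 0, 4 to 7 on rank 1, and so on); the function name `owner_rank` is hypothetical, and real frameworks may place experts differently.

```python
def owner_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Return the EP rank holding `expert_id` under a contiguous block layout.

    Assumption: experts are split evenly and in order across the EP group.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly across the EP group")
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


# 256 routed experts over a 64-way EP group gives 4 experts per GPU,
# so expert 42 lands on rank 42 // 4.
print(owner_rank(42, num_experts=256, ep_size=64))  # 10
```

Any token routed to expert 42 by any GPU in the group has to be delivered to that one rank, which is where the communication cost comes from.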

The idea was introduced at production scale in GShard (Lepikhin et al., 2020), which trained a 600 billion parameter translation model on 2048 TPU v3 chips in four days. GShard used top-2 gating and laid out how to dispatch tokens to experts on remote devices using the all-to-all collective, then combine results back. Switch Transformer cut routing to top-1 and pushed past a trillion parameters (Fedus et al., 2021). The communication shape they used is still the shape every modern MoE trainer uses.

The all-to-all collective, the part everyone underestimates

A transformer training step uses three big communication patterns. Dense data parallelism needs all-reduce, where every GPU contributes a tensor and every GPU gets the sum (or the average). Tensor parallelism uses all-gather and reduce-scatter, where every GPU sends one slice and receives the whole or a reduced version. Both move a payload whose size is set by the model's shape, so the traffic looks the same every step.

EP uses all-to-all, which is different. Every GPU sends a different chunk of data to every other GPU. The traffic matrix is dense by construction. If you have 64 GPUs, all 64 are talking to each other at once, and the payload each pair exchanges is decided by the router for that step.
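The exchange described above can be simulated in a few lines of plain Python. This is only a single-process sketch of the semantics, not a networked implementation: `send[src][dst]` stands for the chunk that rank `src` sends to rank `dst`, and the chunks are deliberately different sizes.

```python
def all_to_all(send):
    """Simulate all-to-all on one machine.

    send[src][dst] is the chunk rank `src` sends to rank `dst`.
    The result recv[dst][src] is that same chunk, as seen by the receiver.
    """
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


# Three ranks, uneven chunks: the sizes are whatever the router decided.
send = [
    [["a0"], ["a1", "a2"], []],
    [[], ["b0"], ["b1"]],
    [["c0", "c1"], [], ["c2"]],
]
recv = all_to_all(send)
print(recv[1])  # [['a1', 'a2'], ['b0'], []]
```

Every rank both sends to and receives from every other rank, which is why the traffic matrix has no empty rows or columns by design, only entries that happen to be empty on a given step.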

That last clause is the catch. With dense parallelism the payload size is fixed at compile time. With EP it is not. The router looks at the batch, picks the top-k experts per token, and only then does the runtime know how many tokens are going from GPU 3 to GPU 17. Every layer rolls the dice again. The all-to-all has to handle a payload whose size it cannot predict, twice per layer: dispatch to experts, then combine expert outputs back to the right tokens on the right GPUs.
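A small sketch shows why the payload is only known after routing. It assumes the contiguous expert layout used for illustration earlier and counts one message per (token, expert) pair; the helper name `send_counts` is hypothetical, and production dispatchers may deduplicate a token that goes to two experts on the same rank.

```python
from collections import Counter


def send_counts(topk_experts, num_experts, ep_size):
    """Count the (token, expert) pairs this rank must send to each EP rank.

    topk_experts[t] is the list of expert ids the router chose for token t.
    Assumption: experts are laid out in contiguous blocks across ranks.
    """
    experts_per_rank = num_experts // ep_size
    counts = Counter()
    for experts in topk_experts:
        for expert_id in experts:
            counts[expert_id // experts_per_rank] += 1
    return [counts[rank] for rank in range(ep_size)]


# 4 local tokens, top-2 routing, 8 experts over 4 ranks (2 experts per rank).
topk = [[0, 5], [1, 4], [0, 1], [7, 5]]
print(send_counts(topk, num_experts=8, ep_size=4))  # [4, 0, 3, 1]
```

The list changes with every batch and every layer, so the split sizes for the dispatch all-to-all have to be computed, and usually exchanged, before the payload itself can move.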

NCCL on NVIDIA and RCCL on AMD have long supported all-to-all exchanges, but early implementations were not tuned for the MoE pattern, where messages are tiny, latency-sensitive, and asymmetric. DeepSeek's open-source DeepEP library is the cleanest current answer: GPU-initiated dispatch and combine kernels, FP8 on the wire to halve dispatch payload relative to BF16, NVLink for intra-node hops, RDMA for inter-node. Its inference path runs low-latency kernels at batch sizes as small as a single token. NVIDIA's NCCL added a dedicated EP API for the same shape.

Top-k routing, capacity factor, and the tokens you drop

Two numbers govern an MoE layer's behaviour. The first is k, how many experts each token uses. The second is the capacity factor, how much imbalance the system tolerates before it starts throwing tokens away.

Top-k varies more than people realise. Switch Transformer set k = 1 for the simplest possible routing. Mixtral 8x7B uses k = 2 (Mixtral paper). GPT-OSS-120b, OpenAI's open-weight MoE released in August 2025, uses k = 4 over 128 experts (gpt-oss model card). DeepSeek-V3 and Qwen3-235B-A22B both use k = 8: DeepSeek picks 8 of 256 routed experts plus one shared expert that runs for every token, Qwen3 picks 8 of 128 (Qwen3 release post). The trend is upward. More experts, narrower experts, slightly higher k. Recent scaling-law work on fine-grained MoE found that splitting capacity into many small experts beats the older few-wide-experts design at the same compute budget.
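For readers who want to see the selection step itself, here is a generic top-k router for a single token: softmax over the router logits, keep the k largest, renormalise the kept weights. It is a sketch of the common pattern, not the exact gating function of any model named above (DeepSeek-V3, for instance, scores experts differently), and the function name `top_k_route` is hypothetical.

```python
import math


def top_k_route(logits, k):
    """Return (expert_ids, weights) for one token.

    Softmax over all experts, keep the k most probable, then renormalise
    the kept probabilities so the mixing weights sum to 1.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / kept for i in chosen]


experts, weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(experts)  # [1, 3]
```

The layer output for the token is then the weighted sum of the chosen experts' outputs, so raising k widens both the compute and the number of destinations each token may have to reach.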

Now the second number. In a perfect world the router would dispatch tokens uniformly, and under top-1 routing each of the N experts would get exactly (tokens in batch) / N of them. In practice the router has favourites, especially early in training. To stop a single expert from being asked to process a thousand tokens when its GPU has buffer space for only two hundred, the system pre-commits a fixed buffer per expert, sized as

expert_capacity = (tokens_per_batch / num_experts) × capacity_factor

with a capacity factor typically between 1.0 and 1.5. Set it at 1.25 and you give every expert 25 percent more headroom than its fair share. Anything beyond that buffer is a dropped token: the router picked an expert, the expert was full, the token skips the layer entirely and passes through unchanged (a residual connection saves it from being lost). Switch Transformer's JMLR write-up gives the full derivation. Higher capacity factor means fewer dropped tokens but more wasted compute on empty buffer slots. Lower capacity factor means tighter compute but more dropped tokens, and with them worse quality.
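The formula and the drop rule above can be worked through on a toy batch. This sketch assumes top-1 routing, first-come-first-served buffer filling, and rounding the capacity up to a whole token; implementations differ on the rounding and on which tokens get dropped, and both function names are hypothetical.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Buffer slots per expert. Rounding up is an assumption of this sketch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """assignments[t] is the expert chosen for token t (top-1).

    Returns (kept, dropped) token indices, filling buffers in token order.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # token skips the layer via the residual path
    return kept, dropped


# 8 tokens, 4 experts, factor 1.25: fair share is 2, capacity is 3.
cap = expert_capacity(tokens_per_batch=8, num_experts=4, capacity_factor=1.25)
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 1, 2, 3, 0], 4, cap)
print(cap, dropped)  # 3 [3, 7]
```

Expert 0 was chosen five times but has three slots, so two tokens are dropped, while experts 1 to 3 each leave two slots empty. That is the trade in miniature: unused slots on one side, dropped tokens on the other.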

DeepSeek-V3's training did not drop tokens at all. The team kept capacity loose enough to admit every token and relied on a different mechanism to keep experts balanced. We will get to that next.

Load balancing, and why routing collapses on its own

If you do not actively prevent it, a fresh MoE model collapses to a handful of experts. The cause is a plain feedback loop, not a mystery. Suppose expert 7 starts a hair better than the others. The router sends it a few more tokens. It improves faster than the rest. The router sends it even more. Within a few thousand steps the model is using a fraction of its experts and most of the parameters are dead weight. This is routing collapse, and every MoE paper from GShard onward has had to fix it.

The standard fix is the auxiliary load-balancing loss, introduced in GShard and refined in Switch. It penalises the model whenever the fraction of tokens routed to each expert deviates from uniform, with a small scaling coefficient so balance nudges the router without dominating learning. It works, and it also fights the model in a small way the whole time. The router gets optimised for two objectives at once, prediction and balance, and the balance term is a synthetic constraint that does not necessarily make the model better at its job.
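One common form of that penalty, the one given in the Switch Transformer paper, multiplies two per-expert quantities: the fraction of tokens actually dispatched to the expert and the mean router probability it received. The sketch below follows that formulation for top-1 routing; the coefficient value and the function name are illustrative assumptions.

```python
def load_balance_loss(router_probs, assignments, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs[t][i]: router probability of expert i for token t.
    assignments[t]:     expert actually chosen for token t (top-1).
    f_i is the fraction of tokens sent to expert i, P_i its mean probability.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    p = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1])
collapsed = load_balance_loss([[1.0, 0.0], [1.0, 0.0]], [0, 0])
print(balanced < collapsed)  # True
```

With perfectly uniform routing the sum equals 1/N and the loss reduces to alpha; if everything goes to one expert it rises to alpha times N. In a real trainer only the probability term carries gradient, since the dispatch fraction comes from a hard argmax.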

DeepSeek introduced an alternative that has spread fast. Instead of adding a loss term, the team adds a per-expert bias to the routing logits and nudges that bias up or down based on recent expert usage. The bias steers routing toward balance without contaminating gradients used for prediction. The paper calls it auxiliary-loss-free load balancing (DeepSeek-V3 report, elaborated in Wang et al., 2024). DeepSeek-V3 trained on 14.8 trillion tokens for 2.788 million H800 GPU hours without an irrecoverable loss spike. Balance, by their account, was never the bottleneck.
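The bias mechanism described above can be sketched as two small functions. This is a simplified reading of the idea, assuming a fixed step size and a sign-based update; the names `select_top_k` and `update_bias` and the step value are hypothetical, and the actual schedule in DeepSeek's training is not reproduced here.

```python
def select_top_k(scores, bias, k):
    """Pick experts by (score + bias). The bias influences selection only;
    the mixing weights would still be computed from the raw scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]


def update_bias(bias, load, step=0.001):
    """Nudge each expert's bias down if it was overloaded, up if underloaded.

    load[i] is the number of tokens expert i received in the recent batch.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > mean_load:
            new_bias.append(b - step)
        elif tokens < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Expert 0 is hot, so its bias drops and the others rise.
bias = update_bias([0.0, 0.0, 0.0], load=[10, 2, 0], step=0.1)
# A token that slightly preferred expert 0 now goes to expert 1 instead.
print(select_top_k([0.5, 0.45, 0.1], bias, k=1))  # [1]
```

Because the bias is adjusted by a rule outside the loss, no balance gradient flows into the router weights, which is the property the paragraph above refers to.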

The practical failure mode is not collapse itself, which auxiliary loss handles, but the imbalanced steady state that survives it. Even a well-trained MoE has hot and cold experts at inference time, with utilisation varying fivefold or tenfold between them. If a hot expert lives on GPU 0, GPU 0 stays busy and the rest of the cluster sits partly idle. EP serving stacks either accept that slack or replicate hot experts onto multiple GPUs, which costs more memory. There is no version where this is free.
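A toy calculation shows what replication buys and what it costs. The sketch assumes an expert's tokens are split evenly across its copies, which is a simplification; the token counts, the placement lists, and the function name `gpu_load` are made up for illustration.

```python
def gpu_load(expert_tokens, placement, num_gpus):
    """Tokens processed per GPU.

    expert_tokens[e]: tokens routed to expert e.
    placement[e]:     GPUs holding a copy of expert e.
    Assumption: an expert's tokens are split evenly over its replicas.
    """
    load = [0.0] * num_gpus
    for expert, tokens in enumerate(expert_tokens):
        replicas = placement[expert]
        for gpu in replicas:
            load[gpu] += tokens / len(replicas)
    return load


tokens = [1000, 200, 200, 200]               # expert 0 is five times hotter
one_copy = [[0], [1], [2], [3]]              # one expert per GPU
replicated = [[0, 1, 2, 3], [1], [2], [3]]   # hot expert copied everywhere

print(gpu_load(tokens, one_copy, 4))    # [1000.0, 200.0, 200.0, 200.0]
print(gpu_load(tokens, replicated, 4))  # [250.0, 450.0, 450.0, 450.0]
```

The slowest GPU sets the step time, so the peak drops from 1000 to 450 tokens, but three GPUs now hold an extra copy of the hot expert's weights. That extra memory is the cost the paragraph above describes.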

Present: how EP composes with the rest of the stack

Expert parallelism is never used on its own. Frontier training stacks combine four or five strategies in the same job, and EP slots in as one more axis. DeepSeek-V3's training cluster ran 2048 H800 GPUs organised as 16-way pipeline parallelism by 64-way expert parallelism spanning 8 nodes, with ZeRO-1 data parallelism on top, and their custom DualPipe scheduler overlapping computation and communication. The inference deployment described in the same report uses 32 GPUs (4 nodes) as the minimum unit for prefill and 320 GPUs for decode, running EP32 and EP320 respectively, with redundant experts replicated to spread load on the hottest routes (DeepSeek-V3 paper).
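The training figures above imply a few derived numbers that are easy to check by hand. The arithmetic below uses only values quoted in this post (256 routed experts comes from the earlier section); the last line additionally assumes that EP groups tile each pipeline stage evenly, which is an assumption of this sketch rather than something the report is quoted as saying.

```python
# Figures quoted in the text.
total_gpus = 2048
pp_stages = 16       # 16-way pipeline parallelism
ep_size = 64         # 64-way expert parallelism
ep_nodes = 8         # each EP group spans 8 nodes
routed_experts = 256

gpus_per_stage = total_gpus // pp_stages      # GPUs working on one pipeline stage
gpus_per_node = ep_size // ep_nodes           # GPUs per node inside an EP group
experts_per_gpu = routed_experts // ep_size   # routed experts held per GPU, per MoE layer

# Assumption: EP groups tile a pipeline stage with no GPUs left over.
ep_groups_per_stage = gpus_per_stage // ep_size

print(gpus_per_stage, gpus_per_node, experts_per_gpu, ep_groups_per_stage)  # 128 8 4 2
```

Under that assumption each stage holds two full copies of the expert pool, and those copies are what the data-parallel dimension synchronises.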

The composition rules are mostly common sense. Data parallelism replicates the experts across GPU groups for throughput. Tensor parallelism splits a single expert's matrices when an expert is too big for one device, rare with fine-grained experts but useful for shared ones. Pipeline parallelism splits the model by layer stages. ZeRO shards the optimiser state and gradients across replicas. EP partitions the expert pool. The catch is that adding EP makes every other dimension a little harder, because the data-dependent all-to-all cannot be pre-planned the way a fixed all-reduce can.

The major frameworks now ship EP first-class. NVIDIA's Megatron-Core supports EP composed with TP, PP, and DP, with token dispatchers built around all-to-all. Microsoft's DeepSpeed-MoE has five EP configurations, including ZeRO-offloaded variants that swap expert weights to CPU for memory-constrained training (DeepSpeed-MoE tutorial). PyTorch's native primitives cover the smaller end. None of these existed in usable form before 2022.

Every frontier MoE this year uses EP at training time. The lineup is short and worth knowing. DeepSeek-V3 and the V3.x line led the wave. Mixtral 8x7B and Mixtral 8x22B from Mistral. Qwen3-235B-A22B and Qwen3-30B-A3B from Alibaba. GPT-OSS-120b from OpenAI. Kimi K2 from Moonshot, a trillion-parameter MoE that activates 32 billion parameters per token. NVIDIA reports that more than 60 percent of open-weight releases in the last year use the architecture (NVIDIA on MoE frontier models).

Future and impact: bigger EP groups, smarter routing, prefill-decode splits

Three directions are visible from where we sit in 2026.

First, EP groups keep growing. The decode deployment in the DeepSeek-V3 report runs EP320 across 320 GPUs. LMSYS published a recipe for serving the same model on 96 H100s with EP as the dominant axis (LMSYS large-scale EP blog). The arithmetic favours wider EP: fewer experts per GPU means each expert sees a larger share of the batch and runs hotter, and modern interconnects handle the message-heavy traffic well. The ceiling sits wherever inter-node networking stops feeding the GPUs.

Second, routing is getting cleverer. Auxiliary-loss-free balancing is becoming the default. Per-sequence and per-batch routing variants are being tried to cut the imbalance variance that hurts serving. Expert-choice routing, where experts pick tokens instead of the other way around, guarantees balance by construction but breaks autoregressive decoding without extra work. Most production systems stay with token-choice routing.
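Expert-choice routing is easiest to understand by flipping the loop: instead of each token ranking experts, each expert ranks tokens and takes a fixed number. The sketch below is a minimal illustration with a made-up score matrix and a hypothetical function name, not a reproduction of any published implementation.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert takes its `capacity` highest-scoring tokens, so every
    expert receives exactly the same number of tokens.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    picks = {}
    for expert in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][expert], reverse=True)
        picks[expert] = sorted(ranked[:capacity])
    return picks


scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
]
print(expert_choice(scores, capacity=2))  # {0: [0, 1], 1: [2, 3]}
```

Balance holds by construction, but whether a token is picked depends on the other tokens it is ranked against. During autoregressive decoding the later tokens do not exist yet, which is one way to see why the scheme needs extra work there.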

Third, the prefill-decode split reshapes how EP gets deployed. Prefill runs at high arithmetic intensity and tolerates latency, so it uses smaller EP groups and wider TP. Decode is memory-bandwidth-bound and latency-sensitive, so it benefits from very wide EP, FP8 dispatch, and the low-latency kernels DeepEP ships. Mixed deployments running prefill and decode on different GPU pools, each with its own parallelism mix, are now standard at scale.

The honest risk is that EP ties a single model's failure modes to cluster health. A dropped link in the all-to-all fabric does not slow training the way a sluggish gradient sync does; it stalls it. Detection and rerouting around those faults is now production engineering territory. So is the cost picture. EP is cheap per step when the network is fast and the experts are balanced, and when either slips, the savings evaporate and the model becomes more expensive to run than a dense equivalent.

That is the picture. An MoE offers a large saving by routing each token to only a few of its experts. EP is the systems trick that lets the experts live anywhere in the cluster while the tokens still find them, paid for in the all-to-all collective and in the balance between top-k width, capacity factor, and load-balancing loss. The shape is decided by the model itself, every step, and that is what makes it both powerful and awkward. Without it the frontier MoE wave of 2024 and 2025 would not have shipped. With it, the cluster has a new failure mode and the network has a new boss.

Council summary

Expert parallelism solved the routing problem that GShard introduced and Switch Transformer scaled. Its defining feature is that the communication pattern is rewritten by the router every layer, every step, which is why all-to-all sits at the centre of the design. The 2024 and 2025 generation, DeepSeek-V3, Mixtral, Qwen3, GPT-OSS, Kimi K2, all train and serve this way, with auxiliary-loss-free balancing and FP8 dispatch as the techniques that pulled it from interesting to standard. The trade is real: EP is cheap when the network is fast and experts are balanced, expensive when either slips. Anyone running MoE at scale ends up owning that trade.

Expert Parallelism: How MoE Routes Tokens Across GPUs
