Here is a fact that should not be true. DeepSeek V3, a model that trades blows with GPT-4 class systems, carries 671 billion parameters. When it reads a word and predicts the next one, it uses 37 billion of them. The other 634 billion sit idle for that word, wake for the word after, and go quiet again. The model is enormous and the work is small, and it does this on purpose.
That trick is called mixture of experts, usually shortened to MoE, and in about two years it went from a research curiosity to the default way frontier models are built. DeepSeek V3 and R1 use it. So do Mistral's Mixtral, Meta's Llama 4, Alibaba's Qwen 3, Moonshot's Kimi K2, OpenAI's open-weight GPT-OSS models, and, by Google's own description, the Gemini line. GPT-4 is reported to be a mixture of experts too, though that rests on leaks rather than anything OpenAI has confirmed. If you used a new model in the last year, you almost certainly used one. This post is about what that means, why it took over, and what it costs.
Origin: an idea that waited thirty years
Start with the problem MoE solves. A standard language model is dense, which means every parameter is involved in every calculation. Ask a dense model to predict one token and it runs the entire network, billions of multiplications, top to bottom. This is the architecture behind GPT-3 and most models through 2023, and it has an obvious flaw. A model that knows French poetry, Python syntax, tax law, and protein folding fires all of that knowledge to answer a question about any one of them. It is the equivalent of waking every specialist in a hospital to look at a sprained ankle.
The fix is older than the deep learning boom. In 1991 Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton published "Adaptive Mixtures of Local Experts" (the original 1991 paper). The idea: instead of one network that learns everything, train several smaller networks, each free to specialize, plus a small "gating" network whose only job is to decide which specialist handles a given input. On a vowel-sounds task the experts split the work cleanly. The concept was sound, but the hardware and scale to make it matter did not exist, so it stayed a footnote.
It came back when scale became the bottleneck. In 2017 a Google team led by Noam Shazeer, with Hinton and Jeff Dean among the authors, published "Outrageously Large Neural Networks" (Shazeer et al., 2017). They took the 1991 idea and made it sparse. The gating network would pick only the top few experts for each input and ignore the rest, so adding experts added knowledge without adding much compute. They built a model with 137 billion parameters, startling for the time, and reported capacity gains of more than a thousandfold for a small efficiency cost. The key phrase is conditional computation: the model decides, per input, how much of itself to use.
Two Google papers turned the technique into infrastructure. GShard in 2020 scaled it past 600 billion parameters for translation. The Switch Transformer in 2021 simplified the routing to a single expert per token and crossed a trillion parameters, the first model to do so (Switch Transformers paper). For a few years MoE stayed mostly inside the big labs, hard to train and harder to serve. Then the open-weight world picked it up, and the floodgates opened.
Present: the router, and the parameter count that splits in two
To see why MoE won, you need two ideas. The first is the router. The second is what it does to the meaning of "model size."
A transformer is a stack of layers, and inside each layer sits a feed-forward block, the part that does most of the heavy thinking. In a dense model that block is one large network. An MoE model replaces it with many smaller networks, the experts, and adds a router, a tiny network that looks at the current token and scores every expert on how well it fits (Cameron Wolfe on MoE LLMs). The router picks the top few, usually two, sometimes one, sometimes eight, and only those experts run. Their outputs get blended, weighted by the router's confidence, and the token moves on. The next token routes fresh, and the next layer routes again, so a single sentence threads through hundreds of expert combinations on its way to an answer.
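The routing step described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not any particular model's implementation: the function names (`route_token`, `moe_layer`), the four scaling "experts," and the router scores are all hypothetical, and a real router would produce its scores from a learned layer applied to the token rather than take them as given.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and return (index, weight) pairs.

    The weights are the chosen experts' probabilities, renormalized to sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_layer(token, router_scores, experts, top_k=2):
    """Run only the selected experts and blend their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_scores, top_k):
        expert_out = experts[index](token)  # unselected experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy experts (hypothetical): each one just scales the token vector.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_scores = [0.1, 2.0, 0.3, 2.0]  # assumed scores for this one token

selected = route_token(router_scores, top_k=2)
blended = moe_layer(token, router_scores, experts, top_k=2)
print(selected, blended)
```

Only two of the four experts are ever called for this token, which is the whole point: the cost of the layer depends on `top_k`, not on how many experts exist.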
One correction worth making early, because it trips people up. The experts are not departments labeled "law" and "biology." Nobody assigns them topics. Each expert is a slice of network that, over training, drifts toward handling certain patterns because the router kept sending it similar tokens. The specialization is real but statistical and mostly uninterpretable. An expert might end up handling punctuation, or numbers, or a grammatical structure with no clean English name. "Expert" is a useful metaphor and a slightly misleading label.
Now the second idea, the one that matters most. MoE splits the parameter count into two numbers that used to be one.
- Total parameters is everything the model contains, every expert, stored on disk and loaded into memory.
- Active parameters is the subset that actually runs for a single token, which is the router, the shared layers, and only the handful of experts that got picked.
Compute, and therefore cost and speed, tracks the active number. Knowledge capacity tracks the total. A dense model has one figure for both. An MoE model gets to make them far apart, and that gap is the entire payoff. The numbers from shipping models make it concrete:
- Mixtral 8x7B (Mistral, December 2023): 47 billion total, about 13 billion active. Eight experts per layer, two picked per token. It matched or beat the dense Llama 2 70B while activating roughly a fifth as many parameters per token (Mixtral of Experts paper).
- DeepSeek V3 (December 2024): 671 billion total, 37 billion active. 256 routed experts per layer plus shared ones, with eight routed experts active per token (DeepSeek-V3 technical report).
- Llama 4 Scout and Maverick (Meta, April 2025): both run 17 billion active. Scout holds 109 billion total across 16 experts; Maverick holds 400 billion across 128 (Meta's Llama 4 announcement).
- GPT-OSS (OpenAI, August 2025): the 120 billion parameter model activates about 5.1 billion per token; the 20 billion model activates 3.6 billion (OpenAI on GPT-OSS).
- Kimi K2 (Moonshot, 2025): 1.04 trillion total, 32 billion active (Kimi K2 technical report).
Read the GPT-OSS line again. OpenAI ships a model named gpt-oss-120b that computes, per token, like a 5 billion parameter one. That is the deal MoE offers, and it explains the stampede. Dense scaling had become brutally expensive, because a bigger dense model costs proportionally more for every token it ever processes, in training and forever after in serving. MoE breaks that link. You buy a large total parameter count, which buys capability, and you pay the per-token bill of a small model. NVIDIA, which sells the hardware this runs on, reports that more than 60 percent of open-weight model releases in the last year use the architecture (NVIDIA on MoE frontier models). Cheaper tokens are part of the same force pulling down the price of running models across the board, covered separately in "the inference cost collapse."
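To see the gap as a ratio, here is a small sketch that takes the total and active figures quoted above and works out what share of each model runs per token. The GPT-OSS totals are the nominal sizes in the model names. The FLOPs estimate uses the common rule of thumb of about two floating-point operations per active parameter per token for a forward pass, which is an approximation added here, not a figure from any of the cited reports.

```python
# (total, active) parameter counts in billions, as quoted in the list above.
MODELS = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek V3": (671, 37),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "gpt-oss-120b": (120, 5.1),
    "gpt-oss-20b": (20, 3.6),
    "Kimi K2": (1040, 32),
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters that run for a single token."""
    return active_b / total_b

def flops_per_token(active_b):
    """Rough forward-pass cost, assuming ~2 FLOPs per active parameter."""
    return 2 * active_b * 1e9

for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    cost = flops_per_token(active_b)
    print(f"{name:18s} {share:6.1%} active, ~{cost:.1e} FLOPs/token")
```

Under that rule of thumb, per-token compute depends only on the active column. Two models with very different totals, such as Llama 4 Scout and Maverick, come out with the same estimate.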
The tradeoffs, stated plainly
A technique this dominant collects a fan club that forgets to mention the bill. MoE has four real costs, and anyone choosing an architecture should know them.
It does not save memory. This is the misconception that catches teams out. MoE cuts compute, not storage. Because the router might pick any expert for any token, every expert has to be loaded and ready. Mixtral activates only about 13 billion parameters per token, but all 47 billion must sit in fast memory. At 16-bit precision that is roughly 94GB of weights, more than a single 80GB GPU holds. Across MoE models the experts are usually the overwhelming majority of total parameters while only a sliver run at once (Mixture of Lookup Experts paper). You rent a big building to use a few rooms. The compute is cheap and the real estate is not.
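The memory arithmetic is simple enough to check by hand, and a short sketch makes it easy to rerun for other models. It counts weights only, using the standard byte sizes for each numeric format. The helper name `weight_memory_gb` is made up for illustration, and a real deployment also needs room for activations and the KV cache, which this ignores.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(total_params_billion, precision):
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return total_params_billion * BYTES_PER_PARAM[precision]

GPU_MEMORY_GB = 80  # a single 80GB accelerator, as in the text

# Mixtral: 47 billion parameters stored, about 13 billion active per token.
for precision in BYTES_PER_PARAM:
    needed = weight_memory_gb(47, precision)
    verdict = "fits" if needed <= GPU_MEMORY_GB else "does not fit"
    print(f"{precision}: {needed:6.1f} GB of weights, {verdict} on one GPU")

# Memory is sized by the total, not by the active subset.
print(weight_memory_gb(47, "bf16"), "GB stored vs",
      weight_memory_gb(13, "bf16"), "GB touched per token")
```

Quantizing to 8-bit or 4-bit shrinks the footprint, but the bill is still computed from the total count, never the active one.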
Training can go unstable. The router learns alongside everything else, and it has a bad habit. Early on a few experts get slightly better, so the router sends them more tokens, so they improve faster, so the router leans on them harder. The rest starve. This is routing collapse, and left alone it wastes most of the model (Cameron Wolfe on MoE LLMs). The standard fix is an auxiliary loss, an extra penalty that pushes the router toward spreading tokens evenly. It works, but it is a balancing act: too weak and the experts collapse, too strong and it fights the model's real objective and dents quality. DeepSeek's notable contribution here was an auxiliary-loss-free scheme that nudges balance with a per-expert bias term instead of a competing penalty (auxiliary-loss-free load balancing paper).
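For a feel of what that auxiliary penalty measures, here is a minimal sketch in the style of the Switch Transformer's load-balancing loss for top-1 routing: the number of experts, times the sum over experts of (fraction of tokens sent there) times (average router probability for it), scaled by a small coefficient. The function name, the toy probabilities, and the default `alpha` are illustrative assumptions. This is not DeepSeek's bias-based scheme, and it is not tied to any specific training codebase.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one list per token, holding that token's probability
    for each expert. Returns alpha * N * sum_i(f_i * P_i), where f_i is
    the fraction of tokens dispatched to expert i and P_i is the mean
    router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: how many tokens each expert actually received.
    dispatched = [0] * num_experts
    for probs in router_probs:
        winner = max(range(num_experts), key=lambda i: probs[i])
        dispatched[winner] += 1
    f = [count / num_tokens for count in dispatched]

    # P_i: how much probability the router gave each expert on average.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts. In the balanced case each token favors a
# different expert; in the collapsed case every token favors expert 0.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

The penalty is lowest when tokens spread evenly and grows as routing concentrates. The coefficient `alpha` is the knob behind the balancing act: it sets how hard this term pulls against the model's main objective.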
Serving is a distributed-systems problem. A trillion-parameter model does not fit on one GPU, so the experts get spread across many. But the router might send a token to an expert on a different chip, which means tokens get physically shipped between GPUs and results shipped back, every layer, constantly. NVIDIA describes this as a near-instant all-to-all communication pattern, and it strains badly once a model spans more than eight GPUs and the traffic falls back to slower scale-out networking (NVIDIA on MoE frontier models). Much of the recent hardware effort, fast interconnects like NVLink and MoE-aware serving software, exists to keep that traffic from becoming the bottleneck.
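A back-of-the-envelope sketch shows why the traffic adds up. It assumes the simplest possible setup: experts spread evenly across GPUs and a router that picks experts uniformly at random. Real systems deliberately break both assumptions to cut traffic. The layer count, top-k, and GPU count below are hypothetical round numbers, not the configuration of any model named in this post.

```python
def expected_remote_fraction(num_gpus):
    """Share of expert calls that land on a different GPU than the token,
    assuming experts are spread evenly and routing is uniform."""
    return 1 - 1 / num_gpus

def remote_sends_per_token(num_layers, top_k, num_gpus):
    """Expected cross-GPU dispatches for one token in one forward pass.

    Each dispatch also needs its result shipped back, so the number of
    transfers is roughly double this.
    """
    return num_layers * top_k * expected_remote_fraction(num_gpus)

# Hypothetical deployment: 60 MoE layers, 8 experts per token, 64 GPUs.
print(expected_remote_fraction(8))        # share of calls leaving an 8-GPU node's single chip
print(remote_sends_per_token(60, 8, 64))  # expected sends per token
```

Under these assumptions nearly every expert call leaves the local chip once the GPU count is large, which is why interconnect speed matters as much as raw compute.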
Fine-tuning is fussier. Sparse models are more prone to overfitting on small datasets, and they need different settings than the dense models teams are used to (IBM on mixture of experts). The knowledge is there, but tuning it without disturbing the routing takes more care.
None of this stopped the architecture. It means MoE is a clear win for a large model served at scale to many users, and a worse fit for a model that has to squeeze onto a single small GPU, where a dense model of the same memory footprint will use that memory better.
Future and impact: more experts, smaller experts, faster traffic
The first direction is granularity. Early MoE models used a few wide experts: Mixtral had eight per layer. Newer ones use many narrow ones: DeepSeek V3 runs 256 routed experts per layer. Research on scaling laws for fine-grained MoE found that splitting capacity into more, smaller experts improves the quality-per-compute trade, with the right expert count being a parameter you can tune like any other (scaling laws for fine-grained MoE). DeepSeek pairs this with shared experts, a couple that run for every token to hold common knowledge, freeing the routed experts to specialize harder. Expect more experts, smaller, with smarter routing.
The second is economics, and it reaches beyond research. DeepSeek V3 trained in about 2.79 million GPU hours for a reported 5.6 million dollars, a fraction of what a dense model of comparable strength would cost (DeepLearning.AI on DeepSeek V3). The MoE design is a large reason that number is small. Cheap to train and cheap per token, MoE is a major force behind capable open-weight models arriving faster and frontier capability spreading past the handful of labs that could once afford it.
The third is the engineering frontier moving. With routing collapse largely tamed, the hard problem now is traffic: moving tokens between experts fast enough that communication does not eat the compute saving. New work tackles it from several angles, including lookup-based experts that swap heavy computation for cheap memory reads (Mixture of Lookup Experts paper). The next phase is less about how many experts fit and more about how fast work moves between them.
There is a quieter point worth holding onto. MoE is part of a broader retreat from the assumption that one model should do everything at full power all the time. Routing to a subset of experts inside a model rhymes with routing easy requests to small language models and saving frontier models for genuinely hard ones. The shape is the same at both scales: stop spending maximum compute on every token and every task, and spend it only where it earns its keep.
Council summary
This post argues that mixture of experts won the architecture race because it broke a link everyone assumed was permanent: that a bigger model must cost more for every token it processes. By replacing one dense feed-forward block with many expert sub-networks and a router that wakes only a few per token, MoE splits model size into two numbers, total parameters for knowledge and active parameters for cost, and lets them drift far apart. The piece teaches the core mechanics without math, the router as a dispatcher and the experts as statistical specialists rather than labeled departments, and grounds every claim in shipping models whose figures we verified, from Mixtral's 13 billion active to Kimi K2's trillion total. Its real value is the honest tradeoff section: MoE saves compute but not memory, invites training instability through routing collapse, turns serving into a distributed-systems problem, and complicates fine-tuning. A reader should leave able to read a model card, explain why a 671 billion parameter model can be cheap to run, and judge when that bargain applies and when a dense model is the smarter buy.