The Inference Cost Collapse, and What It Quietly Unlocked

Inference costs fell by as much as a thousandfold in three years. That collapse, more than any benchmark, is why the agent era arrived when it did.

When GPT-4 launched in March 2023, calling it cost 30 dollars per million input tokens and 60 dollars per million output tokens. A developer running a chatbot that handled a few thousand conversations a day could watch a five-figure monthly bill assemble itself out of nothing but text. Three years later, a model that scores as well as or better on the same benchmarks costs a small fraction of that. Depending on exactly which level of capability you measure, the price to buy a fixed unit of it has fallen somewhere between sixtyfold and close to a thousandfold.

That collapse does not get the attention it deserves. The headlines go to new models, new benchmark records, new context windows. But the quiet fact underneath all of it is that the cost of one unit of machine reasoning fell off a cliff, and kept falling. It is the single most important reason the agent era began when it did, and it is worth understanding properly.

Where the cliff came from

For the first few years of the large language model business, inference was simply expensive. A model with hundreds of billions of dense parameters had to wake every one of them to produce a single token. That cost was passed straight to the customer, and it set a hard ceiling on what anyone could build. You could afford a chatbot that answered one question. You could not afford a system that asked the model a question, read the answer, asked a follow-up, checked itself, and tried again twenty times.

Then prices started moving, and they moved faster than almost anyone forecast. Andreessen Horowitz tracked the trend and gave it a name: LLMflation. Their measure was simple. Take a fixed level of quality, in this case a model good enough to score 42 on the MMLU benchmark, and watch what it costs to buy that quality over time. When GPT-3 first became generally available in November 2021, that quality cost 60 dollars per million tokens. By late 2024 the cheapest model clearing the same bar, a small Llama variant served by a third party, cost 6 cents. A thousandfold drop in roughly three years, or about 10x every year. The firm pointed out that this is faster than the cost decline of compute during the personal computer era and faster than the fall in bandwidth cost during the dotcom build-out. Higher up the quality scale the drop is real but smaller: a16z put the fall for GPT-4-level quality, an MMLU of 83, at roughly 62x between its March 2023 debut and late 2024.
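The "10x every year" figure is just the total drop spread evenly across the years. A minimal sketch of that arithmetic, using only the 60 dollar and 6 cent prices quoted above and assuming a three-year window (the function name is illustrative, not from a16z):

```python
def annual_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly factor by which price must fall to get from start to end."""
    total_drop = start_price / end_price
    return total_drop ** (1.0 / years)


# Prices per million tokens quoted above for MMLU-42 quality.
START_PRICE = 60.00  # dollars, November 2021
END_PRICE = 0.06     # dollars, late 2024
YEARS = 3            # "roughly three years", treated as exactly three here

total_drop = START_PRICE / END_PRICE
per_year = annual_factor(START_PRICE, END_PRICE, YEARS)
print(f"total drop: {total_drop:.0f}x, equivalent yearly drop: {per_year:.1f}x")
```

The same function applies to any other pair of prices, as long as the start and end points clear the same quality bar.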

Epoch AI, which studies these trends carefully, found the same shape with an important wrinkle. The decline is real but it is not uniform. Measured across six benchmarks, the price to hold a given capability level dropped at a median of around 50x per year, but the rate ranged from 9x at the slow end to 900x at the fast end depending on which capability you tracked. The price to match GPT-4 on a set of PhD-level science questions fell about 40x per year. The fastest declines all started after January 2024. Their honest caveat is that newer models may be more tuned to these specific benchmarks, so part of the apparent gain is measurement, not pure progress. The trend is still dramatic. It is just not a single clean number.

The OpenAI price list tells the same story without any benchmark mathematics. GPT-4 at launch in 2023: 30 and 60 dollars per million tokens. GPT-4 Turbo later that year cut that to 10 dollars in and 30 out. GPT-4o in May 2024 launched at 5 dollars in and 15 out, and a later 2024 snapshot of the model dropped to 2.50 in and 10 out. GPT-4o mini, released in July 2024, came in at 15 cents and 60 cents, a model cheap enough that the per-call cost barely registers. Each step was a price cut at equal or better quality.

Why it fell

No single thing produced the collapse. Several independent forces pushed in the same direction at once, which is why the slope is so steep.

The first is hardware. Each new GPU generation delivers more inference throughput per dollar, roughly two to three times more than the one before. That alone is a steady tailwind, though it is the slowest of the forces and accounts for only part of the story.

The second is software, and this is where a surprising amount of the gain hides. Early inference setups left GPUs idle a great deal of the time, running at perhaps 30 to 40 percent utilization. Inference engines such as vLLM changed that. PagedAttention treats the model's memory the way an operating system treats RAM, in reusable pages, which cuts memory waste sharply. Continuous batching keeps the GPU fed by slotting new requests in beside running ones so the chip is never waiting. Speculative decoding runs a small fast model to draft several tokens, then has the large model check them in one pass, which raises the number of tokens produced per second without hurting quality. Stack these and utilization climbs to 70 or 80 percent. The same physical chip serves far more customers, and the cost per token drops with no new hardware at all.
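To see why utilization alone moves the price, here is a rough sketch. The GPU rental rate and peak throughput are hypothetical placeholders, and utilization is simplified to mean the share of peak throughput actually delivered. Only the 35 and 75 percent figures come from the ranges above.

```python
def cost_per_million_tokens(
    gpu_dollars_per_hour: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Serving cost per million tokens when only a share of peak throughput is used."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


GPU_DOLLARS_PER_HOUR = 2.00     # hypothetical rental price
PEAK_TOKENS_PER_SECOND = 2_000  # hypothetical peak throughput for one GPU

before = cost_per_million_tokens(GPU_DOLLARS_PER_HOUR, PEAK_TOKENS_PER_SECOND, 0.35)
after = cost_per_million_tokens(GPU_DOLLARS_PER_HOUR, PEAK_TOKENS_PER_SECOND, 0.75)
saving = before / after
print(f"before: ${before:.2f}, after: ${after:.2f}, cost falls {saving:.1f}x")
```

Whatever the placeholder values, the ratio depends only on the two utilization figures, so the same chip at the same rental price serves tokens at less than half the cost.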

The third is the model itself. Mixture-of-experts designs route each token to a small subset of the network instead of the whole thing, so a model with a trillion total parameters might only activate a few tens of billions per token. That delivers frontier-quality output at three to five times lower compute cost than an equivalent dense model. Alongside that, smaller models kept getting better. A compact model in 2026 often matches a much larger model from a year or two earlier, and a smaller model is simply cheaper to run. Quantization helps too: storing and computing weights at 8-bit or 4-bit precision rather than 16-bit cuts memory and compute by a further two to four times with little quality loss. Distillation, where a small model is trained to copy a large one, packages much of that gain into a permanently cheaper model.
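Two of these effects reduce to simple arithmetic. The sketch below uses hypothetical model sizes: a 70-billion-parameter dense model for quantization, and a trillion-parameter mixture-of-experts model with 40 billion active parameters as one reading of "a few tens of billions". It counts weight storage only and ignores activations and caches.

```python
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model that runs for any one token."""
    return active_params / total_params


DENSE_PARAMS = 70e9       # hypothetical dense model
MOE_TOTAL_PARAMS = 1e12   # hypothetical mixture-of-experts model
MOE_ACTIVE_PARAMS = 40e9  # hypothetical active parameters per token

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(DENSE_PARAMS, bits):.0f} GB")

share = active_fraction(MOE_TOTAL_PARAMS, MOE_ACTIVE_PARAMS)
print(f"active per token: {share:.0%} of the network")
```

The active share is not the same thing as the three-to-five-times saving quoted above. That figure compares against a dense model of equal quality, which is much smaller than the mixture-of-experts model's total parameter count.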

A careful arXiv analysis tried to separate these. It divided the overall price decline by hardware progress and concluded that algorithmic efficiency, the software and model side, is improving at around 3x per year on its own. Hardware contributes roughly 30 percent a year. Multiply the two, 3 times 1.3, and technical progress alone comes to close to 4x a year, which covers a large part, though not all, of the steep curve the market actually saw.
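Because the two rates compound rather than add, the product is what matters. A short sketch, taking the 3x and 30 percent figures above at face value and assuming both hold steady for three years:

```python
ALGORITHMIC_GAIN_PER_YEAR = 3.0  # "around 3x per year"
HARDWARE_GAIN_PER_YEAR = 1.3     # "roughly 30 percent a year"
YEARS = 3                        # assumed horizon for illustration

combined_per_year = ALGORITHMIC_GAIN_PER_YEAR * HARDWARE_GAIN_PER_YEAR
combined_total = combined_per_year ** YEARS
print(f"combined: {combined_per_year:.1f}x per year, {combined_total:.0f}x over {YEARS} years")
```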

The fourth force is competition. When DeepSeek released its V3 and R1 models with open weights, it did two things at once. It priced its API far below the incumbents, with R1 running on the order of 90 to 95 percent cheaper than the closest closed reasoning model. And because the weights were public, any inference provider could host the same model and compete on price. Open weights turn a model into a commodity that many vendors sell, and commodities do not hold high margins. That competitive floor pulls the whole market down.

The quiet enabler of the agent era

Here is why this matters more than a benchmark record. An agent is not one model call. An agent is a loop. It plans, it calls a tool, it reads the result, it checks its own work, it tries again, sometimes it spawns sub-agents to handle pieces of the job. A single instruction like "review this contract and flag the risks" can quietly trigger eight to twelve model calls before anything appears on screen. A genuinely complex multi-agent task can burn through hundreds of calls and well over a million tokens.
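The loop is easy to see in miniature. The sketch below is a toy, not any real agent framework. `fake_model` is a hypothetical stand-in that only counts calls and a rough token total, and the fixed 50-token reply size is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    calls: int = 0
    tokens: int = 0


def fake_model(prompt: str, usage: Usage) -> str:
    """Hypothetical stand-in for a model call; records one call and a rough token count."""
    usage.calls += 1
    usage.tokens += len(prompt.split()) + 50  # assume every reply is about 50 tokens
    return f"reply to: {prompt[:30]}"


def run_agent(task: str, max_steps: int = 4) -> Usage:
    """Plan once, then act, observe, and self-check on each step, then summarize."""
    usage = Usage()
    plan = fake_model(f"plan: {task}", usage)
    for step in range(max_steps):
        action = fake_model(f"step {step}, choose a tool call for: {plan}", usage)
        observation = f"tool output for {action}"  # the tool runs; no model call here
        fake_model(f"check this result: {observation}", usage)
    fake_model(f"write the final answer for: {task}", usage)
    return usage


usage = run_agent("review this contract and flag the risks")
print(f"model calls: {usage.calls}, rough tokens: {usage.tokens}")
```

With four steps the toy makes ten model calls for one instruction, which sits inside the eight-to-twelve range above. Every extra step or sub-agent adds more.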

Run that math at 2023 prices. A task that consumes a million tokens of GPT-4 output would have cost around 60 dollars. No business builds a product where every routine task costs 60 dollars in raw model fees. The economics simply forbid it. The agent loop, the plan-act-observe-repeat pattern that defines the whole current generation of AI software, was not waiting on a cleverer model. For the most part the models could already do it. It was waiting on the price.

The collapse removed the gate. Run the same million-token task on GPT-4o mini at 60 cents per million output tokens and the raw model fee is well under a dollar. On a small open-weight model served by a third party it can land in single-digit cents. A cost that once killed the product outright now rounds to noise on the invoice. Suddenly it is reasonable to let a model think in a long chain, to have it verify its own reasoning, to call a model dozens of times in a row, to leave a monitoring agent running around the clock. The behaviors that make an agent useful, the iteration and the self-correction, are exactly the behaviors that were too expensive before. Cheap inference is the substrate the agent era is built on. It rarely gets named as the cause because it is not a product you can launch. It is a background condition. But it is the condition that made the rest possible.
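The before-and-after comparison is one multiplication. A minimal sketch using the output prices quoted above and counting output tokens only, which is a simplification because real tasks also pay for input tokens:

```python
def task_cost(output_tokens: int, dollars_per_million: float) -> float:
    """Raw model fee for a task, counting output tokens only."""
    return output_tokens / 1_000_000 * dollars_per_million


TASK_TOKENS = 1_000_000  # the million-token task from the text

OUTPUT_PRICES = {
    "GPT-4 at launch": 60.00,  # dollars per million output tokens
    "GPT-4o mini": 0.60,
}

costs = {name: task_cost(TASK_TOKENS, price) for name, price in OUTPUT_PRICES.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per task")
```

Multiply either figure by the number of tasks a product runs each day and the gap between a viable product and an unviable one becomes obvious.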

The counter-force: spending is going up, not down

There is a twist, and it catches a lot of people. If the price per token fell a thousandfold, you might expect AI bills to shrink. They are doing the opposite.

Menlo Ventures put enterprise generative AI spending at 11.5 billion dollars in 2024 and 37 billion in 2025, more than a tripling, roughly 3.2x, in the same year prices were collapsing. This is the Jevons paradox, named for the 19th-century economist who noticed that more efficient steam engines did not reduce Britain's coal use. They increased it, because cheaper steam power made coal worth burning for far more things.

The same logic governs tokens. When inference was expensive, price was a gatekeeper. It quietly killed every use case that could not justify the cost, which kept usage modest. Take the gatekeeper away and the suppressed demand floods in. Teams that ran one careful AI feature now run dozens. Agentic workloads, which consume 5 to 30 times more tokens per task than a simple chatbot, become normal. The cost per unit fell, the number of units exploded by more, and the bill went up. This is not a market failure. It is what a successful efficiency gain looks like. It does mean a CFO who budgeted for falling AI costs is in for a surprise, and it means cost has to be an engineering concern, tracked and capped, not an afterthought.
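Since spending is price times volume, the growth in token volume can be backed out from the other two. In the sketch below the spending figures are the Menlo Ventures numbers above. The blended one-year price drops are hypothetical scenarios, not measured values.

```python
def implied_volume_growth(spend_growth: float, price_drop: float) -> float:
    """If spend = price x volume, volume must grow by spend growth times the price drop."""
    return spend_growth * price_drop


SPEND_2024 = 11.5  # billions of dollars
SPEND_2025 = 37.0  # billions of dollars
spend_growth = SPEND_2025 / SPEND_2024

# Hypothetical blended per-token price drops over the same year.
for price_drop in (2.0, 5.0, 10.0):
    volume = implied_volume_growth(spend_growth, price_drop)
    print(f"price down {price_drop:.0f}x -> token volume up about {volume:.0f}x")
```

Even under the mildest scenario, usage has to grow several times over for the bill to triple while unit prices fall.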

Does the cliff keep going down?

Probably yes, but more gently. Gartner predicts that by 2030, running inference on a trillion-parameter model will cost providers more than 90 percent less than in 2025, driven by better silicon, inference-specialized chips, higher utilization, and smarter model design. That is a real continued decline. It is also a far cry from 10x a year.
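Converting the Gartner figure to a yearly rate makes the contrast concrete. A quick sketch, reading "more than 90 percent less" as a cost that falls to one tenth and assuming an even decline over the five years from 2025 to 2030:

```python
REMAINING_SHARE = 0.10  # 90 percent less means 10 percent of the 2025 cost remains
YEARS = 5               # 2025 to 2030

total_factor = 1 / REMAINING_SHARE
yearly_factor = total_factor ** (1 / YEARS)
print(f"{total_factor:.0f}x over {YEARS} years is about {yearly_factor:.2f}x per year")
```

That works out to a little under 1.6x a year, a floor rather than a point estimate because the prediction says "more than" 90 percent.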

The reasons to expect a slower slope are concrete. The early software wins, the move from 40 percent to 80 percent utilization, were one-time gains and have largely been collected. Quantization cannot go much below 4-bit without quality damage. Hardware progress keeps to its steadier pace. Several forecasts now point to something like 3x to 5x a year through 2027, then slower. Two other things temper the picture. Providers are unlikely to pass every cent of their own savings to customers, and the deepest discounts have always landed on commodity open-weight models rather than the frontier. Cutting-edge reasoning capability stays comparatively expensive.

None of that changes the conclusion. The collapse already happened. A drop of up to a thousandfold in three years is the move that reset what software can do, and even a slower decline from here keeps widening the set of things an agent can afford to attempt. The price stopped being the reason not to build, and it is not going to become that reason again.

Council summary

This post argues that the steep fall in LLM inference cost, somewhere between sixtyfold and close to a thousandfold over three years depending on the capability measured, is the real reason the agent era arrived when it did, not any single model breakthrough. It traces the decline to four stacked forces, better hardware, inference software like vLLM, model efficiency from mixture-of-experts and quantization, and open-weight competition led by DeepSeek, then explains why agents, which loop and self-correct across many calls, only became affordable once the price fell. It is honest about the twist: cheaper tokens have driven enterprise generative AI spending up, not down, a textbook Jevons paradox, and it expects the curve to flatten to roughly 3x to 5x a year. The reader's takeaway is twofold. The cheap-inference unlock is permanent and already priced into what software can attempt, but token cost is now an engineering line item to track and cap, not a bill that will shrink on its own.
