Small Language Models: The Case Against Always Going Biggest

Most of what an AI agent does is small, repetitive, and narrow. Sending it all to a frontier model is like hiring a surgeon to take blood pressure.

Watch what an AI agent actually does for an hour and the work looks nothing like the demo. There is no dazzling burst of reasoning, just a long string of small jobs. Read this email and decide if it is a complaint or a question. Pull the order number out of this paragraph. Turn this messy reply into clean JSON. Decide which of four tools to call next. These tasks are narrow, they repeat thousands of times a day, and not one needs a model that can also discuss Kant or debug a distributed database.

Yet most teams route every one of those calls to a frontier model anyway: GPT-5, Claude Opus, Gemini Pro, the largest thing the budget allows. It is a reflex, and an expensive one. It assumes more capability is always better, or at least never worse. For agent work that is wrong, and getting it wrong costs an order of magnitude on the invoice for output a smaller model would have produced just as well.

This piece makes the case for the other option: the small language model. That means roughly the 1 to 15 billion parameter range: cheap enough to run by the million, fast enough to feel instant, compact enough to sit on a laptop or a phone. Such models are not a downgrade you accept to save money. For a large share of real agent work they are the correct tool, and the frontier model is the overqualified one.

Where the small model came from

For the first stretch of the language model era, bigger simply meant better. From GPT-2 to GPT-3 to GPT-4, each generation got larger and more capable, and the scaling laws said capability rises predictably with parameters, data, and compute. The natural conclusion was that the smartest move is always the biggest model you can afford.

Two things bent that logic. The first was that small models stopped being bad. Microsoft's Phi line was the clearest proof. Phi-2, at 2.7 billion parameters, was trained not on a bigger pile of internet text but on a filtered set of "textbook quality" data, and on coding and multi-step math it matched Llama-2-70B, a model twenty-five times its size. Data quality and training method, not raw parameter count, carried much of the load. SmolLM, Qwen, Gemma and others followed. Today a sub-10-billion model routinely matches a 2024-era frontier model on standard benchmarks. NVIDIA's researchers put the point precisely: the curve between model size and capability keeps steepening, so each new generation of small models sits closer to the large models of the one before.

The second was that the shape of the work changed. The chatbot era asked a model one open-ended question and wanted one good answer, so generality was worth paying for. An agent decomposes a goal into many small steps and calls a model for each. The job is no longer "be brilliant at anything," it is "do this one narrow thing reliably, ten thousand times." Different job, different winner.

NVIDIA's researchers offer a practical definition of "small": a model that runs on a normal consumer device and answers fast enough to serve one person in real time, which for now means roughly anything under about 10 billion parameters. The useful test is not the parameter count but whether the model runs cheaply, locally, and fast.

Why a small model is so often the right call

Start with cost and speed, because the gap is not subtle. Serving a 7-billion model runs roughly 10 to 30 times cheaper than serving a 70-billion-plus model, across compute, latency, and energy, and published API prices put the spread between a frontier flagship and a small model above 100 times per token. A small model also answers far faster. For an agent that fires eight or twelve calls before producing anything visible, and loops that all day, the cost gap decides whether the margin survives and the latency gap decides whether it feels responsive or stalls at every step. This is the economics that made loop-heavy agents affordable, covered in the inference cost collapse.
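To make the gap concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a placeholder assumption (the two price points, the token counts, the ten calls per task, the daily volume), chosen only to illustrate a 100x per-token spread; none of them is a quote from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Assumed prices in dollars per million tokens (placeholders, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_task(price: ModelPrice, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent task that makes `calls` model calls of equal size."""
    per_call = (
        input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    ) / 1_000_000
    return calls * per_call


# Hypothetical price points with a 100x per-token spread.
FRONTIER = ModelPrice(input_per_mtok=10.00, output_per_mtok=30.00)
SMALL = ModelPrice(input_per_mtok=0.10, output_per_mtok=0.30)

if __name__ == "__main__":
    tasks_per_day = 100_000  # assumed volume
    for name, price in (("frontier", FRONTIER), ("small", SMALL)):
        per_task = cost_per_task(price, calls=10, input_tokens=1_500, output_tokens=200)
        print(f"{name}: ${per_task:.4f} per task, ${per_task * tasks_per_day:,.0f} per day")
```

Under these assumed numbers a ten-call task costs about 21 cents on the frontier model and about a fifth of a cent on the small one. The absolute figures matter less than the fact that the per-call gap is multiplied by every step in the loop.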

Second, a small model runs where a large one cannot: on a single GPU, often a consumer one, sometimes a laptop or a phone. That buys two things a frontier API cannot. One is offline and edge operation, an agent that works with no network round trip. The other is privacy: a model inside your own environment never ships data to someone else's cloud, the difference between "we can use AI here" and "legal said no" for a hospital or a bank.

Third, and least obvious, a small model can be the more accurate choice for a narrow task. A frontier model is a generalist coaxed into one tiny job through a long prompt. A small model fine-tuned on that exact job does one thing well, and can be locked to a single output format so it stops wrapping JSON in chatty preamble or renaming fields. On the Berkeley Function Calling Leaderboard, the benchmark for the tool calling that agents depend on, purpose-built small models trade blows with far larger ones. Salesforce's xLAM family, built specifically for tool calling, has ranked at the top of that board, and its 8-billion member holds its own against frontier flagships many times its size.
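Whichever model produces the output, the agent still has to check the format before acting on it. The sketch below shows one way to do that strictly, using only the Python standard library. The two-field schema (`intent`, `order_id`) and the function name `parse_strict` are made up for illustration; they are not from any framework.

```python
import json

# Hypothetical schema for one narrow task: field name -> required Python type.
EXPECTED_FIELDS = {"intent": str, "order_id": str}


def parse_strict(raw: str) -> dict:
    """Return the parsed object, or raise ValueError if the output strays from the format."""
    try:
        data = json.loads(raw)  # fails on chatty preamble or trailing text
    except json.JSONDecodeError as exc:
        raise ValueError(f"not bare JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(EXPECTED_FIELDS):  # catches renamed, missing, or extra fields
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data
```

The rejection rate of a check like this, measured per model on the same inputs, is a cheap first signal of how well each one holds the format.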

The NVIDIA argument: small models are the future of agentic AI

In June 2025, NVIDIA researchers published a position paper with a blunt title: "Small Language Models are the Future of Agentic AI". Coming from the company that sells the hardware large models run on, that is notable. The paper makes three claims: small models are already powerful enough for the language tasks agents actually need, they fit the strict output formats agents demand better, and for the large majority of calls they are simply cheaper.

The observation underneath all three is the sharpest part. An agent, the paper notes, is a heavily scripted gateway that uses a narrow sliver of a model's ability. You pay for a generalist trained to do everything, then write careful prompts to stop it doing all but one small thing. The paper measures the waste by auditing three open-source agents. In MetaGPT, a multi-agent coding system, it judges that about 60 percent of model calls could be handled by a specialized small model. For Open Operator, a workflow agent, about 40 percent. For Cradle, which drives a computer's GUI, about 70 percent. The hard reasoning stays with the large model; everything else does not need it.

Why has this not already happened, if the case is so clear? The reasons are not technical. Tens of billions of dollars are sunk into centralized large-model API infrastructure, and that capital pulls the industry toward one default. Small models are still judged on generalist benchmarks that undersell their agentic strengths, and they get a fraction of the marketing attention. These are barriers, not flaws. Inertia erodes.

How to route between small and frontier models

Accepting the argument does not mean throwing away frontier models. It means putting each model where it earns its cost, and the pattern for that is routing. A router sits between your agent and several models and, for each call, picks which one handles the request. The dividing line is task difficulty. Send the small model the high-volume, narrow work: classification, entity extraction, intent detection, format conversion, routing decisions, templated generation, short summaries. Reserve the frontier model for what genuinely needs it: open-ended reasoning, hard planning, ambiguous problems, the long-context analysis where a generalist's breadth pays off.
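In its simplest form the dividing line is a lookup table. The sketch below is a minimal rule-based router in Python; the task-type labels mirror the list above, while the names `Tier`, `SMALL_MODEL_TASKS`, and `route` are illustrative assumptions rather than part of any routing product.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    FRONTIER = "frontier"


# High-volume, narrow task types that go to the small model (labels are illustrative).
SMALL_MODEL_TASKS = frozenset({
    "classification",
    "entity_extraction",
    "intent_detection",
    "format_conversion",
    "routing_decision",
    "templated_generation",
    "short_summary",
})


def route(task_type: str) -> Tier:
    """Pick a tier for one call; anything not explicitly listed falls back to the frontier model."""
    return Tier.SMALL if task_type in SMALL_MODEL_TASKS else Tier.FRONTIER
```

Defaulting unknown task types to the frontier model is a deliberate choice in this sketch: a new kind of call costs more until someone has checked that a small model handles it, rather than failing quietly.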

Routing comes in two forms: hand-written rules mapping task types to models, or a learned router. RouteLLM, an open framework from LMSYS, trains a router on preference data and reports a striking result: on one benchmark it matched 95 percent of the strong model's quality while sending only about a quarter of calls to it. Commercial routers from Martian, Not Diamond, and others sell the same idea as a managed service, with reported savings of 20 to 97 percent depending on the workload.
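A learned router usually reduces to a score and a threshold: the router estimates how much a call needs the strong model, and the threshold sets how many calls get it. The sketch below covers only the threshold step, in plain Python. It is not RouteLLM's API; the scores are assumed to come from some already-trained scorer, and `calibrate_threshold` and `use_strong_model` are hypothetical names.

```python
def calibrate_threshold(scores: list[float], strong_fraction: float) -> float:
    """Choose a cutoff so roughly `strong_fraction` of sample calls score at or above it.

    `scores` are assumed router outputs on a sample of logged calls, where a
    higher score means the call is more likely to need the strong model.
    """
    if not scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))  # how many sample calls go to the strong model
    if k == 0:
        return float("inf")  # no score can reach this, so nothing is sent up
    return ranked[k - 1]


def use_strong_model(score: float, threshold: float) -> bool:
    """Route to the strong model when the score reaches the calibrated cutoff."""
    return score >= threshold
```

Tied scores at the cutoff all go to the strong model, so the realized fraction can land slightly above the target. The fraction is a budget knob, and the quality it buys has to be measured on your own workload.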

The NVIDIA paper adds a clean way to find your own routing line. Instrument the agent, log every model call, cluster the logs, and the repeated narrow patterns surface on their own. For each one, take the frontier model's logged outputs as training data, fine-tune a small model to match them, and route that task down. You start with a frontier model everywhere, watch where it is wasted, and move that work to a specialist as you go. This is the instinct behind mixture of experts inside a single model, applied one level up to the choice of model itself.
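The log-and-cluster step can start far simpler than it sounds. The sketch below is a crude stand-in for real clustering, not the paper's method: it assumes each logged prompt begins with a fixed instruction followed by a variable payload, normalizes digits and whitespace, and groups prompts by their opening characters. The function names and the 40-character prefix are arbitrary illustrative choices.

```python
import re
from collections import Counter


def template_of(prompt: str, prefix_len: int = 40) -> str:
    """Reduce a prompt to a rough template: lowercase, digits masked, leading prefix only."""
    text = re.sub(r"\d+", "<num>", prompt.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return text[:prefix_len]


def frequent_patterns(prompts: list[str], min_share: float = 0.05) -> list[tuple[str, int]]:
    """Return (template, count) pairs that account for at least `min_share` of all calls."""
    if not prompts:
        return []
    counts = Counter(template_of(p) for p in prompts)
    total = len(prompts)
    return [(tmpl, n) for tmpl, n in counts.most_common() if n / total >= min_share]
```

Each template that survives the cut is a candidate for a specialist. If prompts do not share a stable prefix, a grouping based on embeddings would be the natural next step, at the cost of an extra dependency.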

The real tradeoffs

A small model is not a free win, and pretending otherwise gets a team burned. A small model knows less: it has compressed less of the world into its weights, so it is weaker on open-domain factual recall. The fix is to stop relying on its memory. Retrieval-augmented generation supplies the facts at runtime, and a small model with good retrieval beats a large model guessing from stale training data. For an agent, where most knowledge should come from tools anyway, this matters less than it first looks.

A small model also reasons less well. On long, genuinely hard chains of logic, a frontier model still pulls ahead. And distillation, the obvious shortcut, is not a guaranteed transfer of intelligence: research finds that models at the smaller end do not always benefit from a big model's long reasoning traces, and that training a small model on a larger one's outputs can increase hallucination when the teacher asserts things the student has no grounding for. It needs care.

And the fine-tuning is work. The whole case assumes you specialize the small model, and specialization has real cost: collecting and cleaning training data, running the tuning, evaluating, maintaining it as requirements drift. Methods like LoRA and QLoRA make the tuning itself a few GPU-hours, cheap enough to run overnight, but the surrounding work is not zero. The honest framing is a trade: ongoing per-call savings and lower latency for upfront specialization effort. At low volume the frontier model wins. At the volume real agents run, the trade pays off, often within months.
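The trade can be written down as a break-even point. The sketch below is a simplified model with assumed inputs: a one-time specialization cost and a flat per-call cost for each model. It ignores ongoing maintenance, which would push the break-even later, and the function names are illustrative.

```python
import math
from typing import Optional


def break_even_calls(
    upfront_cost: float,
    frontier_cost_per_call: float,
    small_cost_per_call: float,
) -> Optional[int]:
    """Calls needed before per-call savings repay the one-time specialization cost.

    Returns None when the small model is not cheaper per call, so the cost is never repaid.
    """
    saving = frontier_cost_per_call - small_cost_per_call
    if saving <= 0:
        return None
    return math.ceil(upfront_cost / saving)


def break_even_days(
    upfront_cost: float,
    frontier_cost_per_call: float,
    small_cost_per_call: float,
    calls_per_day: int,
) -> Optional[float]:
    """The same break-even point expressed in days at a steady call volume."""
    calls = break_even_calls(upfront_cost, frontier_cost_per_call, small_cost_per_call)
    return None if calls is None else calls / calls_per_day
```

Plugging in your own data-collection, tuning, and evaluation costs shows why volume decides the question: the break-even call count is fixed, and the daily call volume determines whether it arrives in days or never in practice.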

The model families worth knowing

The small-model field is crowded and genuinely good. The names worth tracking:

  • Microsoft Phi. Proved curated training data beats brute scale. Phi-4-mini, 3.8 billion parameters, reasons in the range of 7-to-9-billion models, MIT licensed.
  • Google Gemma. Open models from the same research as Gemini, roughly 1 billion to 27 billion. Its E2B and E4B variants, introduced with Gemma 3n and continued in Gemma 4, are tuned for phones and edge: the "E" is for effective parameters, and they run in 2 to 3 gigabytes despite larger raw weights.
  • Alibaba Qwen. A deep family from 0.6 billion up, with strong agentic and coding scores and consistently good results as a base for fine-tuning. The default starting point for many teams building specialized models.
  • Hugging Face SmolLM. Fully open and documented, down to sub-billion sizes that run in a browser. SmolLM3 at 3 billion outperforms other models its size and rivals some 4-billion ones.
  • NVIDIA Nemotron. Hybrid Mamba-Transformer designs aimed at agent workloads. The recent Nemotron 3 Nano Omni is a roughly 30-billion-parameter mixture-of-experts model that activates only about 3 billion parameters per token.
  • DeepSeek R1 distillations. Reasoning-focused models distilled from the much larger R1; the small end of the set runs from 1.5 to 8 billion parameters, with larger distillations up to 70 billion. Proof that reasoning ability, not just knowledge, packs small.
  • On-device models from Apple and Google. Apple ships a roughly 3-billion parameter model on the iPhone and exposes it to developers; Google's Gemini Nano does the same on Android. The biggest consumer-tech companies have decided the model in your pocket should be small.

Where this is heading

The near-term direction is set, and Gartner's read is concrete: by 2027 it expects organizations to use task-specific models around three times as often as general-purpose ones. The shift will be quiet, because nobody markets it the way a frontier launch is marketed. Three things push it along: small models keep improving faster than large ones because the headroom at the small end is larger; fine-tuning keeps getting cheaper and more automated; and on-device hardware keeps getting better, widening the set of agents that run with no cloud at all.

The future for serious agent builders is heterogeneous by default: a fleet of cheap specialists doing the bulk of the work, a frontier model in reserve for the hard calls, and a router between them. Building that well is real engineering, and the evaluation that proves a small model is good enough before it ships is the part teams most often skip. That is the kind of work Perform Digital does.

The deeper point outlasts any model name. Reaching for the biggest model is a habit from the chatbot era, and the right tool for high-volume narrow work has never been the largest option on the shelf. Match the model to the task. The next decision, whether to self-host an open model or call a closed API, is its own question, covered in open weight versus closed models.

Council summary

This post argues that routing every agent call to a frontier model is a costly habit from the chatbot era. Most agent work is narrow and repetitive, and small models in the 1 to 15 billion parameter range now handle it reliably while running far cheaper, faster, and often on local hardware. Grounded in NVIDIA's agentic-AI paper, the piece names the model families worth tracking and gives a concrete routing recipe: instrument the agent, find the repeated patterns, fine-tune a specialist for each, and keep the frontier model in reserve for the genuinely hard calls. The takeaway is to match the model to the task, because the task is usually smaller than the reflex assumes.
