For about two years the going advice for getting good output from a language model was to phrase the request carefully. Add a role. Give step-by-step instructions. Promise it a tip, threaten it, tell it to take a deep breath. People collected these tricks, sold courses on them, and put "prompt engineer" on business cards. Then, sometime in 2025, the advice stopped working as the main lever. Not because the tricks were wrong, but because the thing they optimized had stopped being the bottleneck.
The job did not disappear. It moved. The hard part of building with a model is no longer the sentence you type. It is everything else the model reads before it answers: the system instructions, the retrieved documents, the tool definitions, the outputs those tools returned, the running conversation, the notes from three steps ago. Designing that whole payload, deciding what goes in and what stays out, is a different discipline with a different name. The field settled on context engineering, and the shift changes what a competent AI build actually looks like.
Origin: why phrasing stopped being the bottleneck
Two things happened at once. Models got good, and single questions turned into loops.
The first half is easy to see. A 2023 model needed coaxing. You phrased a request three ways and one worked. By 2025 the frontier models followed plain instructions reliably, and the marginal return on a cleverer sentence collapsed. The trick-collecting era ended because the tricks stopped buying much.
The second half is the real cause. The unit of work shifted from a single model call to an agent loop. An agent is given a goal and runs in a cycle: it plans a step, calls a tool, reads what comes back, and decides what to do next, repeating until the task is done. A loop like that does not have one prompt. It has a prompt that gets rebuilt on every turn, growing as the agent works. After ten tool calls the model is reading its original instructions plus ten rounds of accumulated output. Nobody phrased that. It assembled itself. The question stopped being "how do I word this" and became "what should be in the window on turn eleven, and what should I have dropped by then."
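A minimal sketch of that loop makes the growth visible. The model and the tool here are hypothetical stubs (`fake_model`, `fake_tool`), standing in for real calls; the only thing the sketch claims to show is that the input on each turn is whatever the loop has accumulated so far.

```python
from typing import Callable

def run_agent(goal: str,
              model: Callable[[list[str]], str],
              tool: Callable[[str], str],
              max_turns: int = 10) -> list[str]:
    """Run a toy agent loop and return the final context as a list of messages."""
    context = [f"SYSTEM: You are an agent. Goal: {goal}"]
    for turn in range(1, max_turns + 1):
        # The "prompt" for this turn is everything accumulated so far.
        action = model(context)
        if action == "DONE":
            break
        observation = tool(action)
        # Nobody phrases these lines. The loop appends them.
        context.append(f"ACTION {turn}: {action}")
        context.append(f"OBSERVATION {turn}: {observation}")
    return context

# Hypothetical stand-ins so the sketch runs without a real model or tool.
def fake_model(context: list[str]) -> str:
    return "DONE" if len(context) > 20 else "search"

def fake_tool(action: str) -> str:
    return f"result of {action}"

final_context = run_agent("find the answer", fake_model, fake_tool)
print(len(final_context))  # 21: one system message plus ten action/observation pairs
```

After ten tool calls the window holds the original instructions plus twenty appended lines, none of which a person wrote.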
The term crystallized fast, and it came from people building real systems. In June 2025 Shopify's CEO Tobi Lütke posted that he preferred "context engineering" to "prompt engineering" because it named the actual skill: the art of providing all the context for a task to be plausibly solvable by the model. Andrej Karpathy endorsed it days later, with a sharper point: people hear "prompt" and think of a short instruction typed into a chatbot, but in any industrial-strength application, he wrote, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Simon Willison, who had spent years defending the term "prompt engineering", conceded the point: that term had been lost to its inferred meaning, the one where it just means typing into a chatbot. "Context engineering", he argued, had a better chance of meaning what it should.
So the distinction is not marketing. Prompt engineering is about how you ask. Context engineering is about what the model gets to see when you ask. The first is one sentence. The second is the entire information state, and in an agent that state is the system you are building.
Present: what a context engineer actually decides
The clearest working definition comes from Philipp Schmid, who calls it the discipline of building dynamic systems that supply the right information and tools, in the right format, at the right time, so the model can finish a task. The operative word is dynamic. A prompt is static text. Context is assembled fresh on every turn, and assembling it well means making four decisions, over and over.
What to include. Into the window go the system instructions, the user's request, the conversation so far, retrieved knowledge, tool definitions, tool outputs, and any long-term memory. Every one competes for the same fixed token budget. A context engineer decides which actually earn their place this turn.
What to leave out. This is the one people get wrong. The instinct is to give the model everything, on the theory that more information cannot hurt. It can, and it does, which is the next section. Schmid's blunt summary: most agent failures are no longer model failures, they are context failures. The model could have done it. It was handed the wrong stuff, or too much stuff.
How to order it. Models do not read a long context evenly. They tend to attend most to what sits at the start and the end, and to lose track of what is in the middle. The instruction that has to survive goes where attention is strongest, not buried in the eleventh tool output.
When to compact. A loop that runs long will fill its window. Before it does, someone has to decide what to summarize, what to drop, and what to carry forward verbatim. Get this wrong and the agent either forgets its goal or drowns in its own history.
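The four decisions can be sketched as one assembly function. Everything in it is an assumption made for illustration: the `Item` type, the hand-assigned `priority` scores, the word-count stand-in for a tokenizer, and the 0.8 compaction threshold. A real system would use the model's own tokenizer and a less naive scoring rule.

```python
from dataclasses import dataclass

@dataclass
class Item:
    kind: str      # e.g. "system", "user", "retrieved", "tool_output"
    text: str
    priority: int  # higher means more useful this turn (hypothetical scoring)

def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def assemble(items: list[Item], budget: int, compact_at: float = 0.8):
    # Include / leave out: admit items by priority until the budget is spent.
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        cost = count_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    # Order: system instructions first, the user request last, the rest between.
    rank = {"system": 0, "user": 2}
    chosen.sort(key=lambda i: rank.get(i.kind, 1))
    # Compact: flag when the window is filling up.
    needs_compaction = used >= compact_at * budget
    return chosen, used, needs_compaction

items = [
    Item("tool_output", "stale listing of forty files " * 5, 1),
    Item("system", "Answer using only the provided documents", 10),
    Item("retrieved", "Refund window is thirty days", 5),
    Item("user", "What is the refund window", 9),
]
chosen, used, needs_compaction = assemble(items, budget=24)
print([i.kind for i in chosen], used, needs_compaction)
```

With a 24-token budget the stale tool output is the item that gets left out, and the user request lands at the end of the window rather than in the middle.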
This is genuine engineering, not wording. It is closer to memory management or cache design than to copywriting, which is why the job title "context engineer" started appearing in real listings through late 2025 and into 2026, asking for people who can build retrieval systems and manage state.
Present: the core techniques
The practical work has converged on a small set of moves. LangChain groups them into four verbs: write, select, compress, isolate. That is a clean way to hold the field in your head.
Write means putting information somewhere outside the context window so it survives without costing tokens: a scratchpad for intermediate findings, a memory store that persists across sessions. The Manus team, who wrote up their lessons from building a production agent, treat the file system itself as unlimited external memory: the agent reads and writes files on demand, so it can drop a web page's full text from context while keeping the URL to fetch it again. Anthropic's agents use a related trick, structured note-taking, maintaining a running notes file that gets reloaded when needed. Their agent playing Pokemon kept tallies across thousands of game steps this way, well past any single context window.
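A rough sketch of the write move, assuming a hypothetical `Scratchpad` class backed by a temporary directory. It is not how Manus or Anthropic implement it; it only shows the trade the paragraph describes, where the bulky content goes to disk and a short reference stays in the window.

```python
import tempfile
from pathlib import Path

class Scratchpad:
    """Hypothetical external memory: notes live on disk, not in the window."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, content: str) -> str:
        path = self.root / name
        path.write_text(content, encoding="utf-8")
        # Only this short reference needs to stay in context.
        return str(path)

    def read(self, ref: str) -> str:
        return Path(ref).read_text(encoding="utf-8")

pad = Scratchpad(Path(tempfile.mkdtemp()))
page_text = "long page body " * 1000          # stands in for a fetched web page
ref = pad.write("page_001.txt", page_text)
context = [f"Saved page text at {ref}"]       # the full text is not kept in context
restored = pad.read(ref)                      # reloaded only when a step needs it
```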
Select means pulling the right information back in at the moment it is needed, rather than preloading it all. This is retrieval, and it is where RAG sits inside the larger discipline. Anthropic frames the modern version as just-in-time: the agent holds lightweight references (file paths, queries, links) and loads the actual content at runtime only when a step calls for it. Claude Code works this way, writing targeted queries against a large database instead of reading the whole thing into context. Retrieval quality compounds: Anthropic's contextual retrieval work showed that prepending a short explanatory blurb to each chunk before indexing cut retrieval failures by 49 percent, and by 67 percent once reranking was added. Selection is not a solved checkbox. It is where careful work still moves the number.
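The blurb idea is easy to see in a toy index. The sketch below is an assumption-heavy illustration, not Anthropic's pipeline: the chunks and blurbs are made-up data, and `score` is plain word overlap standing in for a real embedding or BM25 ranker.

```python
def contextualize(chunk: str, blurb: str) -> str:
    # Prepend a short explanatory blurb so the chunk carries its own context.
    return f"{blurb}\n{chunk}"

def score(query: str, text: str) -> int:
    # Toy relevance: number of shared lowercase words (not BM25, not embeddings).
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, index: dict[str, str], k: int = 1) -> list[str]:
    ranked = sorted(index, key=lambda ref: score(query, index[ref]), reverse=True)
    return ranked[:k]

# Made-up chunks: on their own, neither says which company it is about.
chunks = {
    "acme_q2.txt#3": "Revenue grew 3% over the previous quarter.",
    "globex_q2.txt#3": "Revenue fell 1% over the previous quarter.",
}
blurbs = {
    "acme_q2.txt#3": "From the ACME Q2 report.",
    "globex_q2.txt#3": "From the Globex Q2 report.",
}
index = {ref: contextualize(text, blurbs[ref]) for ref, text in chunks.items()}

query = "acme revenue growth"
top = retrieve(query, index)
print(top)
```

Scored against the bare chunks, the two candidates tie, because the word that distinguishes them never appears in the chunk text. The blurb is what breaks the tie.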
Compress means reducing what is already in the window without losing what matters. Compaction is the headline technique: when a conversation nears its limit, summarize it and restart from the summary. Anthropic's advice is to tune the compaction prompt for recall first, capturing everything that might matter, then tighten for precision by cutting redundant tool outputs. Claude Code does this automatically near 95 percent of window capacity, preserving architectural decisions and unresolved bugs while discarding stale file reads.
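A sketch of threshold-triggered compaction, under stated assumptions: the 0.95 default mirrors the figure above, but the function, the word-count token estimate, and the stub `summarize` are hypothetical, and this is not a description of how Claude Code implements it. In practice `summarize` would be a model call driven by the tuned compaction prompt.

```python
from typing import Callable

def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def maybe_compact(history: list[str],
                  window: int,
                  summarize: Callable[[list[str]], str],
                  threshold: float = 0.95,
                  keep_last: int = 2) -> list[str]:
    """Replace older messages with a summary once the window is nearly full."""
    used = sum(count_tokens(m) for m in history)
    if used < threshold * window:
        return history
    split = max(len(history) - keep_last, 0)
    head, tail = history[:split], history[split:]
    # Older turns collapse into one summary; recent turns stay verbatim.
    return [summarize(head)] + tail

def stub_summarize(messages: list[str]) -> str:
    return f"SUMMARY of {len(messages)} earlier messages"

history = ["word " * 10 for _ in range(10)]   # ten messages, ten tokens each
compacted = maybe_compact(history, window=100, summarize=stub_summarize)
untouched = maybe_compact(history, window=200, summarize=stub_summarize)
```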
Isolate means splitting context across separate agents so no single window has to hold everything. A sub-agent gets a clean window, a narrow task, and its own tools. It does the messy exploration, burns tens of thousands of tokens, and hands back a tight summary of a thousand or two. The orchestrator never sees the mess. Anthropic built their multi-agent research system on exactly this, and were candid that it uses around fifteen times the tokens of a single chat, which only pays off when the task is valuable enough. The point of a sub-agent is not role-play. It is a clean context.
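The isolation pattern reduces to a function boundary. In this sketch `run_subagent`, `explore`, and `summarize` are all hypothetical names, and the exploration is faked with filler text; the point is only that the orchestrator's context receives the summary and never the raw material.

```python
from typing import Callable

def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def run_subagent(task: str,
                 explore: Callable[[str], list[str]],
                 summarize: Callable[[list[str]], str]) -> tuple[str, int]:
    # A fresh, private window: nothing from the orchestrator leaks in.
    private_context = [f"TASK: {task}"]
    private_context.extend(explore(task))
    tokens_burned = sum(count_tokens(m) for m in private_context)
    # Only the summary crosses the boundary back to the caller.
    return summarize(private_context), tokens_burned

# Hypothetical stand-ins for messy exploration and a model-written summary.
def explore(task: str) -> list[str]:
    return [f"raw page {i} " + "noise " * 500 for i in range(20)]

def summarize(messages: list[str]) -> str:
    return f"Summary of {len(messages) - 1} pages for: {messages[0]}"

orchestrator_context = ["SYSTEM: coordinate research"]
summary, tokens_burned = run_subagent("compare vendors", explore, summarize)
orchestrator_context.append(summary)
print(tokens_burned, count_tokens(summary))
```

The sub-agent reads on the order of ten thousand tokens in this toy run, and the orchestrator's window grows by a single short line.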
Present: context is a budget, not a free gift
All of this matters because context is a scarce resource. Two findings make that concrete.
The first is mechanical. A transformer makes every token attend to every other token, so the relationships it tracks grow with the square of the length. Anthropic describes the result as an attention budget that gets spread thinner the more you load. Models also see far more short sequences than long ones in training, so they have fewer specialized parameters for long-range dependencies. A bigger window does not come with proportionally bigger competence to use it.
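The square is worth doing the arithmetic on. This snippet counts token pairs under full self-attention, which is a simplification: it ignores causal masking and any efficiency tricks a given model may use, so treat it as a picture of the scaling, not a cost model.

```python
def attention_pairs(n_tokens: int) -> int:
    # Full self-attention scores every token against every token: n * n pairs.
    return n_tokens * n_tokens

ratio = attention_pairs(200_000) / attention_pairs(50_000)
print(ratio)  # 16.0: four times the length, sixteen times the pairs
```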
The second is measured. Chroma's 2025 context rot study tested eighteen frontier models and found performance degrades steadily as input grows, often well before the window is anywhere near full. A model advertised at 200,000 tokens can show real degradation by 50,000. This is not overflow. It is rot, and it kicks in early. The lesson for a context engineer is direct: the window size on the spec sheet is a ceiling, not a working budget, and stuffing it is a strategy that backfires.
Drew Breunig catalogued the specific ways a context goes wrong, and the names are now standard. Context poisoning: an error or hallucination lands in the context and every later step builds on it. Context distraction: the history grows so long the model starts parroting its own past actions instead of reasoning fresh, which Breunig saw in a Gemini agent past 100,000 tokens. Context confusion: spare tools or irrelevant text drag the model off course, and he cites a benchmark where a small model failed with 46 tools loaded but succeeded with 19. Context clash: two parts of the context contradict each other, and a Microsoft and Salesforce study found that feeding a model information in dribs and drabs rather than all at once dropped average performance by 39 percent. Each one is a context failure, not a model failure, and each is fixed by deciding what to leave out.
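Of the four, context confusion has the most mechanical fix: load fewer tools. A minimal sketch, assuming a hypothetical `select_tools` helper and made-up tool descriptions, with word overlap as a stand-in for whatever relevance measure a real system would use:

```python
def select_tools(task: str, tools: dict[str, str], limit: int) -> list[str]:
    """Return the names of at most `limit` tools whose descriptions match the task."""
    task_words = set(task.lower().split())

    def overlap(name: str) -> int:
        return len(task_words & set(tools[name].lower().split()))

    # Tools with no overlap at all are left out rather than ranked last.
    relevant = [name for name in tools if overlap(name) > 0]
    relevant.sort(key=overlap, reverse=True)
    return relevant[:limit]

# Made-up tool catalogue: name -> description.
tools = {
    "search_flights": "search for flights between two cities",
    "book_hotel": "book a hotel room in a city",
    "send_invoice": "email an invoice for a customer",
    "get_weather": "get the weather forecast for a city",
}
loaded = select_tools("find flights from Oslo to Rome", tools, limit=2)
print(loaded)
```

Only the definitions in `loaded` would go into the window for this turn; the rest stay out until a task calls for them.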
This is the direct line to reliability. Compounding error, the reason a chain of decent steps produces an unreliable whole, is in large part a context problem: a poisoned or distracted context at step three quietly corrupts steps four through twenty. Context engineering is the discipline of not letting that happen. It connects straight to context rot, the failure mode it exists to manage, and to agent memory, the write-and-select half of the discipline given a longer time horizon.
Future and impact: a real discipline, not a buzzword
It is fair to ask whether this is just the same work with a fancier label. It is not, and the tell is that the techniques are concrete, measurable, and increasingly built into the tooling. Compaction thresholds, retrieval recall numbers, KV-cache hit rates, sub-agent token budgets: these are engineering quantities, not vibes. The Manus team treat cache hit rate as their single most important production metric, because a stable context prefix is the difference between paying for cached tokens and uncached ones, a tenfold cost gap on the same model.
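The cache arithmetic can be sketched in a few lines. The prices are hypothetical units chosen to reflect the tenfold gap mentioned above, each message is assumed to cost a flat 1,000 tokens, and the prefix comparison works on whole messages, whereas real caches match on token prefixes. The shape of the result is what matters.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading messages that are identical in both contexts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def request_cost(total_tokens: int, cached_tokens: int,
                 cached_price: int, uncached_price: int) -> int:
    return cached_tokens * cached_price + (total_tokens - cached_tokens) * uncached_price

UNCACHED, CACHED = 10, 1      # hypothetical price units per token, tenfold gap
TOKENS_PER_MESSAGE = 1_000    # flat assumption for illustration

turn_1 = ["SYSTEM: rules", "USER: task", "TOOL: result A"]
# Append-only: the previous turn survives as an unchanged prefix.
turn_2_append = turn_1 + ["TOOL: result B"]
# A timestamp in the first message changes the prefix from its first token.
turn_2_stamped = ["SYSTEM: rules (time 12:00:01)"] + turn_1[1:] + ["TOOL: result B"]

stable = shared_prefix_len(turn_1, turn_2_append)
broken = shared_prefix_len(turn_1, turn_2_stamped)

total = len(turn_2_append) * TOKENS_PER_MESSAGE
cost_stable = request_cost(total, stable * TOKENS_PER_MESSAGE, CACHED, UNCACHED)
cost_broken = request_cost(total, broken * TOKENS_PER_MESSAGE, CACHED, UNCACHED)
print(cost_stable, cost_broken)
```

The two second-turn contexts carry nearly the same information, yet the one with an unstable first message pays the uncached rate on every token.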
Where it goes next is toward becoming invisible. Today a team often hand-builds its context assembly. Increasingly the agent frameworks do it: automatic compaction, built-in memory layers, retrieval that fires without being told. The skill does not vanish when that happens. It moves up a level, from writing the plumbing to designing the policy. What should this agent remember between sessions, and what should it forget? How aggressively should it compact, and what must never be summarized away? Those are judgment calls, and they are where reliability is won.
The honest risk is the one that sank prompt engineering: a useful idea gets thinned into a checklist. Already "context engineering" is showing up as a label on what is really just RAG with extra steps. The substance is not the vocabulary. It is the habit of asking, every turn, what is the smallest set of high-signal tokens that gets this step done. That question is doing the work. The phrase is just a handle for it.
For an enterprise moving agents from a promising demo to something that runs in production, that question is usually where the project is won or lost. The model is a procurement decision. The context architecture (what gets retrieved, what gets compacted, what each sub-agent sees, how memory persists) is the engineering, and it decides whether the agent is reliable enough to trust. It is the layer an implementation partner like Perform Digital tends to be brought in to build, because it is the part that does not come in the box.
Prompt engineering is not dead so much as demoted. Wording the request still helps at the margin. But the margin is small now, and the real work sits one layer out, in the deliberate design of everything the model reads. That is context engineering, and for anyone building agents it is no longer optional craft. It is the craft.
Council summary
The post argues that the lever for getting good output from a model moved from the sentence you type to the entire information state the model reads at each step of an agent loop, and that this shift is real engineering rather than a rebrand. It earns the claim with named sources and hard numbers: the term's June 2025 origin with Tobi Lütke and Andrej Karpathy, LangChain's write, select, compress, isolate framing, Chroma's context rot finding across eighteen models, Drew Breunig's four failure modes, and the cost mechanics of KV-cache hits. The reader's takeaway is a working discipline, not a label: treat the model's input as a budgeted, curated artifact, and on every turn ask what the smallest set of high-signal tokens is that gets the step done. For anyone moving an agent from demo to production, that question, and the retrieval, compaction, memory, and isolation policies that answer it, is where reliability is won.