The number on the model card keeps going up. A frontier model in 2022 read about 4,000 tokens at once, a few pages of text. By early 2024, Gemini 1.5 Pro had pushed the headline figure to a million, and Anthropic CEO Dario Amodei had said in a 2023 interview that there was no reason the context length could not reach 100 million words, roughly what a person hears in a lifetime. The pitch is intuitive. A bigger context window means the model can hold more of your codebase, more of the conversation, more of the document, so it should answer better. Paste everything in and let the model sort it out.
It does not work like that. The window size on the spec sheet is a capacity, not a competence. It tells you how many tokens the model will accept before it errors out, not how many it can reason over while staying accurate, and that second number is far smaller, often by an order of magnitude. Worse, the relationship is not flat. Accuracy does not hold steady and then fall off a cliff at the limit. It sags the whole way, and on many tasks it sags early. The field calls this context rot, and for anyone building agents it is one of the most important and least advertised facts about the tools.
Origin: a 2023 paper that found the U
The first careful evidence came from a Stanford-led team in 2023. Nelson Liu and colleagues published Lost in the Middle: How Language Models Use Long Contexts, and the design was simple enough to be hard to argue with. They gave a model a question and twenty retrieved documents, placed the one document that held the answer somewhere in the pile, then moved it around. If a model truly used its whole context, the position of the answer would not matter. The score would stay flat as the answer slid from the first slot to the last.
It did not stay flat. It traced a U. GPT-3.5-Turbo, reading twenty documents, scored about 75.8 percent with the answer in the first position. Move it to the middle and the score fell to 53.8 percent. Move it to the last position and it recovered to 63.2 percent. Strongest at the start, second strongest at the end, weakest in the middle. The shape held across Claude, MPT-30B, and LongChat, and the published version confirmed it for GPT-4 too, with higher absolute scores but the same dip.
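The position sweep is easy to reproduce in outline. The sketch below is not the paper's code; it is a minimal illustration that assumes a question, one answer-bearing document, and a list of distractors, and it builds one prompt per slot the answer could occupy. The function name and the prompt layout are hypothetical.

```python
from typing import List


def build_prompts_by_position(question: str, gold_doc: str, distractors: List[str]) -> List[str]:
    """Return one prompt per slot the answer-bearing document can occupy."""
    prompts = []
    for slot in range(len(distractors) + 1):
        # Insert the gold document at `slot`, keeping distractor order fixed.
        docs = distractors[:slot] + [gold_doc] + distractors[slot:]
        numbered = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
        prompts.append(f"{numbered}\n\nQuestion: {question}\nAnswer:")
    return prompts
```

Scoring each prompt against the same model and plotting accuracy by slot is what exposes the U. A model that used its whole context evenly would give a flat line.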
Two findings still sting. First, in the worst case the model did better answering from memory alone with no documents at all, a closed-book score of 56.1 percent, than with the answer handed to it but buried in the middle. The retrieved context was not just wasted. It was dragging the model below where it would have been with nothing. Second, the team compared standard models against their extended-context siblings, GPT-3.5-Turbo against the 16K version, Claude against the 100K version. When the input fit in both, performance was nearly identical. A longer maximum window did not buy a better ability to use it.
The shape has a name in psychology: the serial-position effect, how people recalling a list remember the first and last items and lose the middle. A transformer should be immune. Its attention can reach any token from any other in a single step, so there is no positional reason it should forget the middle. That it does anyway is a clue the problem is learned, not architectural, and the paper found supporting evidence: the U-shape appeared even in base models with no instruction tuning, and only in models above a certain size.
Present: the rot is measured, and it is everywhere
Lost in the Middle was about position. The follow-up asked a harder question: forget where the answer sits, what does the sheer length of the input do on its own? In 2025 the team at Chroma ran the most thorough test to date, Context Rot: How Increasing Input Tokens Impacts LLM Performance. They evaluated 18 frontier models, including Claude Opus 4 and Sonnet 4, GPT-4.1 and o3, Gemini 2.5 Pro and Flash, and the Qwen 3 family. Every one degraded as input length grew, on tasks that stayed trivially simple the whole time.
A few of their experiments kill the standard objections. The famous benchmark for long context is needle-in-a-haystack: hide a sentence in a long block of filler and ask the model to find it. Models ace it, and vendors quote near-perfect scores. The catch is that the classic test relies on lexical overlap. The question and the planted sentence share words, so the model is keyword matching, not reasoning. Chroma varied the semantic similarity between question and needle, and when the answer required a small inference rather than a word match, performance fell off much faster as the haystack grew. Adobe's NoLiMa benchmark made the same point by removing all surface-word overlap: at 32,000 tokens, 11 of the models tested dropped below half their short-context baseline, and even GPT-4o fell from 99.3 percent to 69.7 percent. The benchmark that made long context look solved was measuring the easy version of the problem.
Chroma's other experiments closed the remaining escape routes. Adding distractors, passages that look relevant but are not, hurt with one and compounded with four. On a conversational memory benchmark, LongMemEval, they compared a focused input of around 300 tokens holding just the relevant turns against a full input of around 113,000 tokens holding the whole history. Every model family did worse on the full input, despite it containing the same answer. And models often did worse when the haystack was a coherent, logically ordered document than when it was shuffled text. The structure a human would find easier made the model worse.
The marketing gap follows directly. A 2026 round of needle-in-a-haystack testing found that on a multi-needle task, the kind that resembles real work because it requires combining several facts, scores fell 30 to 60 points between 200,000 and one million tokens for three of the four leading frontier models. Single-needle scores stayed high and look great in a launch post. The honest multi-needle scores did not. The window over which retrieval and reasoning genuinely hold up, the effective context, is much shorter than the advertised one, and the spec sheet does not tell you where it ends.
The mechanism is not mysterious. Anthropic, in its guide to effective context engineering, frames it as an attention budget. A transformer makes every token attend to every other token, so n tokens produce on the order of n-squared pairwise relationships. Double the context and those relationships roughly quadruple, while the parameters available to manage them stay fixed. The attention gets, in Anthropic's words, stretched thin. Training compounds it: models see far more short sequences than long ones, so they have less practice using the far end of a long window. Anthropic's blunt conclusion is that context is a finite resource with diminishing marginal returns. More is not free, and past a point more is negative.
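The quadratic arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch, not a model of any real implementation: it counts ordered token pairs under full self-attention. A causal decoder attends only backward, which roughly halves the count but leaves the growth quadratic.

```python
def pairwise_relationships(n_tokens: int) -> int:
    """Count ordered token pairs under full self-attention (n * n)."""
    return n_tokens * n_tokens


short = pairwise_relationships(4_000)
doubled = pairwise_relationships(8_000)
print(doubled / short)  # 4.0: doubling the context quadruples the pairs
```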
Present: the four ways a context fails in a live agent
Context rot is the gradual version, the slow sag as the window fills. In a running agent the failures are often sharper, and the writer Drew Breunig gave them names the field has adopted. His piece How Long Contexts Fail lists four modes worth memorizing.
Context poisoning is when an error or hallucination lands in the context and gets referenced again and again. Google DeepMind's report on its Gemini agent playing Pokemon described exactly this: when the agent's running list of goals got corrupted with a hallucinated objective, the model fixated on chasing something impossible, because the bad goal was now part of the context it trusted.
Context distraction is when the accumulated history grows so large the model leans on it instead of reasoning fresh. The same Gemini agent, past roughly 100,000 tokens of history, started repeating actions from its log rather than forming new plans, despite a window many times larger. A Databricks study found a related ceiling: in long-context RAG, Llama 3.1 405B's accuracy began falling around 32,000 tokens and GPT-4 around 64,000, well short of the limit.
Context confusion is when irrelevant material drags the model off course, because it feels obliged to use what it was given. The clearest case is tools. Breunig cites a benchmark where a quantized model failed when given all 46 available tools but succeeded when the set was cut to 19, even though all 46 fit comfortably inside the window. Nothing overflowed. The model was confused by the surplus.
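One hedge against tool surplus is to pass the model only the tools that look relevant to the request. The sketch below is a deliberately crude illustration that assumes each tool has a plain-text description and scores it by word overlap with the query. The function name, the tool dictionary, and the scoring rule are all hypothetical, and a real system would more likely use embedding search over the descriptions.

```python
from typing import Dict, List


def select_tools(query: str, tools: Dict[str, str], max_tools: int = 5) -> List[str]:
    """Return up to max_tools tool names whose descriptions overlap the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, description in tools.items():
        overlap = len(query_words & set(description.lower().split()))
        if overlap > 0:  # drop tools with no apparent relevance
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_tools]]
```

The point is the shape of the step, not the scoring: the model sees a short, relevant tool list instead of the full catalog.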
Context clash is when two parts of the context contradict each other. A Microsoft and Salesforce team, in LLMs Get Lost in Multi-Turn Conversation, fed benchmark problems to models in pieces across multiple turns, the way a real conversation arrives, rather than all at once. Average performance dropped 39 percent across six tasks. Models committed to an answer early, on incomplete information, then could not recover when the rest arrived and conflicted with the guess.
The thread connecting all four is that irrelevant or wrong context does not sit there harmlessly. The intuition that more information cannot hurt, that the model will just ignore what it does not need, is wrong. The model attends to everything in the window. Junk is not neutral ballast. It is an active drag on reasoning, and in an agent loop it compounds: a poisoned or distracted context at step three quietly corrupts steps four through twenty. That is the mechanical link between context rot and the compounding error that makes long agent runs unreliable. The GDELT project showed how wide the gap gets in practice: asked to find every mention of a name in real broadcast transcripts, Gemini 1.5 Pro, advertised at a million tokens, found 5 of 26 in one run, about 19 percent.
Future and impact: the practical response is to use less window
If a bigger window does not fix this, what does? The answer the field has converged on is counterintuitive only until you internalize the rot: do not try to fill the window, fill it well. This is the discipline of context engineering, and against context rot three moves do most of the work.
Retrieve less, but better. Dumping fifty documents into the prompt and letting the model find the answer is precisely what context rot punishes. Lost in the Middle found that reader accuracy saturated long before retriever recall did: past about 20 documents, adding more barely moved the score while steadily adding latency, cost, and noise. The better pattern is a tight set of high-relevance chunks, with the best ones ranked toward the start of the context where attention is strongest. Sharper retrieval beats more retrieval, which is why retrieval-augmented generation keeps mattering as windows grow. RAG is not a workaround for small windows. It is how you keep a large window clean.
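A minimal sketch of that pattern, assuming a retriever has already attached a relevance score to each chunk. The function name, the default cutoff of five chunks, and the score threshold are illustrative assumptions, not values from any of the studies cited here.

```python
from typing import List, Tuple


def select_chunks(
    scored_chunks: List[Tuple[float, str]],
    k: int = 5,
    min_score: float = 0.0,
) -> List[str]:
    """Keep the k best chunks above min_score, best first."""
    kept = [(score, chunk) for score, chunk in scored_chunks if score >= min_score]
    # Highest score first, so the strongest evidence sits at the start of the context.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in kept[:k]]
```

For example, `select_chunks([(0.2, "c"), (0.9, "a"), (0.5, "b")], k=2)` returns `["a", "b"]`: the weakest chunk is dropped and the best one leads.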
Compact. A long-running agent will fill its window if you let it. Before it does, summarize the history into a tight brief and continue from that, keeping the decisions and open problems and discarding stale tool output. On a loop that runs for hours, that is the difference between an agent that holds its thread and one that drowns in its own logs.
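Compaction can be sketched as a check that runs before each model call. Everything here is an assumption for illustration: `count_tokens` is a crude whitespace estimate standing in for a real tokenizer, and `summarize` is a hypothetical callable that in practice would be a model call asked to preserve decisions and open problems.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude whitespace estimate; a stand-in for a real tokenizer."""
    return len(text.split())


def compact_history(
    messages: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 4,
) -> List[str]:
    """Replace older messages with one summary once the history exceeds budget."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages  # still within budget, or nothing old enough to fold
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # One brief replaces the old turns; recent turns stay verbatim.
    return [summarize(old)] + recent
```

Keeping the last few turns verbatim is a design choice in this sketch: the most recent exchange usually carries detail a summary would blur.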
Isolate. Give a focused subtask its own clean context. A sub-agent gets a narrow job and a fresh window, does the messy exploration, burns its tokens, and returns a short summary to the orchestrator, which never sees the mess. Practitioners report the payoff: when teams stopped chasing raw capability and started protecting clean contexts, agents that fell apart by turn 20 stayed coherent past turn 50.
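The isolation pattern can be sketched without any model at all. In this illustration, `explore` is a hypothetical callable standing in for a sub-agent loop: it receives a fresh context list, may fill it with as much intermediate output as it likes, and returns a result string. Truncating that result to a word limit is a crude stand-in for a real summary step.

```python
from typing import Callable, List


def run_subtask(
    task: str,
    explore: Callable[[List[str]], str],
    max_summary_words: int = 50,
) -> str:
    """Run one subtask in a fresh context and return only a short result."""
    context: List[str] = [task]  # fresh window: nothing inherited from the caller
    result = explore(context)    # the sub-agent may grow `context` freely
    # Only the trimmed result leaves this function; `context` is discarded.
    return " ".join(result.split()[:max_summary_words])


def orchestrate(tasks: List[str], explore: Callable[[List[str]], str]) -> List[str]:
    """Collect one short result per subtask into the orchestrator's context."""
    orchestrator_context: List[str] = []
    for task in tasks:
        orchestrator_context.append(run_subtask(task, explore))
    return orchestrator_context
```

However much a sub-agent writes into its own context, the orchestrator's context grows by one bounded entry per subtask.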
Whether the rot is permanent is the honest open question. It is rooted in the quadratic cost of attention and in training that favors short sequences, and both could shift with new architectures. Some 2026 reasoning models already degrade more gently than their predecessors. But the gap between the advertised window and the usable one has not closed, and treating the spec-sheet number as a working budget remains a reliable way to ship an unreliable agent.
This is often where an enterprise agent breaks. The demo had a short, curated context. Production accumulates noise across long sessions, the window fills, the rot sets in, and the agent that worked last week fails in ways nobody can reproduce. The fix is not a bigger window. It is the unglamorous engineering of what gets retrieved, what gets compacted, and what each sub-agent sees. Designing the context rather than maximizing it is a large part of what an implementation partner like Perform Digital is brought in to get right.
The takeaway is plain. The context window is a budget, not a free shelf. Every token you add spends a little of the model's attention, and tokens that do not earn their place are not harmless; they are a cost. The teams shipping agents that hold up are not the ones with the largest window. They are the ones who treat the window as scarce and put the fewest, best tokens in it.
Council summary
This post argues that a context window's advertised size is a capacity limit, not a measure of how much the model can actually reason over, and that accuracy degrades steadily as the window fills rather than holding firm until the limit. It traces the evidence from the 2023 Lost in the Middle paper, through Chroma's 18-model context rot study and the NoLiMa benchmark that stripped away the lexical-overlap shortcut, to the four named failure modes that break agents in production: poisoning, distraction, confusion, and clash. The reader's takeaway is concrete and immediately actionable: treat the window as a scarce budget, retrieve fewer and better chunks, compact long histories, and isolate subtasks in their own clean contexts. The teams shipping reliable agents are not the ones with the largest window but the ones who put the fewest, best tokens in it. The council judged the claims sound and well sourced, the origin-present-future arc complete, and the prose clean of style traps.