Someone analyzed 500 Reddit complaints about AI tools across four subreddits over three months, expecting hallucination to top the list. It did not. The top frustration, named in 34 percent of the complaints, was a single sentence: "it forgets everything every conversation." A separate 19 percent made the same complaint from the other side: "I have to re-explain my context every time." Hallucination did not crack the top five. What people actually cannot stand is that the tool does not know them.

That is a memory problem, and it is the difference between a thing you demo once and a thing you use every day. A model that forgets you is a clever party trick. A system that remembers your project, your preferences, and the decision you made last Tuesday is a product. This post is about the engineering in the gap.

Origin: a stateless model has no memory, and the context window is not one

Start with an uncomfortable fact about how a language model works. The model itself is stateless. It is a fixed set of weights, frozen at the end of training, and it does not change when you talk to it. Run the same prompt twice and you get two independent answers from a function that learned nothing in between. The model that finishes your sentence is the exact same one that started the session, and the same one the next user gets.

So where does the apparent memory in a chat come from? The context window. Everything you have said this session, plus everything the model replied, gets concatenated and fed back in on every turn. The model looks like it remembers the start of the conversation because the start of the conversation is physically still in front of it, pasted into the input. That is not memory. It is short-term recall with a hard ceiling and a delete key.
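
A minimal sketch of that mechanism, assuming a hypothetical `fake_model` function in place of a real LLM. The only "memory" is the transcript list held by the session object, which is re-sent in full on every turn and disappears when a new session starts.

```python
def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: a pure function of its input text.
    text = prompt.lower()
    if "what is my name" in text:
        if "my name is ada" in text:
            return "Your name is Ada."
        return "I don't know your name."
    return "Noted."


class ChatSession:
    def __init__(self, model):
        self.model = model
        self.transcript = []  # the "context window": lives only in this object

    def send(self, user_message: str) -> str:
        self.transcript.append(f"User: {user_message}")
        prompt = "\n".join(self.transcript)  # whole history is re-sent each turn
        reply = self.model(prompt)
        self.transcript.append(f"Assistant: {reply}")
        return reply


first = ChatSession(fake_model)
first.send("My name is Ada.")
same_session = first.send("What is my name?")   # history is still in the prompt

second = ChatSession(fake_model)                 # new chat, empty transcript
new_session = second.send("What is my name?")
```

The model function is identical in both sessions. Only the text pasted into the prompt differs.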

Two things make the context window a poor substitute for real memory. The first is that it ends. Close the tab, start a new chat, and the window is empty: the information was never stored anywhere, it was sitting in a buffer that got cleared. The second is that even within a session, a bigger window does not rescue you. Recall sags as the window fills, and a model attends most to the start and the end and loses the middle, a failure we cover in context rot. You cannot fix forgetting by buying a longer context. The window is working memory, meant to be small and temporary. The part that is supposed to persist has to be built.

The clearest early statement of how to build it came in October 2023 from a UC Berkeley team. Their paper, MemGPT, made an analogy the field never let go of. An operating system gives a program the illusion of vast memory by paging data between fast RAM and slow disk. Do the same for a language model: treat the context window as RAM, an external store as disk, and let the model itself page facts in and out by calling functions. On the paper's Deep Memory Retrieval test, that paging took accuracy from 35 percent to 93 percent, measured against a recursive-summarization baseline on the same model. The model did not get smarter. It got a memory system bolted to the side of it.
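
The paging analogy can be sketched in a few lines. This is an illustration of the idea only, not the MemGPT implementation: the `PagedMemory` class, its method names, and the keyword search are all assumptions, and in the paper the model itself decides when to call such functions.

```python
from collections import deque


class PagedMemory:
    def __init__(self, context_limit: int):
        self.context = deque()   # "RAM": the messages the model can see
        self.archive = []        # "disk": external store outside the window
        self.context_limit = context_limit

    def append(self, message: str) -> None:
        self.context.append(message)
        while len(self.context) > self.context_limit:
            # Page out the oldest message instead of losing it.
            self.archive.append(self.context.popleft())

    def page_in(self, keyword: str) -> list[str]:
        # Hypothetical retrieval function: naive substring search.
        hits = [m for m in self.archive if keyword.lower() in m.lower()]
        for hit in hits:
            self.append(f"[recalled] {hit}")
        return hits


mem = PagedMemory(context_limit=3)
for msg in ["deploy target is eu-west-1", "a", "b", "c"]:
    mem.append(msg)

visible_before = list(mem.context)      # the deploy fact has been paged out
hits = mem.page_in("deploy")            # and is now paged back in
visible_after = list(mem.context)
```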

Origin: the four kinds of memory an agent actually needs

Memory is not one thing. The useful taxonomy comes from a 2023 Princeton paper, Cognitive Architectures for Language Agents, known as CoALA, which mapped four categories from cognitive science onto agents. The split is worth holding in your head, because most memory bugs are a confusion between two of these.

Working memory is the live context window. It holds the current task: the system prompt, the conversation so far, the documents just retrieved, the last tool output. It is what the model reasons over directly, and it is wiped at the end of the session. Everything else has to be retrieved into this space to matter.

Semantic memory is facts. Your name, that you work in euros not dollars, that your fiscal year starts in April, that you prefer terse answers. These are stable, not tied to a particular moment, and they are the things a user is most insulted to repeat. When someone says "the AI does not remember me," they usually mean its semantic memory is empty.

Episodic memory is events. Not the fact that you prefer terse answers, but the specific conversation last Thursday where you debugged the billing webhook and decided to retry on a 500 but not a 400. It keeps the narrative and the timestamps, and it is what lets an agent say "we tried that in March and it failed" rather than cheerfully suggesting it again.

Procedural memory is how to do things: the agent's skills, its tool-use patterns, the workflow for a kind of request. For most agents today this lives in the system prompt and the tool definitions rather than a database, which is why it is the memory teams think about least and version worst.

A single real query touches several at once. Ask an agent, "What was our Q4 win rate by segment?" Procedural memory routes the request to the right dataset, semantic memory supplies your certified definition of win rate, episodic memory checks whether you flagged that number as unreliable last week, and working memory assembles the lot for the model. Drop any one and the answer degrades in a different way.
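
One way to picture the four types is as a data structure. The sketch below is an assumption-laden illustration, not CoALA's formalism: the class names, the substring matching, and the sample facts (including the win-rate formula) are invented for the example. The three long-term stores are fields, and working memory is the string assembled from them for one request.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    when: str       # free-text timestamp, kept simple for the sketch
    summary: str


@dataclass
class AgentMemory:
    semantic: dict[str, str] = field(default_factory=dict)      # stable facts
    episodic: list[Episode] = field(default_factory=list)       # dated events
    procedural: dict[str, str] = field(default_factory=dict)    # request kind -> routine

    def build_working_memory(self, request_kind: str, question: str, topic: str) -> str:
        # Working memory is assembled per request and thrown away afterwards.
        lines = [f"Procedure: {self.procedural.get(request_kind, 'none on file')}"]
        lines += [f"Fact: {k} = {v}" for k, v in self.semantic.items() if topic in k]
        lines += [f"Episode ({e.when}): {e.summary}"
                  for e in self.episodic if topic in e.summary]
        lines.append(f"Question: {question}")
        return "\n".join(lines)


memory = AgentMemory(
    semantic={"win rate definition": "closed-won / (closed-won + closed-lost)",
              "currency": "EUR"},
    episodic=[Episode("last week", "User flagged Q4 win rate as unreliable")],
    procedural={"metric_lookup": "query the certified sales dataset"},
)
context = memory.build_working_memory(
    "metric_lookup", "What was our Q4 win rate by segment?", "win rate")
```

Emptying any one of the three stores changes a different line of `context`, which is the sense in which each missing memory type degrades the answer differently.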

Present: how memory is actually built

Strip away the branding and every memory system runs the same three-stage loop: extraction, storage, retrieval.

Extraction decides what is worth keeping. The agent does not save the raw transcript; that just recreates the context-rot problem in slow motion. Instead, a model reads each exchange and pulls out the durable facts, turning "actually we moved off Postgres to ClickHouse last month" into a clean statement and discarding the conversational packaging. This step is itself an LLM call, and it is where quality is won or lost, because a system that extracts noise will faithfully remember noise.
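
A rough sketch of the extraction step, assuming the model is reachable as a plain `llm(prompt) -> str` callable that returns a JSON array. The prompt wording, the `extract_facts` helper, and the canned `stub_llm` are all hypothetical. A real system would call an actual model and validate its output more strictly.

```python
import json
from typing import Callable

# Hypothetical prompt; real systems tune this heavily.
EXTRACTION_PROMPT = (
    "Read the exchange below. Return a JSON array of short, standalone, "
    "durable facts about the user or their project. Return [] if there are none.\n\n"
)


def extract_facts(exchange: str, llm: Callable[[str], str]) -> list[str]:
    raw = llm(EXTRACTION_PROMPT + exchange)
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError:
        return []                      # unparseable output is treated as "nothing to keep"
    if not isinstance(facts, list):
        return []
    return [f.strip() for f in facts if isinstance(f, str) and f.strip()]


def stub_llm(prompt: str) -> str:
    # Canned response standing in for a real model call.
    if "ClickHouse" in prompt:
        return json.dumps(["The team migrated from Postgres to ClickHouse."])
    return "[]"


kept = extract_facts(
    "User: actually we moved off Postgres to ClickHouse last month", stub_llm)
nothing = extract_facts("User: thanks, talk later", stub_llm)
```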

Storage puts those facts somewhere they survive a closed tab. Usually this means a vector store such as Pinecone, Weaviate, Qdrant, Chroma, or Milvus, where each fact becomes an embedding, a list of numbers that places meaning as a position in space so similar meanings land near each other. Some systems add a knowledge graph alongside, so relationships between entities, not just text similarity, are queryable.
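
To make "a list of numbers" concrete, here is a toy sketch that embeds each fact and writes it to a JSON file so it outlives the process. The hashed bag-of-words `embed` function is an assumption used only for illustration. It captures word overlap, not meaning, and a real system would use a learned embedding model and one of the vector stores named above.

```python
import hashlib
import json
import math
import os
import tempfile

DIM = 64  # toy dimensionality; real embeddings are far larger


def embed(text: str) -> list[float]:
    """Toy embedding: hash each word into one of DIM buckets, then L2-normalize."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def load_facts(path: str) -> list[dict]:
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_fact(path: str, fact: str) -> None:
    rows = load_facts(path)
    rows.append({"text": fact, "vector": embed(fact)})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f)


store_path = os.path.join(tempfile.mkdtemp(), "memory.json")
save_fact(store_path, "User reports in euros, not dollars.")
save_fact(store_path, "Fiscal year starts in April.")
reloaded = load_facts(store_path)   # what a later session would read back
```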

Retrieval brings the right memory back at the right moment. When a new message arrives, the system embeds it, searches the store for the nearest stored memories, and pastes the top few into the context window before the model answers. This is RAG pointed at the user's own history instead of a document corpus. Memory and retrieval are the same engineering problem on different time horizons: memory is just retrieval where the corpus is everything you have ever said.
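
The retrieval step can be sketched as a nearest-neighbor search followed by prompt assembly. The word-count vectors below are a stand-in for real embeddings, and the function names and prompt layout are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: sparse word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, memories: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    scored = [(cosine(q, embed(m)), m) for m in memories]
    scored = [pair for pair in scored if pair[0] > 0]      # drop unrelated memories
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


def build_prompt(query: str, memories: list[str], k: int = 3) -> str:
    recalled = retrieve(query, memories, k)
    memory_lines = "\n".join(f"- {m}" for m in recalled) or "- (none)"
    return f"Relevant memories:\n{memory_lines}\n\nUser: {query}"


stored = [
    "The team migrated from Postgres to ClickHouse.",
    "User prefers terse answers.",
    "Fiscal year starts in April.",
]
question = "Which database does the team use now?"
recalled = retrieve(question, stored)
prompt = build_prompt(question, stored)
```

Swap `stored` for a document corpus and the same code is ordinary RAG, which is the point of the paragraph above.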

The reason to do this dance, rather than dumping the whole history into a long context window, is cost and accuracy together. The most-cited public benchmark here is LOCOMO, 1,540 questions across multi-session conversations. Mem0's April 2025 paper reported its memory layer beating a full-context approach with 91 percent lower p95 latency and over 90 percent fewer tokens, while scoring 26 percent higher than OpenAI's built-in memory on an LLM-judged metric. Those are the vendor's own numbers, but the direction is not controversial: stuffing everything in is not just expensive, it is worse.

Present: a bolt-on layer versus a runtime that edits its own memory

Two design philosophies split the field, and the choice shapes your architecture.

The first is the bolt-on memory layer. Mem0 is the most adopted example, past 56,000 GitHub stars and integrated with 21 frameworks. You keep whatever framework you already use, LangGraph or CrewAI or your own loop, and call Mem0 as a component: hand it a conversation with add(), ask for relevant memories with search(), and it runs extraction, storage, and retrieval behind those two calls. The appeal is low commitment: memory becomes a service, and removing it leaves your agent untouched. The limit is that a layer called from outside cannot judge importance the way the agent itself could; it extracts on a fixed pipeline. Mem0's April 2026 algorithm fuses semantic similarity, keyword matching, and entity matching into one ranking rather than relying on vector similarity alone. The company reports it scoring 92.5 on LOCOMO and 94.4 on LongMemEval, again a vendor figure.
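
To show what fusing several signals into one ranking can look like, here is a generic weighted-sum sketch. It is not Mem0's algorithm or API: the `Candidate` type, the scoring helpers, the weights, and the sample data are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic: float              # similarity from a vector search, assumed in [0, 1]
    entities: frozenset[str]     # entities tagged on the memory at write time


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, text: str) -> float:
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0


def entity_score(query_entities: frozenset[str], cand: frozenset[str]) -> float:
    return len(query_entities & cand) / len(query_entities) if query_entities else 0.0


def fused_rank(query, query_entities, candidates, weights=(0.5, 0.3, 0.2)):
    w_sem, w_kw, w_ent = weights   # illustrative weights, not tuned values

    def score(c: Candidate) -> float:
        return (w_sem * c.semantic
                + w_kw * keyword_score(query, c.text)
                + w_ent * entity_score(query_entities, c.entities))

    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("Globex asked about billing retries and invoices.", 0.70, frozenset({"Globex"})),
    Candidate("Acme retries billing webhooks on 500 but not 400.", 0.62, frozenset({"Acme"})),
]
ranked = fused_rank("What did Acme decide about billing retries?",
                    frozenset({"Acme"}), candidates)
```

In this made-up case the Globex memory wins on vector similarity alone, while the entity match pulls the Acme memory to the top of the fused ranking.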

The second philosophy is the agent runtime with self-editing memory. Letta, built by the MemGPT authors and formerly named after that paper, is not a component you call but the runtime your agent runs inside. Its core idea, memory blocks, gives the context window a labeled layout: a block for the agent's persona, a block for what it knows about the human, each with a character budget. The agent edits those blocks itself, mid-conversation, by calling memory tools. Below them sit recall memory for searchable conversation history and archival memory for cold storage. The trade is the mirror image of Mem0's. Memory quality now depends on the model's own judgment and spends inference tokens, and the runtime is harder to walk away from because your agent is built into it. What you get back is an agent that curates what it knows rather than only piling it up.
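
The memory-block idea can be sketched as labeled strings with a character budget plus two editing tools the agent may call. This is a hypothetical illustration, not Letta's API: the class names, tool names, and rendering format are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    label: str
    limit: int          # character budget for this block
    value: str = ""


class BlockMemory:
    def __init__(self, blocks: list[MemoryBlock]):
        self.blocks = {b.label: b for b in blocks}

    # The two methods below are the "memory tools" the agent would call.
    def append_to_block(self, label: str, text: str) -> str:
        block = self.blocks[label]
        candidate = (block.value + "\n" + text).strip()
        if len(candidate) > block.limit:
            return f"error: block '{label}' would exceed {block.limit} characters"
        block.value = candidate
        return "ok"

    def replace_in_block(self, label: str, old: str, new: str) -> str:
        block = self.blocks[label]
        if old not in block.value:
            return "error: text not found"
        candidate = block.value.replace(old, new)
        if len(candidate) > block.limit:
            return f"error: block '{label}' would exceed {block.limit} characters"
        block.value = candidate
        return "ok"

    def render(self) -> str:
        # This string is what would be placed in the context window.
        return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>"
                         for b in self.blocks.values())


memory = BlockMemory([
    MemoryBlock("persona", 200, "I am a terse assistant."),
    MemoryBlock("human", 60),
])
r1 = memory.append_to_block("human", "Works in euros.")
r2 = memory.replace_in_block("human", "euros", "dollars")
r3 = memory.append_to_block("human", "x" * 100)   # over budget, rejected
```

The budget is what forces curation: when a block is full, the agent has to rewrite or drop something before it can add more.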

The honest summary: pick the layer when you have an agent and want to add memory, pick the runtime when memory is the center of what you are building. A third angle treats memory as a graph. Zep builds a temporal knowledge graph through its Graphiti engine, tracking how facts change over time; its paper reports an 18.5 percent accuracy gain over a full-context baseline on LongMemEval with 90 percent lower latency. Consumer tools have converged here too: ChatGPT memory keeps an explicit, editable list of saved memories plus an implicit layer that draws on past chats.
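
Tracking how facts change over time can be sketched with edges that carry a validity interval. This is a simplified illustration, not Graphiti's data model or API: the `Edge` and `TemporalGraph` names are assumptions, each relation is treated as single-valued, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None   # None means still believed true


class TemporalGraph:
    def __init__(self):
        self.edges: list[Edge] = []

    def assert_fact(self, subject: str, relation: str, obj: str, as_of: date) -> None:
        for e in self.edges:
            if e.subject == subject and e.relation == relation and e.valid_to is None:
                e.valid_to = as_of    # close the old edge but keep it as history
        self.edges.append(Edge(subject, relation, obj, as_of))

    def query(self, subject: str, relation: str, on: date) -> Optional[str]:
        for e in self.edges:
            if (e.subject == subject and e.relation == relation
                    and e.valid_from <= on
                    and (e.valid_to is None or on < e.valid_to)):
                return e.obj
        return None


graph = TemporalGraph()
graph.assert_fact("user", "works_at", "Acme", date(2024, 1, 1))
graph.assert_fact("user", "works_at", "Globex", date(2025, 3, 1))

then = graph.query("user", "works_at", date(2024, 6, 1))
now = graph.query("user", "works_at", date(2025, 6, 1))
```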

Present: the hard problems nobody has fully solved

Storing facts is easy. Running a store that stays useful for a year is not, and the open problems are specific.

What to remember. Extract too little and the agent stays forgetful. Extract too much and the store fills with trivia that drowns the signal at retrieval. There is no clean rule for which offhand remark is load-bearing.

What to forget. This is the competency at which current tools fail most conspicuously. A store with no expiration or consolidation policy grows into a haystack. Human memory works precisely because it lets unimportant things decay; most agent memory only appends.
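
One possible decay policy, sketched under stated assumptions: each memory carries an importance score assigned at extraction time, its retention halves every `half_life_days` since it was last used, and anything under a threshold is pruned. The half-life, threshold, and scores are illustrative, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float      # assumed to be set at extraction time, in [0, 1]
    last_used_day: int     # day index of the last retrieval


def retention(memory: Memory, today: int, half_life_days: float = 30.0) -> float:
    age = today - memory.last_used_day
    return memory.importance * 0.5 ** (age / half_life_days)


def prune(memories: list[Memory], today: int, threshold: float = 0.1) -> list[Memory]:
    # Keep only memories whose decayed score is still above the threshold.
    return [m for m in memories if retention(m, today) >= threshold]


memories = [
    Memory("Fiscal year starts in April.", importance=0.9, last_used_day=0),
    Memory("User said the weather was nice.", importance=0.2, last_used_day=0),
]
kept = prune(memories, today=60)   # two half-lives later
```

Updating `last_used_day` on every retrieval would let frequently used memories survive while untouched trivia fades.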

Stale memory. A fact can be true when stored and false later. The Mem0 team's own example: a memory of a user's employer is accurate until they change jobs, then becomes confidently wrong and keeps getting retrieved. Mem0 calls temporal reasoning its hardest category, and on the BEAM benchmark its reported score falls by roughly a quarter, 64.1 to 48.6, as the history scales from one million tokens to ten million.

Conflicting memory. New information that contradicts old information forces a decision the system often cannot make: which one is true? Append both and an agent can surface dozens of contradictory beliefs about the same user. Pure semantic similarity makes this worse, because a memory from five minutes ago and a contradicting one from five weeks ago look identical to a cosine-distance function, even though recency should break the tie.
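
A small sketch of letting recency break the tie. The blend below is one assumed scoring rule among many: similarity is mixed with an exponential recency term, and the `recency_share` and half-life values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    similarity: float    # cosine similarity to the query
    age_days: float


def recency_weighted(hit: Hit, half_life_days: float = 14.0,
                     recency_share: float = 0.3) -> float:
    recency = 0.5 ** (hit.age_days / half_life_days)   # 1.0 when brand new
    return (1 - recency_share) * hit.similarity + recency_share * recency


hits = [
    Hit("User works at Acme.", similarity=0.91, age_days=35.0),    # five weeks old
    Hit("User works at Globex.", similarity=0.91, age_days=0.0),   # minutes old
]

# Similarity alone cannot separate the two; the order is whatever it was on input.
similarity_only = sorted(hits, key=lambda h: h.similarity, reverse=True)
with_recency = sorted(hits, key=recency_weighted, reverse=True)
```

This ranks the newer belief first but still returns both, so it reduces the conflict problem without deciding which fact is true.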

These are not edge cases. They are the daily reality of a memory store in production, and the reason memory is still a research area rather than a solved feature.

Future and impact: memory that maintains itself

The clearest direction is letting agents tend their own memory during downtime. Letta calls it sleep-time compute: while no user is waiting, a background agent reorganizes what the primary agent knows, reconciles contradictions, and promotes raw episodes into clean facts. This is the reflection idea from the 2023 Generative Agents work turned into infrastructure. The bet is that staleness and conflict get solved not at write time but in a continuous background pass, the way human memory consolidates in sleep.
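
A background consolidation pass might look roughly like the sketch below. It is not Letta's implementation: the `Record` type, the keyed layout, and the newest-wins rule are assumptions. A real pass would more likely ask a model to judge contradictions than compare strings.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str           # what the memory is about, e.g. "user.employer"
    value: str
    stored_day: int


def consolidate(records: list[Record]) -> tuple[list[Record], list[Record]]:
    """Keep the newest record per key; return (current, superseded)."""
    latest: dict[str, Record] = {}
    superseded: list[Record] = []
    for rec in sorted(records, key=lambda r: r.stored_day):
        previous = latest.get(rec.key)
        if previous is not None and previous.value != rec.value:
            superseded.append(previous)   # contradiction: the older belief is retired
        latest[rec.key] = rec             # exact duplicates simply collapse
    return list(latest.values()), superseded


raw = [
    Record("user.employer", "Acme", 1),
    Record("user.currency", "EUR", 2),
    Record("user.employer", "Globex", 40),
    Record("user.currency", "EUR", 50),
]
current, retired = consolidate(raw)   # would run while no user is waiting
```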

Expect memory to stop being a separate product. Vector databases are adding memory APIs, agent frameworks are shipping built-in memory, and consumer assistants treat it as a default. The standalone memory layer may end up where standalone vector search did: absorbed into the platforms, important but no longer a category you shop for.

Be clear-eyed about the risk. A memory that records a wrong fact will repeat it with total confidence forever, and one that absorbs a malicious instruction from a poisoned web page can carry that injection into every future session. Persistent memory turns a one-shot prompt injection into a standing compromise: it makes an agent more useful and a successful attack more durable in the same move.

One domain makes the cost of forgetting unusually concrete: a coding agent that re-derives the same codebase from scratch every session. A practical look at what that actually breaks, and what persisting it changes, is in agent memory that outlives the session.

For an enterprise, this is the line between a pilot that impresses and a system people rely on. An agent that cannot remember a customer across two calls is a demo. The memory architecture, what gets extracted, stored, retrieved, and forgotten, is what makes an agent feel like it knows your business. It is unglamorous, it has no clean solution yet, and it is where Perform Digital usually finds the distance between a toy and a product.

Council summary

This post argues that memory, not raw model quality, is what turns an agent demo into something people use daily, and backs the claim with engineering rather than slogans. It is honest where it counts: a model is stateless, the context window is working memory and not storage, and every memory system reduces to the same extract, store, retrieve loop borrowed from RAG. The Mem0-versus-Letta split is the useful core, a bolt-on layer you call against a runtime your agent lives inside, and the post rightly refuses to crown a winner because the answer depends on whether memory is a feature or the product. The reader leaves knowing the four memory types, why selective forgetting and stale facts remain unsolved, and that persistent memory upgrades a one-shot prompt injection into a standing compromise. The takeaway for an enterprise buyer: ask how an agent forgets, not only how it remembers.

Agent Memory: The Feature That Separates Toys From Products
