Retrieval-Augmented Generation: From Basic RAG to Agentic

Retrieval-augmented generation bolts live knowledge onto a frozen model at query time. It has grown from a simple lookup into a small reasoning system.

Ask a language model what your company's refund window is, and it will answer with total confidence. It will also be guessing. The model was trained on a fixed slice of text that stopped at some date, and your refund policy was almost certainly not in it. It has never seen your wiki, your contracts, or last Tuesday's pricing change. So it does what these models do when knowledge runs out: it produces something fluent, specific, plausible, and possibly wrong.

That gap is the whole reason retrieval-augmented generation exists. A trained model is frozen. Its knowledge is baked into its weights and it cannot be topped up without retraining, which is slow and expensive. RAG is the trick that gets around the freeze. Instead of hoping the answer is somewhere in the model's parameters, you fetch the relevant text at the moment of the question and hand it to the model along with the question. The model stops answering from memory and starts answering from a source you control. This post traces how that idea works, where the simple version breaks, and how it grew into something that looks less like a lookup and more like a junior researcher.

Origin: parametric memory was not enough

The term comes from a 2020 paper out of Facebook AI Research, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, led by Patrick Lewis. The framing in that paper still holds up. A language model has parametric memory: facts smeared across billions of weights, learned during training, impossible to inspect or edit cleanly. The authors paired it with non-parametric memory, an external store you can swap, update, and point at. In their setup the store was a dense vector index of all of Wikipedia, reached through a neural retriever, with a BART generator writing the final answer.

The paper named the two problems this solved, and they are the same two problems enterprises hit five years later. One, a model cannot easily update what it knows. Two, it cannot show its work, so you cannot tell whether an answer came from real knowledge or from a confident invention. Tying the model to an external index fixes both. You update the index, not the model. You can see which passage drove a sentence. The 2020 results were already strong: the retrieval-augmented model produced more specific and more factual language than a model working from parameters alone.

For two years this sat mostly in research. Then ChatGPT made the knowledge-cutoff problem impossible to ignore. Every team that wanted a model to answer questions about their own documents arrived at the same shape, and RAG went from a paper to the default architecture for grounding a model in private or current data.

Present: naive RAG, and where it cracks

The standard build, what the literature calls naive RAG, has two phases. The first happens once, ahead of time. You take your documents, cut them into chunks of a few hundred words, and pass each chunk through an embedding model, which turns it into a vector, a list of numbers arranged so that texts with similar meaning land close together. You store those vectors in a vector database. The second phase happens at every question. You embed the question the same way, search the database for the chunks whose vectors sit closest, paste the top few into the prompt above the question, and let the model generate an answer from that context. Chunk, embed, store, retrieve, stuff, generate. It is a genuinely good baseline, and you can stand one up in an afternoon.
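To make the six steps concrete, here is a minimal sketch in Python. It is an illustration under loud assumptions, not a production build: the embed function is a toy bag-of-words counter standing in for a real embedding model, the index is a plain list standing in for a vector database, and every function name is hypothetical. The generate step is left out; build_prompt returns the text you would hand to whatever model you use.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(text, size=40):
    """Naive fixed-size chunking: every `size` words, regardless of meaning."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents, size=40):
    """Phase one, run once: chunk, embed, store."""
    index = []
    for doc in documents:
        for piece in chunk(doc, size):
            index.append((piece, embed(piece)))
    return index

def retrieve(index, question, k=3):
    """Phase two, run per question: embed the question, return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Stuff the retrieved passages into the prompt above the question."""
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```

Calling build_index on your documents once, then retrieve and build_prompt per question, reproduces the whole baseline. Everything the later sections fix is visible here: the blind word-count split in chunk, the single similarity pass in retrieve, and the absence of any check on what came back.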

It is also where most RAG projects quietly stall, because each step has a failure mode that only shows up at scale. The 2024 survey of RAG by Yunfan Gao and colleagues lays them out, and practitioners confirm them daily.

Chunking is the first crack. Fixed-size splitting does not respect meaning. A contract clause that runs 600 words gets cut in half, and now neither piece carries the whole rule. The right document gets retrieved and the answer is still missing, because the part that mattered was severed at chunk three. The fix is not exotic, just careful: split on section and paragraph boundaries instead of a blind token count, and the same documents start returning whole rules instead of fragments. It is one of the highest-return changes in the pipeline, and one of the most overlooked.
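A boundary-aware splitter can be sketched in a few lines. This version assumes paragraphs are separated by blank lines and uses a word count as a rough proxy for tokens; the function name and the max_words default are illustrative choices, not a recommendation. It packs whole paragraphs into a chunk until the budget would be exceeded and never cuts inside a paragraph, so a 600-word clause stays in one piece even when it is over budget.

```python
import re

def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of roughly max_words; never split a paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The trade-off is uneven chunk sizes: an oversized paragraph becomes an oversized chunk. For documents with real headings, splitting on section markers first and paragraphs second follows the same pattern.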

Retrieval is the second. Pure vector search matches on semantic similarity, which is exactly wrong for the things that have to match literally: a product SKU, an error code, a surname, a contract number. Embeddings smooth those into a fuzzy neighbourhood, so the chunk with the precise term you need never makes the shortlist. If the retriever misses, the model has nothing. It will still answer.

The third is subtler, and it has a name. Even when retrieval works, models do not read a long context evenly. The 2023 Stanford study Lost in the Middle by Nelson Liu and colleagues showed that a model attends most to what sits at the start and end of its context and reliably loses the bit in the middle. Accuracy can fall by 10 to 20 points when the passage that holds the answer sits halfway down a long prompt rather than at the top. So stuffing twenty passages in does not help. Later work that isolated document count from context length found the same drift from the other direction: adding more retrieved documents, even at a fixed prompt size, made several models measurably worse. More context is not more signal.

The fourth is the deepest. Naive RAG retrieves exactly once, on the question as typed, and then generates. There is no reasoning about whether the retrieval was any good, no second attempt, no notion that a question might need two lookups. If the user asks something that needs information from two different documents, a single similarity search will not assemble it. The pipeline has no way to notice and no way to recover.

Present: advanced RAG patches the pipeline

Advanced RAG is the set of fixes for those cracks, and the Gao survey sorts them by where they sit: before retrieval and after it.

Before retrieval, the work is on chunks and queries. Smarter chunking follows the structure of the document, keeping a clause or a section whole, sometimes storing a small chunk for matching but handing the model the larger parent passage for context. On the query side, the user's phrasing is often a poor search key, so the system rewrites it. Query rewriting cleans up a vague question into a few sharper ones. A technique called HyDE, hypothetical document embeddings, takes this further: it asks the model to draft a fake answer first, then searches with that draft, on the logic that a hypothetical answer sits closer in vector space to the real passage than the bare question does.
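The small-chunk, large-parent idea is easy to show in miniature. In this sketch each parent passage is split into sentences, the sentences are what gets matched, and the whole parent is what gets returned. The matching uses plain word-set overlap as a stand-in for vector similarity, and the sentence splitter is a crude regex; both are assumptions made to keep the example self-contained, and all names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set, a toy stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_child_index(parents):
    """Split each parent passage into sentences and remember which parent each came from."""
    index = []
    for parent_id, passage in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", passage.strip()):
            if sentence:
                index.append((tokens(sentence), parent_id))
    return index

def retrieve_parents(index, parents, question, k=1):
    """Match on small child sentences, but return the larger parent passages."""
    q = tokens(question)
    # Jaccard overlap between the question and each child sentence.
    scored = sorted(index, key=lambda item: len(q & item[0]) / (len(q | item[0]) or 1), reverse=True)
    seen, results = set(), []
    for _, parent_id in scored:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results
```

The small unit gives a sharp match; the parent gives the model the surrounding rule. A question about notice periods matches one sentence of a termination clause and the model still receives the full clause.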

The biggest before-retrieval fix is hybrid search. Instead of trusting vectors alone, you also run an old-fashioned keyword search, usually the BM25 algorithm, and merge the two ranked lists. Keyword search nails the exact tokens that embeddings blur. Vector search catches the synonyms and paraphrases that keywords miss. A method called reciprocal rank fusion combines the two lists by rank position rather than raw score, which sidesteps the awkward fact that the two systems produce numbers on incompatible scales. Hybrid search is now native in Weaviate, Elasticsearch, Qdrant, and OpenSearch, and across real corpora it beats either method on its own.
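Reciprocal rank fusion is short enough to write out. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below assumes the commonly used constant k = 60 and takes ranked lists of document ids; the function name is illustrative, and in practice the databases named above do this for you.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids using rank position only."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Raw scores from each system are ignored; only the position counts.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

keyword_hits = ["a", "b", "c"]   # e.g. from BM25
vector_hits = ["b", "d", "a"]    # e.g. from an embedding search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["b", "a", "d", "c"]
```

Documents that both systems rank highly rise to the top, and a document found by only one system still survives lower down. Because only rank positions enter the sum, a BM25 score of 14.2 and a cosine similarity of 0.83 never have to be compared.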

After retrieval, the main fix is re-ranking. The first retrieval pass is tuned for speed and recall, so it returns a wide net, maybe 50 to 100 candidates, some only loosely relevant. The fast retrieval model embeds the question and each passage separately, then compares vectors, which is quick but blunt. A re-ranker, a cross-encoder, instead feeds the question and a candidate into the model together and scores how well they actually match. That is far more accurate but far too slow to run over a whole corpus, so you use each where it fits: retrieve broadly with the fast model, rank precisely with the slow one, keep the best handful. This two-stage shape is the standard production build, and published benchmarks put the gain from a cross-encoder re-ranker in the range of 10 to 25 percent on retrieval quality over single-pass vector search, with larger jumps on harder corpora. It also defuses the lost-in-the-middle problem, because once you trust the ranking you can drop to five strong passages instead of twenty mediocre ones.
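The two-stage shape can be sketched with the scorers passed in as functions. The two scorers here are toys chosen so the example runs anywhere: overlap_score stands in for the fast first pass, and bigram_score stands in for a cross-encoder by looking at the question and passage together. Neither is a real model, and all names are hypothetical. The control flow is the part that carries over.

```python
import re

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(question, passage):
    """Cheap first-pass stand-in: how many question words appear in the passage."""
    return len(set(words(question)) & set(words(passage)))

def bigram_score(question, passage):
    """Pricier stand-in for a cross-encoder: counts adjacent word pairs the two texts share."""
    q, p = words(question), words(passage)
    return len(set(zip(q, q[1:])) & set(zip(p, p[1:])))

def retrieve_then_rerank(question, corpus, fast_score, slow_score, wide=50, keep=5):
    """Stage one: wide, cheap shortlist. Stage two: careful scoring of the shortlist only."""
    candidates = sorted(corpus, key=lambda p: fast_score(question, p), reverse=True)[:wide]
    reranked = sorted(candidates, key=lambda p: slow_score(question, p), reverse=True)
    return reranked[:keep]
```

The slow scorer runs at most wide times per question, however large the corpus is, which is what makes an expensive model affordable at this position in the pipeline.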

This is the layer most production RAG lives in today, and the tooling reflects it. LlamaIndex is the common choice for retrieval-heavy systems, with a wide set of data connectors; LangChain carries the larger integration ecosystem. The vector database tier is crowded, with Pinecone, Weaviate, Qdrant, Milvus, Chroma, and pgvector all competing for the same workload, and no clear winner: the market splits on price, scale, and whether hybrid search is built in.

The Gao survey adds a third tier above advanced RAG, which it calls modular RAG: the rewrite, retrieve, rerank, and generate steps are pulled apart into swappable modules so a system can add a search loop or skip a stage. It is the bridge to what comes next, because once the steps are modular the obvious question is who decides which ones to run. Advanced RAG, modular or not, is a much better pipeline than the naive version. It is still a pipeline. The steps run in a fixed order, every time, with no judgment.

Future and impact: RAG that reasons

Agentic RAG is the change in kind. The shift, set out in a 2025 survey of agentic RAG, is to stop treating retrieval as a fixed pipeline step and put a decision-maker in charge of it. An agent is a model given a goal and a set of tools, run in a loop: it acts, observes the result, and decides what to do next until it is done. Point that loop at retrieval and the rigid pipeline becomes a process the system can steer.

In practice that means an agent can do several things naive RAG cannot. It decides whether to retrieve at all; a question the model can already answer needs no lookup, and skipping it saves a step. It retrieves in a loop, reading what came back, noticing a gap, and searching again with a better query. It checks its own results, judging whether the passages are actually relevant before it trusts them, and re-retrieving, or saying it found nothing, if they are not. The Self-RAG paper from 2023 was an early version of this, training a model to decide on the fly when to retrieve and to critique what it pulled. And it can use more than one source: a vector index for documents, a SQL database for live numbers, a web search for current events, a customer data platform for a customer's history, choosing the right one for each sub-question. A question that needs two documents stops being a failure case, because the agent simply runs two retrievals and assembles the result.
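Stripped of the model calls, the loop looks roughly like this. The sketch assumes two hypothetical callables that a real system would implement with a language model: plan, which looks at the question and the evidence so far and returns either a (tool name, query) pair or None when it has enough, and is_relevant, which judges each passage before it is kept. The tools are plain functions from a query to a list of passages.

```python
def agentic_retrieve(question, tools, plan, is_relevant, max_steps=4):
    """Loop: decide what to look up next, look it up, keep only what passes the check."""
    evidence = []
    for _ in range(max_steps):
        step = plan(question, evidence)   # in practice, a model call
        if step is None:                  # the agent decides it has enough (or needs nothing)
            break
        tool_name, query = step
        for passage in tools[tool_name](query):
            # Self-check: drop passages the judge rejects, and skip duplicates.
            if is_relevant(question, passage) and passage not in evidence:
                evidence.append(passage)
    return evidence
```

If plan returns None on the first call, no retrieval happens at all. If it asks the document index first and a SQL source second, the two-lookup question that defeats a single similarity search is assembled in two iterations. The max_steps argument is there because a loop without a cap is the failure mode, not the feature.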

This is genuinely better, and it has a genuine cost. Handing control to a loop means the loop can misbehave. Practitioners now track named failure modes: retrieval thrash, where an agent keeps searching, rephrasing the same query, never converging; tool storms, where calls cascade out of control and run up the bill faster than anyone is watching; and context bloat, where each retrieval dumps more text into the window until the model drowns in it. Every loop is more model calls, more latency, more money, and an autonomous loop with no limit set on it will find the most expensive path through your tools. The fixes are the unglamorous engineering of any agent: cap the retrieval iterations, set per-tool budgets, compress what comes back before it enters the context. The frameworks have followed, with LangGraph modelling these retrieval loops as explicit state machines, and the Model Context Protocol standardising how an agent reaches each retrieval source in the first place.
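Those three guards fit in a small amount of code. The sketch below is one possible shape, with hypothetical names throughout: a ToolBudget that an agent loop charges before every tool call, which enforces an overall iteration cap, a per-tool call allowance, and a refusal to rerun a query it has already seen, plus a compress function that truncates by word count as a crude stand-in for real summarisation.

```python
class BudgetExceeded(Exception):
    """Raised when the agent loop should stop and answer with what it has."""

class ToolBudget:
    def __init__(self, max_iterations, per_tool_calls):
        self.max_iterations = max_iterations
        self.remaining = dict(per_tool_calls)   # tool name -> calls left
        self.iterations = 0
        self.seen = set()                       # (tool, normalised query) pairs already run

    def charge(self, tool_name, query):
        """Call before every tool invocation; raises instead of letting the loop run on."""
        key = (tool_name, " ".join(query.lower().split()))
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded("iteration cap reached")        # guards against thrash
        if self.remaining.get(tool_name, 0) <= 0:
            raise BudgetExceeded(f"no calls left for {tool_name}")  # guards against tool storms
        if key in self.seen:
            raise BudgetExceeded("repeated query")               # same search, same result
        self.seen.add(key)
        self.iterations += 1
        self.remaining[tool_name] -= 1

def compress(passages, max_words=50):
    """Crude guard against context bloat: keep at most max_words words in total."""
    kept, used = [], 0
    for passage in passages:
        room = max_words - used
        if room <= 0:
            break
        piece = passage.split()[:room]
        kept.append(" ".join(piece))
        used += len(piece)
    return kept
```

The design choice worth noting is that the budget raises rather than returning a flag, so a loop that forgets to check cannot keep spending. Catching BudgetExceeded is the point where the agent should answer from the evidence it has or say it could not find the answer.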

This is the seam where careful design pays for itself. A frozen model with a single similarity search is easy to demo and hard to trust. An agentic retrieval layer that knows when to look, when to look again, and when to admit it cannot find the answer is harder to build, and it is roughly the difference between a pilot and a system in production. It is the kind of work an implementation partner like Perform Digital is brought in for, after the first demo has impressed everyone and quietly failed.

One thing even agentic RAG handles poorly. Similarity search, hybrid or not, finds passages that resemble the question. It does not reason over how facts connect: which contract supersedes which, who reports to whom, which regulation depends on which. When the relationships between entities matter more than the resemblance between texts, the answer is to retrieve over a knowledge graph instead of a pile of chunks. That is the subject of part two, GraphRAG and Why Knowledge Graphs Came Back.

Council summary

The post argues that retrieval-augmented generation is best understood as a progression, not a single technique: naive RAG is a fixed chunk-embed-retrieve-generate pipeline that demos well and fails quietly at scale, advanced RAG patches the known cracks with smarter chunking, hybrid search, and re-ranking, and agentic RAG changes the kind of thing it is by putting a model in charge of deciding when and how to retrieve. The through-line is that each stage exists to fix a specific, named failure of the one before it. For the reader, the takeaway is diagnostic: when a RAG system underperforms, the fault is almost always a locatable step, and knowing which one tells you which tier of fix you actually need. The honest caveat is that agentic retrieval buys judgment at the price of cost, latency, and new loop-level failure modes that have to be engineered against. Part two takes up the one problem none of these tiers solve well, reasoning over how facts connect, which is where knowledge graphs re-enter the picture.
