For about forty years, testing software meant one thing. You called a function with a known input, compared the return value to a known output, and asserted they were equal. Green or red. The test was honest because the code was deterministic: the same input produced the same output, run after run. assertEqual(add(2, 2), 4) holds today and in a decade.
Now point that habit at an agent. Ask it the same question twice and you get two different answers. Ask it to summarize a document and there are a hundred summaries a careful human would accept. The equality assertion has nothing to grab. The agent is not wrong; "exactly equal to the expected string" is simply a question with no useful answer.
This is the gap most teams fall into. They reach for the testing tools they know, those tools quietly stop meaning anything, and changes ship with no real signal about whether the agent got better or worse. What replaces the unit test is a discipline: the kinds of evaluation that work on a non-deterministic system, the evaluation harness that runs them, and the parts that stay hard.
Origin: why the assertion broke
The unit test rests on a contract that an agent cannot sign. Conventional code is a function in the mathematical sense: input maps to output, the mapping is fixed, and a test pins one point of it. A language model breaks that contract in two ways.
The first is non-determinism. A model samples its output token by token from a probability distribution, so the same prompt yields different text on different runs. People reach for temperature zero to kill this and find it does not work: production LLM APIs still return different outputs for identical inputs at temperature zero, because of batching, hardware floating-point order, and provider-side changes you do not control.
The second is harder: the space of correct answers is large. Even with the output frozen, there is no single right summary of a contract, no single right answer to a support question, no single right tool sequence for booking a flight. Many outputs are good, many are bad, and the line between them is judgment. An equality check cannot encode judgment. It encodes one accepted string, so it fails nine acceptable answers to pass the tenth.
So the question has to change. Not "did the agent return exactly X" but "does this output fall inside the set of acceptable outputs." That small rewording is the whole discipline. You stop writing assertions and start building a measurement instrument: a system that scores outputs against criteria, tolerates variation, and reports a rate rather than a verdict. That instrument is the evaluation, the eval.
Present: the kinds of evaluation that work
There is no single method. A real eval suite mixes several, each suited to a different kind of output.
Programmatic checks, for anything verifiable. Not every agent output is fuzzy, and when a failure can be confirmed by code, always use a code-based check. Did the agent return JSON that validates against the schema? Did the SQL it wrote execute? Did the booking appear in the database? These are deterministic questions hiding inside a non-deterministic system, and you test them like ordinary software: cheap, fast, exact, run on every commit. Pull every verifiable slice into code and the genuinely hard problem shrinks to what is left.
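A minimal sketch of two such checks, using only the Python standard library. The field names in REQUIRED_FIELDS and the bookings table are hypothetical, and the type check is a stand-in for real schema validation, not a replacement for it.

```python
import json
import sqlite3

# Hypothetical schema: required keys and the Python type each must have.
REQUIRED_FIELDS = {"booking_id": str, "passengers": int}

def check_json_output(raw: str) -> bool:
    """Pass only if the output parses as a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in REQUIRED_FIELDS.items())

def check_sql_executes(sql: str, schema_ddl: str) -> bool:
    """Pass only if the generated SQL runs against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # build the empty schema
        conn.execute(sql)               # run the agent's query against it
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

Both functions return a plain boolean, so they slot into an ordinary test runner and cost nothing to run on every commit.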
Reference-based scoring, for output with a known target. When you do have a reference answer, you can score how close the agent landed. The old lexical metrics, BLEU and ROUGE, count overlapping words and n-grams. They are fast and weak: they reward word matching rather than meaning, so a correct paraphrase scores badly and a fluent wrong answer that reuses the words scores well. Embedding-based scoring like BERTScore compares meaning instead of tokens, which handles synonyms and paraphrase. Better, still blunt. It works where the target is constrained, like a factual extraction, and fails on open-ended generation where the reference is one of many good outputs.
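The weakness of lexical overlap is easy to reproduce. The sketch below is a simplified unigram F1 in the spirit of ROUGE-1, not the official ROUGE implementation, and the three sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared lowercase word counts, a rough stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # words shared, respecting counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the refund was approved on monday"
paraphrase = "they okayed the reimbursement at the start of the week"  # right meaning
wrong = "the refund was denied on monday"                              # wrong meaning

print(round(unigram_f1(paraphrase, reference), 3))  # 0.125
print(round(unigram_f1(wrong, reference), 3))       # 0.833
```

The answer that contradicts the reference outscores the correct paraphrase by a wide margin, because five of its six words match.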
LLM-as-judge, for the genuinely subjective. For everything that needs judgment, such as tone, helpfulness, faithfulness to a source, or whether an answer actually addressed the question, the dominant method is to use another model as the grader. You give a judge model the input, the agent's output, and a rubric, and it returns a score and a reason. It works better than its reputation suggests: strong judge models reach around 80 percent agreement with human evaluators, roughly the rate at which two humans agree, and structured methods like G-Eval push the correlation higher. The economics are decisive: a judge is orders of magnitude cheaper and faster than a human reviewer, which is the only reason scoring thousands of cases on every change is affordable. It also has real biases, and using it well means knowing them.
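In code, a judge call is mostly prompt assembly and defensive parsing. This is a sketch under assumptions: call_model is a hypothetical function that sends a prompt to whichever judge model you use and returns its text, and the prompt wording and the pass/reason JSON shape are illustrative, not a standard.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI agent's answer.
Rubric: {rubric}
User input: {user_input}
Agent output: {agent_output}
Reply with JSON only: {{"pass": true or false, "reason": "<one sentence>"}}"""

def judge_output(call_model: Callable[[str], str], user_input: str,
                 agent_output: str, rubric: str) -> dict:
    """Ask the judge for a verdict and a reason; flag malformed replies."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, user_input=user_input, agent_output=agent_output
    )
    raw = call_model(prompt)  # call_model is hypothetical: prompt in, text out
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict) or not isinstance(verdict.get("pass"), bool):
        # A malformed verdict is a grader error, not a failed case.
        return {"pass": None, "reason": "judge returned malformed output"}
    return {"pass": verdict["pass"], "reason": str(verdict.get("reason", ""))}
```

Returning None for a malformed verdict keeps grader failures out of the agent's score, so a flaky judge shows up as its own number.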
Rubric and criteria scoring, to make judgment legible. Asking a judge "is this good, one to five" produces noise. The fix is to decompose quality into named criteria and grade each separately: factual accuracy, completeness, tone, format adherence, policy compliance. Anthropic's guidance is to write clear rubrics and use an isolated judge per dimension rather than one blurred overall score. Binary pass or fail per criterion beats a Likert scale, because five-point ratings produce ambiguous data that different graders read differently. The rubric is also where you build in partial credit: a support agent that diagnoses a problem but fails the refund step is better than one that fails immediately, and collapsing both to "fail" discards the signal you most need.
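One way to combine binary criteria with partial credit is to grade each criterion alone and report the share that passed. The sketch below assumes that reading. The criterion names and the string-matching graders are hypothetical placeholders for real code checks or isolated judge calls.

```python
from typing import Callable, Dict

# Each criterion is graded on its own and returns a plain pass or fail.
Grader = Callable[[str], bool]

def score_rubric(output: str, criteria: Dict[str, Grader]) -> dict:
    """Grade every criterion in isolation, then summarize."""
    results = {name: bool(grade(output)) for name, grade in criteria.items()}
    return {
        "criteria": results,                                   # per-criterion verdicts
        "partial_credit": sum(results.values()) / len(results),
        "all_passed": all(results.values()),
    }

# Hypothetical support-agent rubric; the string checks stand in for real graders.
criteria = {
    "diagnosed_problem": lambda out: "duplicate charge" in out.lower(),
    "issued_refund": lambda out: "refund issued" in out.lower(),
    "polite_tone": lambda out: "sorry" in out.lower(),
}

report = score_rubric("Sorry about that. This is a duplicate charge.", criteria)
print(report["criteria"])  # diagnosis and tone pass, the refund step fails
```

The per-criterion dictionary is the useful part: it says which step broke, which a single overall score cannot.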
Trajectory evaluation, because the final answer can lie. Everything above grades an output. An agent is not an output, it is a process: a sequence of reasoning steps, tool calls, observations, and decisions. Grading only the last message is a trap, because two agents can reach the same answer down completely different paths. One queries the right API once; the other makes three redundant calls, touches a forbidden tool, and stumbles into the answer by luck. Outcome-only grading scores them identically and hands you a false positive. Trajectory evaluation, sometimes called glass-box evaluation, scores the path: was the tool choice correct, the arguments valid, the order sensible, did any step violate a policy. The gap is not academic. One 2026 walkthrough describes an agent holding an 82 percent final-output score while routing the wrong way on 40 percent of queries, invisible to outcome grading because the agent stumbled into a passable answer often enough. You need both: the outcome says the task got done, the trajectory says it got done in a way that holds. It is the same point that decides whether an agent survives the move from demo to production.
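A trajectory check can start as a few rules over the recorded tool calls. The sketch below assumes the agent's trace is available as a list of named calls. The tool names, the ToolCall shape, and the three rules (forbidden tools, expected order, call budget) are illustrative, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def evaluate_trajectory(calls: List[ToolCall], expected_order: List[str],
                        forbidden: Set[str], max_calls: int) -> List[str]:
    """Return a list of violations; an empty list means the path is acceptable."""
    names = [c.name for c in calls]
    violations = [f"forbidden tool used: {n}" for n in names if n in forbidden]
    # Subsequence test: each expected tool must appear after the previous one.
    remaining = iter(names)
    if not all(tool in remaining for tool in expected_order):
        violations.append("expected tools missing or out of order")
    if len(names) > max_calls:
        violations.append(f"too many tool calls: {len(names)} > {max_calls}")
    return violations

# Two hypothetical runs that end with the same successful booking.
clean = [ToolCall("search_flights"), ToolCall("book_flight")]
messy = [ToolCall("search_flights"), ToolCall("search_flights"),
         ToolCall("search_flights"), ToolCall("cancel_all_bookings"),
         ToolCall("book_flight")]
policy = dict(expected_order=["search_flights", "book_flight"],
              forbidden={"cancel_all_bookings"}, max_calls=3)

print(evaluate_trajectory(clean, **policy))  # []
print(evaluate_trajectory(messy, **policy))  # forbidden tool, then too many calls
```

An outcome-only grader would pass both runs. Only the second list tells you the messy one should not ship.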
What an evaluation harness actually is
Methods are not infrastructure. A harness is the system that runs them, and it has four parts.
The first is a held-out test set: real cases, each with an input and either a reference answer or grading criteria, held out the way a machine-learning test set is so you do not tune the agent against it directly. It need not be huge. Anthropic recommends 20 to 50 tasks drawn from real failures, and the Pragmatic Engineer guide suggests at least 100 traces for the initial error analysis. Early changes have large effects, so a small set still catches them. You grow it as regressions get subtler.
The second is automatic scoring: every case runs through the grading logic without a human in the loop, combining programmatic checks, reference scoring, and judge calls into one dashboard, with several trials per case because the model varies.
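Because the model varies, the unit of measurement is a pass rate over several trials, not a single run. A minimal sketch: run_agent and grade are hypothetical callables, and the canned outputs stand in for a real non-deterministic agent so the example is repeatable.

```python
from itertools import cycle
from typing import Callable

def pass_rate(run_agent: Callable[[str], str], grade: Callable[[str], bool],
              case_input: str, trials: int = 5) -> float:
    """Run one case several times and return the fraction of trials that pass."""
    passes = sum(1 for _ in range(trials) if grade(run_agent(case_input)))
    return passes / trials

# Stand-in for a non-deterministic agent: cycles through canned outputs.
_canned = cycle(["4", "4", "four", "4", "5"])

def fake_agent(prompt: str) -> str:
    return next(_canned)

rate = pass_rate(fake_agent, lambda out: out.strip() == "4", "What is 2 + 2?")
print(f"pass rate: {rate:.0%}")  # 3 of 5 trials pass, so 60%
```

The exact-match grader here would also reject "four", which is the argument for swapping in a more tolerant grader on anything open-ended.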
The third is a trigger on every change. The harness runs on every pull request, prompt edit, and model swap, wired into CI, and a change that drops a score past a threshold is blocked from promotion to production exactly as a failing test blocks a merge. Without it, the harness becomes a thing someone runs by hand and forgets.
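The gate itself can be a short comparison between a stored baseline and the scores from the candidate change. This is a sketch, assuming scores are kept as a metric-to-rate mapping. The metric names and the two-point tolerance are made up for illustration.

```python
from typing import Dict, List

def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     max_drop: float = 0.02) -> List[str]:
    """List every metric whose score fell by more than max_drop."""
    regressions = []
    for metric, base in baseline.items():
        new = candidate.get(metric, 0.0)  # a missing metric counts as zero
        if base - new > max_drop:
            regressions.append(f"{metric}: {base:.2f} -> {new:.2f}")
    return regressions

# Hypothetical scores from the main branch and from a pull request.
baseline = {"json_valid": 1.00, "faithfulness": 0.91}
candidate = {"json_valid": 1.00, "faithfulness": 0.84}

regressions = find_regressions(baseline, candidate)
exit_code = 1 if regressions else 0  # a CI script would pass this to sys.exit
print(regressions)  # ['faithfulness: 0.91 -> 0.84']
```

A non-zero exit code is all most CI systems need to block the merge. The tolerance exists because trial-to-trial noise would otherwise fail changes that did nothing wrong.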
The fourth, the part that separates a real harness from a demo, is a feedback loop from production. When the agent fails for a real user, that failure is captured, reduced to a minimal case, and added to the test set as a permanent regression test, so the same failure cannot ship twice. The eval suite is never written once; it accretes.
Run those four together and the practice has a name: eval-driven development. The eval moves to the front of the cycle, and an academic process model now formalizes the loop as a reference architecture. The payoff is operational: teams with a real harness can upgrade to a new model in days, because the harness tells them within an hour whether the swap helped, while teams without one face weeks of manual spot-checking and still ship blind.
The tooling matured fast, from a category that barely existed three years ago. LangSmith ties tracing and evals tightly to LangChain and LangGraph; Langfuse is the leading open-source, self-hostable option; Braintrust is eval-first, built around CI gates that block a regression before release; Arize Phoenix and Weights and Biases Weave round out a field that roundups now compare head to head.
Future and impact: the parts that stay hard
Eval-driven development is the right practice, not a solved problem. A team that buys the tooling while ignoring three honest difficulties gets a dashboard that reassures and means little.
The test set is the hard part. A weak eval set produces confident, meaningless scores, and building a good one means real error analysis: reading transcripts, clustering failures, turning fuzzy complaints into specific, unambiguous cases. The bar Anthropic sets is that two domain experts should independently reach the same verdict on a task; if they would not, the task is too vague to test. It is also easy to build a one-sided set that checks when the agent should act and never when it should hold back. None of this is automatable, and it is slow.
The judge has biases, and you have to grade the grader. It is a model, so it brings model failure modes to the job of catching them. The documented ones are specific. Position bias: swap the order of two candidate answers and the verdict can change. A systematic study across 15 judge models and roughly 150,000 comparisons found that reordering flipped the verdict in more than 10 percent of cases, and the rate climbs when the two answers are close in quality. Verbosity bias: longer, more formal answers score higher even when the content is no better. Self-preference bias: a model scores its own outputs above equivalent work. The reliability ceiling is real. A March 2026 RAND study found no judge was uniformly reliable across benchmarks for safety, persuasion, misuse, and agentic behavior, and consistency broke on changes as small as a reformat; a separate bias benchmark put frontier judge error rates above 50 percent on hard cases. The mitigations are known and partial: randomize answer order, judge each criterion in isolation, give the judge a reference to anchor on, use a judge from a different model family than the generator, and calibrate against human spot-checks, recalibrating when divergence passes roughly 20 to 25 percent. You do not get to trust the judge. You measure it and correct for it.
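Two of these mitigations fit in a few lines. The first function is a stricter variant of randomizing order: it asks the judge in both orders and accepts a preference only when the two verdicts agree. The second measures how often the judge diverges from human labels. The pairwise judge signature and the stub judges are assumptions for this sketch.

```python
from typing import Callable, List

# A pairwise judge takes (question, first_answer, second_answer) and returns
# "first" or "second". The signature is an assumption for this sketch.
PairwiseJudge = Callable[[str, str, str], str]

def order_robust_preference(judge: PairwiseJudge, question: str, a: str, b: str) -> str:
    """Accept a preference only if it survives swapping the answer order."""
    forward = judge(question, a, b)
    reverse = judge(question, b, a)
    if forward == "first" and reverse == "second":
        return "a"
    if forward == "second" and reverse == "first":
        return "b"
    return "inconsistent"  # the verdict followed position, not content

def divergence(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Fraction of spot-checked cases where the judge and the human disagree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j != h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

RECALIBRATE_ABOVE = 0.20  # lower end of the 20 to 25 percent range

# Stub judges: one always picks the first slot, one looks at content.
position_biased = lambda q, x, y: "first"
content_based = lambda q, x, y: "first" if "Paris" in x else "second"

print(order_robust_preference(position_biased, "Capital of France?", "Paris", "Lyon"))  # inconsistent
print(order_robust_preference(content_based, "Capital of France?", "Paris", "Lyon"))    # a
```

Asking twice doubles the judge cost for pairwise comparisons, which is the trade against plain randomization. The inconsistent rate is itself worth tracking as a measure of how much the judge leans on position.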
Cost is the third constraint, and it compounds. A judge call running on thousands of cases, several trials each, on every pull request is a real line item, worse still when you tune the judge itself, since evaluating a judge configuration means running many evaluations. The discipline that controls it is the same one that makes the suite good: deterministic checks carry the bulk, the frontier judge is reserved for what genuinely needs it.
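The arithmetic is worth running once with your own numbers. Every figure below is hypothetical, including the per-call price. The point of the sketch is the shape of the multiplication, not the totals.

```python
def judge_cost_per_run(cases: int, trials: int, judged_criteria: int,
                       cost_per_call: float) -> float:
    """Judge spend for one harness run: one call per case, trial, and judged criterion."""
    return cases * trials * judged_criteria * cost_per_call

# Hypothetical suite: 2,000 cases, 3 trials each, 5 criteria, $0.002 per judge call.
all_judged = judge_cost_per_run(2000, 3, judged_criteria=5, cost_per_call=0.002)

# Same suite after moving 4 of the 5 criteria into deterministic code checks.
mostly_code = judge_cost_per_run(2000, 3, judged_criteria=1, cost_per_call=0.002)

runs_per_month = 40  # hypothetical number of pull requests that trigger the harness
print(f"all judged:  ${all_judged:,.0f} per run, ${all_judged * runs_per_month:,.0f} per month")
print(f"mostly code: ${mostly_code:,.0f} per run, ${mostly_code * runs_per_month:,.0f} per month")
```

Under these assumptions the judge bill falls from about $60 to about $12 per run, and the gap multiplies by every pull request that triggers the harness.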
There is also a structural limit. A static test set is fixed, and a capable agent can outgrow it: Anthropic describes a model that solved a flight-booking task by finding a policy loophole, produced a better outcome for the user, and "failed" the written eval for not following the expected path. The eval was scoring conformity to an old assumption, not quality. It is the same reason benchmark scores drift from real performance.
That is the real shift. Testing used to be a gate you passed; for a non-deterministic system it is an instrument you maintain, a held-out set kept current, a judge kept calibrated, a feedback loop kept fed. The teams shipping reliable agents are not the ones with a smarter model. They are the ones who treated evaluation as core infrastructure from the first commit.
Council summary
This post argues that the equality assertion, the foundation of software testing for forty years, has nothing to grab onto when the system under test is a non-deterministic agent, and that the replacement is not a cleverer assertion but a measurement instrument. It walks the practitioner through the five evaluation methods that work, then defines an evaluation harness concretely: a held-out test set, automatic scoring, a trigger on every change, and a feedback loop from production. It is honest about what stays hard: the labor of a good test set, the biases of the judge model, and the cost that compounds with every judge call. It never lets the tooling stand in for that work. The reader should leave with a shift in posture: evaluation for agents is not a gate you pass once but infrastructure you maintain from the first commit.