Every agent project has a moment where the mood changes. The demo has been shown, the room has nodded, someone with a budget says yes. Now the agent wants real credentials, a connection to the systems that hold customer records and money, and permission to take actions a person cannot quietly undo.
That is the gap this piece is about. Two companion posts diagnose it: one on why agents fail in production, one on why generative AI pilots fail. This one is the constructive answer: the ten gates an agent should clear before it acts on a live system, and the reason each one exists. If your agent cannot pass a gate, that is not a formality to wave through. It is a finding.
Origin: a checklist exists because the demo is a different machine
A demo is an honest measurement of the wrong system. It runs the happy path: clean input, an API that is up and fast, a task of three or four steps, no edge cases because the builder did not feed any. Under those conditions a competent agent succeeds nearly every time.
Production is a different machine wearing the same name. Input arrives malformed or contradictory. The API is rate limited, or its token expired at 2pm. The task is not four steps once you count authentication, retries, validation, and the external calls the demo never made. And the agent faces that harder task hundreds of times a day. MIT's NANDA study found that 95 percent of generative AI pilots deliver no measurable profit-and-loss impact. The cause is rarely the model; it is everything around the model that the demo never tested, and the checklist is the list of those things.
Present: the ten gates
Gates one and two decide whether the agent should ship at all, the middle six are the safety and control layer, and the last two cover the time after launch, which is most of the agent's life.
1. A numeric success criterion, agreed before the build.
Write the number first. One metric, a target, a date. Not "the agent should handle support tickets well" but something a spreadsheet can settle: resolve 70 percent of password-reset tickets end to end at under 40 cents each, with satisfaction no lower than the human baseline. Without a number a pilot cannot pass or fail, only be liked or disliked, and when budgets tighten anything without a hard figure is cut. The metric to anchor on is cost per successful outcome, and "successful" does real work: a successful task is a business result completed correctly without human rework, not an API call that returned a string. An agent that costs 10 cents a run but fails half the time costs 20 cents per real outcome. A number invented after the build is just the result wearing a target, so agree it first.
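The arithmetic behind cost per successful outcome is small enough to write down. The sketch below is illustrative only: the function names are hypothetical, the default thresholds simply mirror the password-reset example above, and the satisfaction comparison is left out for brevity.

```python
def cost_per_successful_outcome(cost_per_run: float, success_rate: float) -> float:
    """Cost of one correct, rework-free result, given the cost of one attempt."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


def meets_criterion(resolved: int, attempted: int, total_cost: float,
                    min_resolution_rate: float = 0.70,
                    max_cost_per_outcome: float = 0.40) -> bool:
    """Pass/fail against targets that were fixed before the build."""
    if attempted <= 0 or resolved <= 0:
        return False
    resolution_rate = resolved / attempted
    cost_per_outcome = total_cost / resolved  # failed runs still cost money
    return (resolution_rate >= min_resolution_rate
            and cost_per_outcome < max_cost_per_outcome)


# 10 cents a run at a 50 percent success rate is 20 cents per real outcome.
print(round(cost_per_successful_outcome(0.10, 0.5), 2))  # 0.2
print(meets_criterion(resolved=720, attempted=1000, total_cost=250.0))  # True
```

Dividing total cost by resolved tickets, not attempted ones, is the point: the failures are paid for by the successes.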
2. An evaluation harness with a held-out test set and automatic scoring.
Trying an agent a few times is not measurement. A harness does it properly: a labelled set of test cases with known-good outcomes, an automatic scorer, and a number at the end. It is the agent's test suite, and it is not optional, because traditional assertions break on a non-deterministic system. Two parts get skipped and should not. The held-out set is cases the agent was never tuned against, so a good score means generalisation, not memorisation. Automatic scoring, a model-as-judge or a deterministic check, lets the harness run on every change rather than once a quarter. One framework drawn from more than 100 deployments scores the agent's trajectory, not just its final answer, and wires into CI so a regression shows up as a failed check on a pull request. Build it before the agent is good; the harness is how you find out whether it is.
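A harness can start very small. The following is a minimal sketch under stated assumptions: `Case`, `run_harness`, `ci_gate`, and `toy_agent` are hypothetical names, the scorer is a deterministic exact match, and only the final answer is scored (trajectory scoring, as described above, would need the agent to return its steps as well).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    prompt: str
    expected: str  # the known-good outcome for this case


def exact_match(output: str, expected: str) -> bool:
    """Deterministic scorer; a model-as-judge could replace it behind the same signature."""
    return output.strip().lower() == expected.strip().lower()


def run_harness(agent: Callable[[str], str], cases: Sequence[Case],
                scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Run every held-out case through the agent and return the pass rate."""
    if not cases:
        raise ValueError("the held-out set is empty")
    passed = sum(1 for case in cases if scorer(agent(case.prompt), case.expected))
    return passed / len(cases)


def ci_gate(score: float, threshold: float) -> int:
    """Exit code for a CI step: 0 passes the check, 1 fails the pull request."""
    return 0 if score >= threshold else 1


def toy_agent(prompt: str) -> str:
    # Stand-in for the real agent, only here to make the sketch executable.
    return "reset_link_sent" if "password" in prompt.lower() else "escalate"


held_out = [
    Case("I forgot my password", "reset_link_sent"),
    Case("Password reset please", "reset_link_sent"),
    Case("My invoice is wrong", "escalate"),
    Case("Cannot log in after changing email", "reset_link_sent"),
]
score = run_harness(toy_agent, held_out)
print(score, ci_gate(score, threshold=0.70))  # 0.75 0
```

In a CI job the return value of `ci_gate` would be passed to `sys.exit`, so a drop below the threshold fails the check.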
3. Guardrails at the input, output, and action layers.
A capable model is not a safe product. Guardrails are the controls around it, and the pattern is defense in depth: three layers, because any one of them will miss something. Input guardrails screen what reaches the model: length caps, content filters, and a check for prompt injection, the hidden instruction in a document or web page that hijacks an agent once it can call tools. Output guardrails check what the model produced before anything consumes it: schema validation, a plausibility check, a toxicity filter. Action guardrails matter most, confirming a tool call is allowed, in budget, and in scope before it executes. OpenAI's guidance describes the same layered model; a fuller post covers guardrails for agentic systems.
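The three layers can be sketched as three small checks that each return a list of problems (an empty list means the layer passed). Everything named here is an assumption for illustration: the limits, the `TOOL_LIMITS` policy table, and especially the keyword-based injection check, which is far too naive for real use and only marks where a proper detector would sit.

```python
import json
from typing import Optional

MAX_INPUT_CHARS = 4000
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")
# Hypothetical tool policy: tool name -> largest amount a single call may move.
TOOL_LIMITS = {"lookup_order": 0.0, "issue_refund": 50.0}


def check_input(text: str) -> list[str]:
    """Input layer: screen what reaches the model."""
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append("input too long")
    lowered = text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        problems.append("possible prompt injection")
    return problems


def check_output(raw: str) -> tuple[Optional[dict], list[str]]:
    """Output layer: validate the model's tool call before anything consumes it."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["output is not valid JSON"]
    if (not isinstance(call, dict) or not isinstance(call.get("tool"), str)
            or not isinstance(call.get("args"), dict)):
        return None, ["output does not match the tool-call schema"]
    return call, []


def check_action(call: dict, remaining_budget: float) -> list[str]:
    """Action layer: allowed tool, in scope, in budget, checked before execution."""
    tool = call["tool"]
    if tool not in TOOL_LIMITS:
        return [f"tool not allowed: {tool}"]
    amount = call["args"].get("amount", 0)
    if not isinstance(amount, (int, float)):
        return ["amount is not a number"]
    problems = []
    if amount > TOOL_LIMITS[tool]:
        problems.append("amount exceeds per-call limit")
    if amount > remaining_budget:
        problems.append("amount exceeds remaining budget")
    return problems
```

A caller would run the layers in order and refuse to execute the tool call if any of them returns a non-empty list.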
4. Observability and tracing, wired in from day one.
An agent makes decisions you did not write, and when it fails the symptom and the cause are often several steps apart with nothing crashed. Without a trace of every step (input, model call, tool call, result), you are debugging a confident black box from a customer complaint. Wire it in before the first deploy, not after the first incident, because a trace you did not capture is gone. The standard has firmed up: the OpenTelemetry GenAI semantic conventions, a cross-vendor effort drawing on work from Google, Microsoft, and IBM, are converging on a shared vocabulary for model calls, agent invocations, and tool executions, so a trace is portable rather than locked to one vendor's SDK. The test of whether you have observability or just logs: pick a random production run from last week and explain what the agent did and why. The fuller treatment is in agent observability.
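To make "a trace of every step" concrete, here is a standard-library sketch of a per-run trace recorder. It is not the OpenTelemetry SDK: `RunTrace` is a hypothetical class, and the attribute keys in the usage lines (`gen_ai.operation.name` and similar) are written in the style of the GenAI semantic conventions and should be verified against the current specification before being relied on.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Optional


class RunTrace:
    """Collects one record per step of a single agent run."""

    def __init__(self, run_id: Optional[str] = None):
        self.run_id = run_id or uuid.uuid4().hex
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, attributes: Optional[dict] = None):
        record = {"run_id": self.run_id, "name": name,
                  "attributes": dict(attributes or {}), "status": "ok"}
        start = time.monotonic()
        try:
            yield record  # the caller can add attributes while the step runs
        except Exception as exc:
            record["status"] = "error"
            record["attributes"]["error.type"] = type(exc).__name__
            raise
        finally:
            # Recorded whether the step succeeded or failed.
            record["duration_s"] = time.monotonic() - start
            self.spans.append(record)


trace = RunTrace(run_id="run-001")
with trace.span("model call", {"gen_ai.operation.name": "chat",
                               "gen_ai.request.model": "example-model"}) as s:
    s["attributes"]["gen_ai.usage.input_tokens"] = 812  # illustrative value
with trace.span("tool call", {"gen_ai.operation.name": "execute_tool",
                              "gen_ai.tool.name": "lookup_order"}):
    pass  # the real tool call would run here
```

With records like these exported to a tracing backend, the "explain a random run from last week" test becomes a query rather than an archaeology project.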
5. Cost controls and a hard spend cap.
Give an agent a loop and no ceiling and it will eventually spend money you did not plan to. A reasoning agent re-sends its growing context on every step, so total cost grows faster than linearly with the number of steps: if each step adds a similar amount of context, cumulative input tokens grow roughly with the square of the step count. One vendor write-up describes an email triage agent that fell into a routing loop and sent 89 identical emails to one customer in 31 minutes before a human pulled its API key, 104 once the in-flight cycles drained. The control that works is layered: MindStudio's deployment guidance recommends caps at four levels, per session, per user per day, per tool, and an infrastructure ceiling, with the example that three Stripe calls in a session is fine and 300 is not. A budget alert is not a budget control: an alert fires after the spend, so only a hard limit that blocks the next API call stops the bleed. And decide what happens when the cap is hit, because an unhandled error is not a decision you want to discover in production.
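The difference between an alert and a hard limit is where the check runs: before the call, not after it. A minimal sketch, assuming a hypothetical `SpendGuard` that covers only two of the four levels mentioned above (a session cap and a per-tool call cap); the per-user daily cap and the infrastructure ceiling would follow the same shape.

```python
from collections import Counter


class SpendGuard:
    """Authorises each call before it is made; a refusal is the hard cap."""

    def __init__(self, session_cap_usd: float, tool_call_caps: dict[str, int]):
        self.session_cap_usd = session_cap_usd
        self.tool_call_caps = tool_call_caps
        self.spent_usd = 0.0
        self.tool_calls: Counter = Counter()

    def try_spend(self, tool: str, cost_usd: float) -> bool:
        """Return True and reserve the spend, or False if the call must not start."""
        if self.spent_usd + cost_usd > self.session_cap_usd:
            return False
        cap = self.tool_call_caps.get(tool)
        if cap is not None and self.tool_calls[tool] + 1 > cap:
            return False
        self.spent_usd += cost_usd
        self.tool_calls[tool] += 1
        return True


guard = SpendGuard(session_cap_usd=1.00, tool_call_caps={"stripe": 3})
results = [guard.try_spend("stripe", 0.10) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

What the agent does on `False` (stop, hand off to a person, return a partial result) is the decision the paragraph above asks you to make in advance.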
6. Identity, scoped least-privilege access, and an audit trail.
An agent that can act needs to act as someone, and that someone should not be a borrowed human login or a shared admin key. It needs its own machine identity, so every action is attributable to one agent and revocable without taking down anything else. The access on that identity should be the minimum the task needs: read-only where a read will do, write scoped to specific tables, only the API endpoints the job requires. This is the highest-return control on the list. Teleport's 2026 research, summarised in BeyondTrust's write-up on least privilege, found organisations enforcing it reported a 17 percent security incident rate against 76 percent for those that did not, a gap driven by how much access the agent holds rather than which model it runs. The third piece is a tamper-evident audit trail of what the agent did, which credential it used, and what it touched, because traditional IAM was built for humans with static roles, not for software that touches a dozen systems in an hour.
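Two of these pieces lend themselves to a short sketch: a scope check keyed on the agent's own identity, and an audit log made tamper-evident with a hash chain. The names (`SCOPES`, `is_allowed`, `AuditLog`) and the example agent are hypothetical. A hash chain is one common technique, not necessarily what any product mentioned here uses, and it only detects tampering if the latest hash is also stored somewhere the agent cannot write.

```python
import hashlib
import json

# Hypothetical grants: each agent identity holds only the (action, resource) pairs it needs.
SCOPES = {"support-agent-01": {("read", "tickets"), ("write", "ticket_replies")}}


def is_allowed(agent_id: str, action: str, resource: str) -> bool:
    return (action, resource) in SCOPES.get(agent_id, set())


class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries: list[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, agent_id: str, credential_id: str, action: str, resource: str) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"agent_id": agent_id, "credential_id": credential_id,
                "action": action, "resource": resource, "prev_hash": prev_hash}
        self.entries.append({**body, "hash": self._digest(body)})

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or self._digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Because the grants are keyed on one agent identity, revoking that identity removes its access without touching any human login or shared key.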
7. A human in the loop for consequential actions.
Human-in-the-loop does not mean a person reviews everything; that removes the point of the agent. It means a person is in the loop for the decisions that warrant it, named in advance, and the line that names them is reversibility. Drafting a reply, retrieving a record, classifying a ticket: reversible, low stakes, let the agent run. Issuing a refund, sending email outside the company, deleting data, changing a production record: hard or impossible to undo, and those get an approval gate. Anthropic's guidance on building effective agents adds a related point: do not give the agent autonomy a fixed code path would handle, because every decision it owns is one more thing that can go wrong. Put the gates on the consequential actions specifically; scattered at random, they only train reviewers to click through.
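Naming the gated actions in advance can be as plain as a lookup table. In this sketch the action names are hypothetical labels for the examples above, and sending any unlisted action to approval is a design choice of the sketch, not something the guidance cited here prescribes.

```python
from enum import Enum


class Decision(Enum):
    RUN = "run"
    NEEDS_APPROVAL = "needs_approval"


# Reversible, low-stakes actions the agent may take on its own.
REVERSIBLE = frozenset({"draft_reply", "retrieve_record", "classify_ticket"})
# Hard or impossible to undo: these always go through a person.
CONSEQUENTIAL = frozenset({"issue_refund", "send_external_email",
                           "delete_data", "update_production_record"})


def route(action: str) -> Decision:
    """Decide whether an action runs directly or waits at an approval gate."""
    if action in REVERSIBLE:
        return Decision.RUN
    # Listed consequential actions and anything unrecognised both wait for approval.
    return Decision.NEEDS_APPROVAL


print(route("draft_reply").value)   # run
print(route("issue_refund").value)  # needs_approval
```

Keeping the table short is part of the control: every entry in `CONSEQUENTIAL` is a review someone has to do attentively.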
8. Integration testing against the real systems and their failure modes.
A demo calls one stable API. Production calls a dozen, and they fail in ordinary ways: rate limits, expired tokens, a renamed field, a slow response, a 500. A mock will not surface them; agents that run only against mocks behave differently when they hit real latency and ambiguous responses. The practical setup is a staging environment with sandbox accounts for each integrated system, plus an agent sandbox: an isolated runtime where the agent takes real actions against controlled inputs without real consequences. Inside it, break things on purpose: kill the API mid-call, feed a malformed response, send the input that contradicts itself.
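Breaking things on purpose is easier when the failures are scripted. The sketch below shows the shape of such a test with an in-process stand-in; in a real staging setup the faults would be injected in front of the sandbox account rather than simulated by a class like this. `FaultyApi`, `fetch_status`, and the fault names are hypothetical.

```python
class RateLimited(Exception):
    pass


class TokenExpired(Exception):
    pass


class FaultyApi:
    """Replays a scripted sequence of faults, then behaves normally."""

    def __init__(self, script):
        self.script = list(script)
        self.calls = 0

    def get_order(self, order_id: str) -> dict:
        self.calls += 1
        fault = self.script.pop(0) if self.script else None
        if fault == "rate_limit":
            raise RateLimited()
        if fault == "expired_token":
            raise TokenExpired()
        if fault == "malformed":
            return {"id": order_id}  # the 'status' field is missing
        return {"id": order_id, "status": "shipped"}


def fetch_status(api: FaultyApi, order_id: str, max_attempts: int = 3) -> str:
    """Tool wrapper under test: retry rate limits, escalate everything else."""
    for _ in range(max_attempts):
        try:
            response = api.get_order(order_id)
        except RateLimited:
            continue  # real code would back off before retrying
        except TokenExpired:
            return "escalate: credentials"
        status = response.get("status")
        if status is None:
            return "escalate: malformed response"
        return status
    return "escalate: rate limited"


print(fetch_status(FaultyApi(["rate_limit", "rate_limit"]), "o-1"))  # shipped
print(fetch_status(FaultyApi(["malformed"]), "o-1"))  # escalate: malformed response
```

Each scripted fault corresponds to one of the ordinary failures listed above, so the test suite reads as a checklist of them.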
9. A rollback plan and a kill switch.
When the agent misbehaves at scale, you need to stop it fast and undo what can be undone, and both halves must exist before launch; a rollback designed under incident pressure is improvisation performed badly in front of an audience. A kill switch has to do more than terminate a process: it must stop the next tool call from starting and ideally fire automatically when a circuit breaker trips on action count, error rate, or spend, running outside the agent's logic so a misbehaving agent cannot bypass it. The part teams skip: stopping an agent does not reverse what it already did, so the rollback plan must define how far external state is unwound. The stakes are real. In April 2026 a Cursor coding agent at car-rental software firm PocketOS, holding an over-scoped token it found in an unrelated file, wiped the production database and its backups in nine seconds; the newest recoverable backup was three months old, and ServiceNow's CEO later used the case to argue for a built-in kill switch. Pair the kill switch with a staged rollout, canary traffic at 1 percent, then 10, then 50, so a bad release has a small blast radius.
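A circuit breaker of the kind described above can be sketched in a few lines. This is an illustration under assumptions: `CircuitBreaker` and its thresholds are hypothetical, and in a real deployment the state would live outside the agent's process (for example in the tool gateway) so the agent cannot bypass it. It only stops the next call; it does not undo anything already done.

```python
from typing import Optional


class CircuitBreaker:
    """Latching kill switch: once tripped, no further tool call is allowed."""

    def __init__(self, max_actions: int, max_error_rate: float,
                 max_spend_usd: float, min_samples: int = 5):
        self.max_actions = max_actions
        self.max_error_rate = max_error_rate
        self.max_spend_usd = max_spend_usd
        self.min_samples = min_samples  # avoid tripping on the first stray error
        self.actions = 0
        self.errors = 0
        self.spend_usd = 0.0
        self.tripped_reason: Optional[str] = None

    def trip(self, reason: str) -> None:
        """Also the manual kill switch: an operator calls trip('manual')."""
        if self.tripped_reason is None:
            self.tripped_reason = reason

    def allow_next_call(self) -> bool:
        """Checked by the tool layer before every call starts."""
        return self.tripped_reason is None

    def record(self, ok: bool, cost_usd: float = 0.0) -> None:
        """Update counters after a call and trip automatically on any threshold."""
        self.actions += 1
        self.errors += 0 if ok else 1
        self.spend_usd += cost_usd
        if self.actions >= self.max_actions:
            self.trip("action_count")
        elif self.spend_usd >= self.max_spend_usd:
            self.trip("spend")
        elif (self.actions >= self.min_samples
              and self.errors / self.actions > self.max_error_rate):
            self.trip("error_rate")


breaker = CircuitBreaker(max_actions=100, max_error_rate=0.5,
                         max_spend_usd=10.0, min_samples=4)
for ok in (True, True, False, False, False):
    breaker.record(ok)
print(breaker.allow_next_call(), breaker.tripped_reason)  # False error_rate
```

The breaker never resets itself: clearing it is left to a person, after the rollback question has been answered.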
10. An owner, plus a monitoring and review cadence after launch.
Launch is the start of the agent's working life, not the end of the project. An agent without a named owner is an agent nobody watches, and the model underneath does not stay still: providers ship updates, the world the agent reasons about shifts, and performance drifts down quietly with no error to announce it. So the last gate is a person and a schedule: one owner accountable for the agent's behaviour and its escalation path, and a monitoring setup that runs the harness against sampled live traffic and alerts when a score crosses a threshold, so a regression surfaces as a number before a complaint. And a review cadence decided rather than left to chance: high-stakes agents in fast-moving domains get reviewed often, stable ones less so. The honest difficulty is organisational: the team that built the agent is usually reassigned by the time it ships, and a gate with no owner is not a gate.
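The "alerts when a score crosses a threshold" part reduces to comparing a recent window of harness scores against the launch baseline. A minimal sketch with hypothetical names and thresholds, assuming each sampled live run has already been scored 1 (pass) or 0 (fail) by the harness:

```python
from statistics import mean
from typing import Optional, Sequence


def drift_alert(recent_scores: Sequence[float], baseline: float,
                max_drop: float = 0.05, min_samples: int = 20) -> Optional[str]:
    """Return an alert message if the recent pass rate fell too far below baseline."""
    if len(recent_scores) < min_samples:
        return None  # too few sampled runs to say anything yet
    current = mean(recent_scores)
    if baseline - current > max_drop:
        return f"score dropped from {baseline:.2f} to {current:.2f}"
    return None


healthy = [1] * 18 + [0] * 2   # 0.90 on 20 sampled runs
degraded = [1] * 16 + [0] * 4  # 0.80 on 20 sampled runs
print(drift_alert(healthy, baseline=0.92))   # None
print(drift_alert(degraded, baseline=0.92))  # score dropped from 0.92 to 0.80
```

The alert should route to the named owner; a threshold that notifies a shared channel nobody reads recreates the unowned gate.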
Future and impact
Two shifts will change how this checklist is used. The first is standardisation. Two years ago every gate here was bespoke; now the harness has shared metric frameworks, observability has the OpenTelemetry GenAI conventions, and agent identity has emerging standards from groups like CoSAI. Each gate is becoming something you configure rather than invent. The second shift is pressure. As agents move to longer autonomous chains, and as an enterprise runs hundreds of them rather than one, agent sprawl turns a missing audit trail or an unowned agent from a tidy-up task into a real security exposure. The checklist stops being best practice and becomes the baseline an auditor or an incident review expects to find.
None of these ten gates is exotic. They are scoping, testing, access control, observability, and ownership, the same disciplines that govern any system touching real money and real customers. The model is rarely what fails; the gates are the system around the model, and building that system is the work Perform Digital does for enterprise agent deployments. Agents make these disciplines feel new because the system is non-deterministic and acts on its own, but they are the ordinary engineering that turns an impressive demo into something you can run on a Tuesday without holding your breath. The teams treating the checklist as the project are shipping. The ones that mistook the demo for the finish line are inside the 95 percent.
Council summary
This post argues that the hard part of shipping an agent is not the model but the ten things a demo structurally cannot prove, and it treats each one as a gate rather than a suggestion. Its strongest move is framing a failed gate as a finding, not a formality, backed by numbers a reader can act on: the incident-rate gap from least privilege, the four-level spend cap, the nine-second PocketOS database wipe. The takeaway is practical and uncomfortable. If your agent cannot pass a gate, you have learned something real about its readiness, and the teams that work the checklist instead of the demo are the ones whose agents survive contact with production.