agent observability

Agent Observability: Tracing a Decision You Did Not Write

When an agent fails, ask not where code went wrong, but what the model saw and why it chose that. A stack trace cannot answer it. Agent observability can.

A stack trace is an honest document. When ordinary software breaks, you open the trace and read the exact line that threw, then the line that called it, then the one above that, all the way back to the entry point. It works because you wrote every one of those lines. The control flow was decided in advance, by you, in code a reviewer could read. The failure is somewhere in a path you authored, and the trace points at it.

Now an agent fails. A support agent issues a refund it should have escalated. The stack does not throw, because nothing crashed. Every HTTP call returned 200. The model decided to call the refund tool, and that decision was not a line you wrote. It was a choice made inside a forward pass over a prompt you did not fully see, against a context assembled from documents, prior turns, and tool outputs you did not inspect. The trace you reach for has nothing to point at. The bug is not in the code. It is in a decision.

That gap is why agent observability exists as its own discipline. You cannot run a production agent without it, and the reason is not that it is good hygiene. It is that the failure mode changed shape, so the instrument has to change with it.

Origin: tracing was built for code you wrote

Observability did not start with agents. It started with the same problem one layer down: a request enters a system, fans out across a dozen services, and something somewhere is slow or broken, and a single machine's logs cannot tell you where.

Google's answer, published in the 2010 Dapper paper, gave the field its two load-bearing words. A trace is one request's full journey through a system. A span is one timed unit of work inside that journey, with a start, an end, and a parent. Spans nest into a tree, the tree is the trace, and the trace shows you which span ate the latency or returned the error. Twitter open-sourced Zipkin on the idea in 2012, Uber built Jaeger internally from 2015 and open-sourced it in 2017, and the approach eventually consolidated into OpenTelemetry, now the vendor-neutral standard for instrumenting software.

Every bit of that worked because the spans matched code paths a human designed. A span was a function call or an RPC. The trace was a map of decisions made at authoring time. Debugging meant finding the wrong branch in a set of branches you owned.

An agent breaks that assumption at the root. It runs a loop: the model reads its context, picks a tool, sees a result, and decides what to do next, again and again, until it judges the goal met. Nobody wrote that branch structure. The model generated it at runtime, and it generates a different one on the next run. Traditional logging records the lines you instrumented; the interesting event in an agent is the line the model invented. APM dashboards built for service health report latency, error rate, and throughput, all of which can look perfectly green while the agent picks the wrong tool, retrieves stale memory, or loops without making progress. A 200 response can wrap a confidently wrong answer, and a confidently wrong answer is the thing you most need to catch. The old instruments are not broken. They were built to answer "is the system up," and the agent question is "why did the model choose that."

Present: what an agent trace actually captures

An agent trace is the same tree of spans, repurposed. The root span is the whole task. Underneath it sits the real sequence the agent took, and a useful trace lets an engineer reconstruct what the agent did, in what order, with what inputs and outputs at every step. Four kinds of span carry the weight.

A model-call span records one round trip to the LLM. Not just that a call happened, but the input context the model actually saw, the system instructions, the messages, the tool definitions, and the output it produced, plus the model name, token counts, latency, and the finish reason. This is the span that answers what it saw. When an agent does something inexplicable, the explanation is almost always in the context that went into that call.

A tool-call span records one action. The tool the model picked, the exact arguments it constructed, the raw result, and latency, cost, retries, and error state. Arguments matter more than they look. A model builds tool arguments dynamically from prior context, so a hallucinated field or a wrong value is a common, quiet failure, and without the captured arguments your logs say a tool returned an error but never why the model called it that way.

A reasoning or state span records a decision point: the plan the model formed, the action it chose, what it observed, and the working state before and after. This is the span that surfaces plan drift, where an agent starts toward one goal and gradually shifts to another without ever explicitly replanning.

A memory or retrieval span records a read or write against a store: the query, the entries returned, their relevance scores, what was written and what triggered it. It is how you catch a stale read, where the agent retrieves something that was right last week and is wrong now.

Around every span sits the accounting: token counts, cost, and latency per step, with identifiers tagged at the root and inherited by every child so cost and quality roll up per user, task, and tenant. Read together, these spans turn an opaque outcome into a legible decision sequence. The refund agent from the opening stops being a mystery: the trace shows the retrieval span pulled an outdated policy document, the model-call span shows that document in the context window, and the tool-call span shows the refund firing on that bad input. The bug was four steps upstream of the symptom, exactly where an agent's bugs usually hide.

For years every vendor captured this differently, and a trace from one tool meant nothing to another. That is changing. OpenTelemetry's GenAI special interest group started work in April 2024 on semantic conventions for LLM and agent workloads: a shared vocabulary of span names and attributes. The conventions name operations like chat, invoke_agent, and execute_tool, and standardize attributes including gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.input.messages. An agent run renders as an invoke_agent span with child chat spans for each model call and execute_tool spans for each action. Be clear-eyed about maturity: the GenAI conventions are still marked Development, spans and attributes included, so the names can shift between releases. The model-call (chat) spans are the most settled and the most widely implemented; the agent and framework spans are newer and more likely to move. Auto-instrumentation packages exist for OpenAI, Anthropic, LangChain, and LlamaIndex, and the overhead is small, under a millisecond per call against model latency measured in seconds.

The tooling has grown into a real category. LangSmith ties tracing tightly to LangChain and LangGraph and is the natural pick for teams committed to that stack. Langfuse is open-source, MIT-licensed, self-hostable, and framework-agnostic through a full OpenTelemetry ingestion endpoint. Braintrust is eval-first, with tracing feeding a regression harness. Arize Phoenix is OpenTelemetry-native and notebook-friendly. Laminar, Weights and Biases Weave, and others fill out a field that roundups now compare directly. The detail that matters is not which logo wins. It is that observability and evaluation are converging into one loop. A trace from production that shows a failure becomes a test case in an evaluation harness; the harness runs that case on every change; a regression is caught before it ships rather than after a customer complains. Observability tells you what the agent did, evaluation tells you whether that was good, and the failures observability catches are the raw material evaluation needs.

Future and impact: the parts that stay genuinely hard

Instrumenting an agent is the easy half. Three problems do not go away when you buy a platform, and a team that ignores them gets a dashboard that looks reassuring but is not.

The first is volume and cost. A single agent run can produce dozens of spans, and a multi-agent system with handoffs can produce thousands, each model-call span carrying the full input context, which is large by construction. Store every span at full fidelity across production traffic and the observability bill becomes its own line item, sometimes rivaling the model spend it was meant to watch. The answer is sampling, and it has to be smarter than random. Tail-based sampling buffers a trace until it finishes, then keeps it based on what happened: every error, every timeout, every run in the top few percent by token spend, and only a small slice of the cheap healthy runs. Random head-based sampling drops exactly the traces you will want when an incident hits.

The second is privacy. An agent trace is, by design, a near-complete recording of what the agent saw and did, and what it saw is often a customer's data: names, account details, medical or financial information, whatever sat in the prompt. Shipping that to an observability backend unfiltered is a compliance problem. OpenTelemetry's conventions do not capture prompt and completion content by default, and the recommended pattern stores content as span events rather than indexed attributes so it can be filtered or dropped at the collector before it ever lands. Many teams run PII redaction in the pipeline and choose self-hosting so trace data never leaves their boundary. It is solvable, but only if designed in, not bolted on after the first audit.

The third is the deepest, because it strikes at what alerting means. Conventional monitoring alerts on a threshold: error rate above one percent, page someone. That works when a metric has a stable baseline. An agent does not. The same input produces different tool sequences and different outputs across runs, so a raw count of "different from last time" is noise, not signal. You cannot alert on non-determinism by pretending it is determinism. The workable approaches accept the variance and measure around it. Golden-set replay runs a curated set of cases through production on a schedule and pages only when the aggregate score drops past a noise threshold for two runs in a row. Distribution tracking runs a judge model over sampled live traffic and watches the score distribution shift, which catches a silent quality regression after a prompt edit or a provider-side model update, the failure that leaves every span returning 200 while the answers quietly get worse. Alerting on an agent is statistical, and a team still wired for binary thresholds will either drown in false pages or miss the regression entirely.

None of this is optional for a production system, and the adoption data shows the field has accepted that. LangChain's State of Agent Engineering report, a survey of more than 1,300 practitioners run in late 2025, found 89 percent of organizations running some form of agent observability and 62 percent doing step-level tracing that inspects individual steps and tool calls. Among teams with agents already in production the numbers climb to 94 percent and 71.5 percent. The direction is toward observability and evaluation as one practice, with the conventions standardizing fast enough that traces become portable across tools. This is also where an implementation partner earns its place. The hard part of shipping an agent was never the model. It is the instrumentation, the sampling policy, the redaction pipeline, and the statistical alerting that together let a team see what a non-deterministic system is doing. Perform Digital builds that layer, because an agent you cannot trace is an agent you cannot trust, and an agent you cannot trust does not belong in production.

The shift to hold onto is this. With software you wrote, debugging was archaeology in your own code: find the line, read the path back. With an agent, the path was authored at runtime by a model, so debugging means seeing what the model saw and reconstructing why it chose what it chose. A trace is how you see it. It is the difference between an agent that fails and an agent that fails in a way you can find, and that difference is why agents get to run in production at all. It is the same discipline that decides whether an agent survives the move from demo to production.

Council summary

This post argues that agent observability is a separate discipline because the failure mode changed: a stack trace can locate a bug in code you wrote, but it has nothing to point at when the bug is a decision a model made at runtime. It walks the arc cleanly, from Google's 2010 Dapper paper and the trace-and-span model, to the four span types that make an agent run legible, the still-experimental OpenTelemetry GenAI conventions standardizing the vocabulary, and the tooling now competing in the category. The honest core is the future section, which refuses to pretend instrumentation is the hard part: volume and cost, PII in trace data, and alerting on a non-deterministic system are the problems a platform purchase does not solve. The reader's takeaway is concrete. Capture what the model saw at every step, sample on the tail not the head, redact at the collector, and alert statistically, because an agent you cannot trace is an agent you cannot trust in production.

Comments

Leave a comment

Your email won't be published. Comments are reviewed before they appear.
★ Read next