Agent Evaluation in Production: What to Measure and How to Prove It

What you get from this article

By the end, you will have a working vocabulary for production agent evaluation, a list of the metrics that matter and why, a clear picture of how agent failures hide from naive monitoring, and a concrete approach to building the evidence chain that lets you prove your agents work to an engineer, a customer, or a board.

Why model benchmarks do not transfer to production

A model benchmark measures the model in isolation: given a fixed prompt, does it produce the right answer? An agent evaluation measures task-level outcomes: did the agent accomplish the goal the user had, within acceptable cost and time, without doing anything it was not asked to do?

The gap between those two things is where most agent programs run into trouble. A model can score well on MMLU or HumanEval and still fail to complete a multi-step purchasing workflow because it loses track of state after six tool calls, or calls the wrong API when two endpoints have similar names. The benchmark tested the model; it did not test the system.

According to a Sinch survey of 2,527 senior decision makers across 10 countries, 62% of enterprises have AI agents live in production, and 74% of those agents are rolled back or shut down after launch. A rollback rate that high points less to model quality than to evaluation and deployment. You can read more about the categories of issues that drive those rollbacks on the problems page.

The core production metrics

These five metrics cover most of what you need to know about a live agent. They are not exhaustive, but if you cannot measure all five, you have gaps that will eventually surface as incidents.

Task success rate

Task success rate is the fraction of agent runs that achieve the intended outcome, as judged by a verifiable criterion. The criterion matters. For an agent that processes purchase orders, success might be: the order was placed, the confirmation number was recorded, and no line items were altered without authorization. For a customer service agent, success might be: the user's stated problem was resolved without escalation.

The criterion has to be defined before you deploy, not inferred from logs afterward. If you cannot write it down, you do not yet know what success means, and you should not be shipping to production.
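One way to write the criterion down is as a predicate over the final state of a run. The sketch below encodes the purchase-order example from above. The `OrderRun` type and all of its field names are hypothetical; they stand in for whatever your order system actually records.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrderRun:
    """Final state of one purchase-order agent run (hypothetical fields)."""
    order_placed: bool
    confirmation_number: Optional[str]
    requested_items: dict = field(default_factory=dict)   # sku -> quantity asked for
    placed_items: dict = field(default_factory=dict)      # sku -> quantity ordered
    authorized_changes: set = field(default_factory=set)  # skus approved for change


def is_success(run: OrderRun) -> bool:
    """Apply the written criterion: placed, confirmed, no unauthorized edits."""
    if not run.order_placed or not run.confirmation_number:
        return False
    for sku in set(run.requested_items) | set(run.placed_items):
        changed = run.requested_items.get(sku) != run.placed_items.get(sku)
        if changed and sku not in run.authorized_changes:
            return False
    return True


def task_success_rate(runs: list) -> float:
    """Fraction of runs that meet the criterion; 0.0 when there are no runs."""
    return sum(is_success(r) for r in runs) / len(runs) if runs else 0.0
```

Because the criterion is code, it can be reviewed before launch and applied unchanged to every run afterward.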

Quality scoring

Task success is binary. Quality scoring is continuous. An agent can succeed at placing an order and still produce a response that is factually wrong about the delivery date, or that uses a tone that violates your customer communication policy. Quality scoring applies rubrics, typically a mix of model-based evaluation and rule-based checks, to the content and behavior of each run.

The important design decision is separating quality from success. An agent that always succeeds but produces low-quality outputs is a different problem from one that fails outright. Each requires a different fix.
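A rubric can be sketched as a weighted list of checks, each returning a value between 0 and 1. The two rule-based checks below are illustrative inventions, not a recommended policy. A model-based evaluator would plug in as another callable with the same signature.

```python
from typing import Callable

# A check takes the agent's response and returns a value in [0, 1].
Check = Callable[[str], float]


def no_unhedged_delivery_promise(response: str) -> float:
    """Rule-based check (illustrative): penalize 'guaranteed' delivery claims."""
    return 0.0 if "guaranteed" in response.lower() else 1.0


def within_length_policy(response: str, max_words: int = 120) -> float:
    """Rule-based check (illustrative): responses should stay concise."""
    return 1.0 if len(response.split()) <= max_words else 0.0


def quality_score(response: str, rubric: list) -> float:
    """Weighted mean of rubric checks: a continuous score in [0, 1]."""
    total_weight = sum(weight for _, weight, _ in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(weight * check(response) for _, weight, check in rubric) / total_weight


# Rubric entries are (name, weight, check). Weights here are arbitrary.
RUBRIC = [
    ("delivery_claims", 2.0, no_unhedged_delivery_promise),
    ("length", 1.0, within_length_policy),
]
```

The score is computed independently of whether the task succeeded, which keeps the two signals separable when you aggregate them.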

Cost per task

Cost per task is the total compute cost, including model inference, tool calls, and retries, divided by completed tasks. It is easy to hit high task success rates by letting agents retry indefinitely or escalate every uncertain decision to a more expensive model. Cost per task surfaces that pattern. If cost per task rises without a corresponding rise in task success rate, the agent is becoming less efficient, not more capable.
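As a minimal sketch of that definition, assuming each run records its cost components in dollars (the `RunCost` fields are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    """Cost components for one agent run, in dollars (hypothetical fields)."""
    inference: float
    tool_calls: float
    retries: float
    completed: bool

    @property
    def total(self) -> float:
        return self.inference + self.tool_calls + self.retries


def cost_per_task(runs: list) -> float:
    """Total spend across all runs divided by the number of completed tasks."""
    completed = sum(r.completed for r in runs)
    if completed == 0:
        raise ValueError("no completed tasks; cost per task is undefined")
    return sum(r.total for r in runs) / completed
```

Note that spend on failed runs stays in the numerator, so retries and abandoned attempts raise the figure rather than disappearing from it.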

Latency

Latency for agents is more complex than for request-response APIs. An agent that takes 45 seconds to complete a task may be acceptable in a batch order processing context and completely unacceptable in a live customer conversation. Measure latency at the task level and at the span level. Span-level latency tells you where time is going: a slow tool call, a slow model response, a retry loop.
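The two levels of measurement might look like this, assuming each span carries a name and start and end times in seconds. The `Span` shape and the span names are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One timed unit of work within a run (hypothetical shape)."""
    name: str        # e.g. "model.infer", "tool.lookup_order"
    start_s: float   # start time, in seconds
    end_s: float     # end time, in seconds

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def task_latency(spans: list) -> float:
    """Wall-clock time from the first span start to the last span end."""
    return max(s.end_s for s in spans) - min(s.start_s for s in spans)


def latency_by_span(spans: list) -> list:
    """Total time per span name, slowest first, to show where time goes."""
    totals = defaultdict(float)
    for s in spans:
        totals[s.name] += s.duration_s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

A breakdown sorted by total time makes a slow tool call or a retry loop stand out even when the task-level number looks acceptable.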

Klarna's customer service agent is a useful reference point here. In its first month, the agent reduced response time from 11 minutes to under 2 minutes across 23 markets and 35 languages. That is a latency improvement that mattered to the business. The same deployment also showed a 25% reduction in repeat inquiries, which is a proxy for task success rate. By May 2025, however, Klarna had rehired human agents after quality degraded on complex cases, even as the system went on to scale significantly by Q3 2025. The latency metric held; the quality metric did not. Both are necessary.

Drift over time

A deployment is not a snapshot. Models change, data distributions shift, APIs evolve, and user behavior moves. Drift over time measures whether your task success rate and quality scores are moving in either direction, and at what rate.

Most teams check this too infrequently. A weekly review of aggregate metrics will miss a degradation that starts on a Tuesday and accelerates through the week before anyone looks. Daily automated scoring against a held-out set of representative tasks is a more reliable baseline.
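A simple version of that daily check compares a recent window of success rates against a baseline window. The window sizes and the 0.05 threshold below are arbitrary assumptions to tune for your own traffic.

```python
from statistics import mean
from typing import Optional


def drift_delta(daily_rates: list, baseline_days: int = 7,
                recent_days: int = 3) -> Optional[float]:
    """Recent mean minus baseline mean; negative means degradation.

    `daily_rates` holds one success rate per day, oldest first.
    Returns None when there is not enough history to compare.
    """
    if len(daily_rates) < baseline_days + recent_days:
        return None
    return mean(daily_rates[-recent_days:]) - mean(daily_rates[:baseline_days])


def drift_alert(daily_rates: list, threshold: float = 0.05, **window: int) -> bool:
    """Flag movement in either direction that exceeds the threshold."""
    delta = drift_delta(daily_rates, **window)
    return delta is not None and abs(delta) > threshold
```

Run against a held-out task set each day, a check like this surfaces a mid-week decline without waiting for a weekly review.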

How agent failures hide in production

Agent failures are less visible than API failures. A 500 error is obvious. An agent that confidently completes the wrong task leaves no error code.

Confident wrong actions

A confident wrong action occurs when the agent completes a task and returns a success signal, but the output is wrong. The most common pattern: the agent misinterprets an ambiguous instruction and takes an action that is locally coherent but globally incorrect. A purchase order agent that rounds a quantity to the nearest standard unit is doing something plausible; it may also be causing a supply chain error that takes days to surface.

The fix is not to make the model more confident. It is to validate the agent's output against a schema or a set of constraints before the action is committed. That validation step is where activity schemas earn their value: you define the set of actions the agent is permitted to take in a given context, and any action outside that set is flagged for review rather than executed.
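A pre-commit gate of this kind can be sketched as follows. The contexts, action names, and the quantity rule are hypothetical; the point is only that the check runs before the action executes and routes anything outside the envelope to review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take (hypothetical shape)."""
    name: str
    params: dict


# Permitted actions per context. Names and contexts are illustrative.
PERMITTED = {
    "purchase_order": {"lookup_sku", "place_order"},
    "order_status": {"lookup_order"},
}


def gate(action: ProposedAction, context: str, requested_qty=None) -> tuple:
    """Return ("execute" or "review", reason) before the action is committed."""
    if action.name not in PERMITTED.get(context, set()):
        return "review", f"{action.name!r} is not permitted in context {context!r}"
    if action.name == "place_order" and requested_qty is not None:
        # Catches the rounding case: the order must match what was asked for.
        if action.params.get("quantity") != requested_qty:
            return "review", "quantity differs from the requested quantity"
    return "execute", "within permitted envelope"
```

With this gate, an agent that rounds 12 units to 10 is held for review instead of silently placing the altered order.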

Ghost actions

A ghost action is something the agent does that nobody asked for. It might call an additional API endpoint during a lookup, send a notification the user did not request, or write a log entry to a location outside the expected scope. Ghost actions are particularly dangerous because they often succeed quietly and only become visible through their downstream effects.

Span-level tracing catches ghost actions. When every tool call is recorded as a span with its inputs and outputs, you can audit the full sequence of what the agent did during a task, not just whether it returned a success signal. Without that record, ghost actions are effectively invisible.

Silent degradation

Silent degradation is the pattern where task success rate and quality scores decline gradually, without any single event that triggers an alert. The Klarna case illustrates this: the agent handled 2.3 million conversations in its first month, with strong aggregate metrics, but quality on complex cases eroded over time until human agents had to be reintroduced.

A LangChain survey of 1,340 professionals in late 2025 found that 32% cited quality as the top barrier to production, and only 52% had implemented evals. If you are not running continuous evaluation, you are relying on users or support staff to notice quality declining, which is both slow and inconsistent.

Building the evidence chain

An evidence chain is the sequence of records that lets you trace from a user outcome back to the model decision that produced it. For agent evaluation, the chain has four layers.

Spans and traces

A span is a record of a single unit of work: one tool call, one model inference, one database read. A trace is the tree of spans that makes up one agent run. You get this data by instrumenting your agent with an SDK that records each span as it executes, capturing the inputs, outputs, timestamps, and any metadata you attach.

The span record is the foundation. Everything above it depends on having this data.
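To make the span and trace structure concrete, here is a hand-rolled recorder built on a context manager. It is a sketch for illustration, not the API of any particular SDK; the class, field, and span names are all assumptions.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """A single unit of work: inputs, output, timestamps, metadata."""
    name: str
    span_id: str
    parent_id: Optional[str]
    inputs: dict
    start: float
    end: Optional[float] = None
    output: Any = None
    metadata: dict = field(default_factory=dict)


class Tracer:
    """Records spans for one agent run; nesting gives the trace its tree shape."""

    def __init__(self) -> None:
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name: str, **inputs: Any):
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(name, uuid.uuid4().hex, parent, inputs, time.time())
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            # Close the span even if the wrapped work raised.
            s.end = time.time()
            self._stack.pop()


# Usage: one run with a nested tool call.
tracer = Tracer()
with tracer.span("agent.run", task="check order 42") as root:
    with tracer.span("tool.lookup_order", order_id=42) as call:
        call.output = {"status": "shipped"}
    root.output = "Order 42 has shipped."
```

Each child span stores its parent's ID, so the flat list of spans can be reassembled into the tree for a given run.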

Scores

Scores are computed from spans. A quality score might apply a rubric to the model's final response. A risk score might flag any tool call that touches a sensitive data category or exceeds an authorized dollar amount. A task success score applies your predefined success criterion to the trace as a whole.

Prefactor computes these scores from the span data it receives, applying both model-based and rule-based evaluators depending on what you have configured. The scores attach to the trace, so you can filter and aggregate them.
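A rule-based risk evaluator of the kind described above can be sketched in a few lines. This is a hand-written illustration, not Prefactor's API: the `ToolCall` shape, the sensitive categories, and the dollar limit are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """The parts of a tool-call span that a risk rule needs (hypothetical)."""
    tool: str
    data_categories: set = field(default_factory=set)
    amount_usd: float = 0.0


SENSITIVE = {"payment_card", "health"}   # illustrative categories
AUTHORIZED_LIMIT_USD = 500.0             # illustrative limit


def call_flags(call: ToolCall) -> list:
    """Return the reasons, if any, that this call is risky."""
    flags = []
    hit = call.data_categories & SENSITIVE
    if hit:
        flags.append(f"{call.tool}: touches sensitive data {sorted(hit)}")
    if call.amount_usd > AUTHORIZED_LIMIT_USD:
        flags.append(f"{call.tool}: ${call.amount_usd:.2f} exceeds authorized limit")
    return flags


def risk_score(calls: list) -> float:
    """Fraction of tool calls in the trace that raised at least one flag."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if call_flags(c)) / len(calls)
```

Keeping the flag reasons alongside the numeric score makes the result easier to act on when a trace is pulled up for review.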

Schema validation

Activity schemas define the expected behavior of an agent in a given context: which tools it may call, in what order, with what parameters. Validating a trace against its schema is different from scoring quality. Schema validation is a binary check: did the agent stay within the defined envelope? A trace that scores well on quality but fails schema validation has done something structurally unexpected, and that is worth knowing separately.
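As a rough sketch of that binary check, the validator below tests two things: every tool called is in the allowed set, and the required tools appear in the expected relative order. The `ActivitySchema` structure is hypothetical, and parameter checks are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivitySchema:
    """Expected envelope for one context (hypothetical structure)."""
    allowed_tools: frozenset
    required_order: tuple   # tools that must appear in this relative order


def validate(tool_calls: list, schema: ActivitySchema) -> list:
    """Return a list of violations; an empty list means the trace passed."""
    violations = [f"unexpected tool: {t}" for t in tool_calls
                  if t not in schema.allowed_tools]
    # Walk the calls once, checking that the required steps form an
    # ordered subsequence. The shared iterator never rewinds.
    remaining = iter(tool_calls)
    for step in schema.required_order:
        if not any(t == step for t in remaining):
            violations.append(f"missing or out of order: {step}")
            break
    return violations


SCHEMA = ActivitySchema(
    allowed_tools=frozenset({"lookup_sku", "check_budget", "place_order"}),
    required_order=("lookup_sku", "place_order"),
)
```

A trace passes only when the returned list is empty, which keeps the result binary while still reporting what went wrong.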

The audit trail

The audit trail is the immutable record of what happened, in sequence, for every run. It is what you show a customer who asks why their order was changed, or a regulator who asks what the agent decided and why. The audit trail is not a dashboard. It is a log of facts: this span executed at this time, with these inputs, and produced this output.
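One common way to make such a log tamper-evident is a hash chain, in which each entry commits to the one before it. The sketch below is an assumption about how immutability might be approached, not a description of any specific product. In-memory storage is used only to keep the example short; a real trail would also need write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class AuditTrail:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries = []

    def append(self, record: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev, record)
        self._entries.append({"record": record, "prev": prev, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True
```

If someone later edits a recorded input or output, `verify()` returns False, which is the property you want when showing the log to a customer or regulator.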

Notion's engineering team built and rebuilt their agent evaluation system through four or five complete iterations before shipping. They used observability tooling throughout to trace what the agent actually did, rather than what they expected it to do. That discipline is what allowed them to go from resolving 3 issues per day to 30 with their bug-triage agent.

Danfoss deployed an AI agent for B2B order management that now handles 80% of transactional decisions. That figure is credible to their operations team because they can trace each decision back to the agent's inputs and the rules it applied. Without that traceability, 80% would be an estimate.

Continuous evaluation versus launch-time evals

Launch-time evals test your agent against a fixed dataset before you ship. They are necessary but not sufficient. Production behavior diverges from your eval set for reasons that are hard to anticipate: users phrase requests differently, edge cases accumulate, upstream APIs change their response formats.

Continuous evaluation runs your scoring and validation logic against live production traces, on a schedule or in near-real-time. It does not require a separate eval dataset, though having one helps with regression testing. What it requires is that your spans and traces are being recorded, and that your scoring logic is running against them.

The practical architecture: instrument the agent to emit spans, route those spans to your evaluation platform, apply quality and risk scores, validate against activity schemas, and set threshold alerts so that a decline in task success rate or a spike in schema violations surfaces before users notice. You can read more about how these components fit together on the learn page.

If you are choosing between evaluation tooling options, the compare page breaks down the approaches.

Launch-time evals answer the question: is this agent ready to ship? Continuous evaluation answers the question: is this agent still working? Both questions need answers, on different schedules.

For teams building the business case, the problems page documents the categories of failures that continuous evaluation catches, with enough specificity to frame the conversation with a finance or legal stakeholder.

Where to start

Pick one agent in production, define its task success criterion in writing, and instrument it to record spans for one week. After that week, you will have enough data to compute a task success rate, identify the top three failure patterns, and set a baseline for drift. That is more evidence than most teams have after months of deployment.

Start evaluating your agents or read the docs to see how span instrumentation and scoring are configured.