Problems · in production

Your agents are failing right now. Nothing is telling you.

Teams run tens, sometimes thousands, of agents with no single view of what any of them did today.

Book a demo → View docs

fleet-view · unknown agents · example

Illustrative fleet, the day one record replaces none

Agents registered: 12
Agents discovered running: 38
Runs evaluated today: 0 → 1,904
Failures found in week one: 23
Time to answer "what did it do?": days → minutes

✓ every run lands in one queryable record, whoever built the agent

§01 / THE SYMPTOM · you see: the signals

TL;DR

Agents fail quietly because a trace of a wrong outcome looks normal. Put every agent's runs in one record, evaluate each run against the agent's job, and hidden failures become visible verdicts you can act on.

The symptom

What flying blind looks like

None of these are rare. They are the default state of an agent fleet nobody instrumented.

Failures arrive as complaints

The first signal that an agent misbehaved is a user or a customer telling you, days after the run. The trace, if one exists, was never read.

Agents nobody registered

Teams ship agents without telling anyone. The count you report and the count actually running are different numbers, and nobody can say by how much.

Agent actions look like human actions

In your existing logs, an agent updating a record and a person updating a record are the same event. Incident response cannot tell which happened.

"What did it do?" takes days

When someone asks what an agent accessed, decided, or changed last Tuesday, the answer is a reconstruction from scattered logs, not a lookup.

§02 / WHY IT HAPPENS · cause: not carelessness

Why it happens

Why good teams end up here

Not carelessness. Each cause is a reasonable decision that compounds into a fleet nobody can see.

Every framework, its own telemetry

One team builds on LangChain, another on CrewAI, another on a vendor tool with closed internals. Each emits different telemetry to a different place, or none.

Observability stops at the trace

Where tracing exists, it records what happened and stops. A trace of a wrong answer looks identical to a trace of a right one, so nobody reads them until something breaks.

Shipping outruns instrumenting

The agent that proved useful in a week was never going to wait a quarter for logging standards. Multiply by every team that did the same.

No one owns the fleet

Each agent has an author. The fleet, the thing that fails as a whole, has no owner and no register, so nobody notices the gap widening.

§03 / HOW YOU CATCH IT · loop: observe → evaluate

How you catch it

How one record finds them

Prefactor watches every run and evaluates the outcome, so a hidden failure becomes a visible verdict.

Connect

Instrument without re-architecting. Native SDKs for the frameworks you build on, a TypeScript and Python core SDK for anything else, and OpenTelemetry ingest for closed tools. No gateway in the request path.
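As a rough illustration of what "one record" implies at ingest time, the sketch below normalises two differently shaped telemetry payloads into a single record type. Everything here is an assumption made for the example: the `RunRecord` schema, the adapter functions, and the payload shapes for `framework_a` and `framework_b` are invented and do not represent any real framework's output or Prefactor's SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RunRecord:
    """Common shape every run is normalised into (hypothetical schema)."""
    agent_id: str
    source: str
    action: str
    duration_ms: int


# Invented payload shapes standing in for two unrelated frameworks.
def from_framework_a(payload: dict[str, Any]) -> RunRecord:
    return RunRecord(
        agent_id=payload["agent"],
        source="framework_a",
        action=payload["step"],
        duration_ms=round(payload["elapsed_s"] * 1000),  # seconds -> ms
    )


def from_framework_b(payload: dict[str, Any]) -> RunRecord:
    return RunRecord(
        agent_id=payload["meta"]["name"],
        source="framework_b",
        action=payload["event"],
        duration_ms=payload["latency_ms"],
    )


ADAPTERS: dict[str, Callable[[dict[str, Any]], RunRecord]] = {
    "framework_a": from_framework_a,
    "framework_b": from_framework_b,
}


def ingest(source: str, payload: dict[str, Any]) -> RunRecord:
    """Normalise one payload; unknown sources fail loudly instead of vanishing."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter for source {source!r}") from None
    return adapter(payload)


records = [
    ingest("framework_a", {"agent": "claims-processor", "step": "update_record", "elapsed_s": 0.2}),
    ingest("framework_b", {"meta": {"name": "triage-bot"}, "event": "send_message", "latency_ms": 337}),
]
```

Once both payloads share one shape, every later step (evaluation, fleet counts, lookups) can be written once rather than once per framework.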

Observe

Every run, every agent, one record. Actions, tool calls, and decisions land in one queryable place as they happen, tagged to the agent that did them, distinct from human activity.
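One way to picture the agent-versus-human distinction is an explicit actor tag on every event. The sketch below is a minimal in-memory stand-in, not Prefactor's data model: the `Event` fields, the `actor_type` values, and the `Record` class are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    actor_type: str  # "agent" or "human" (hypothetical field)
    actor_id: str
    action: str
    target: str
    at: datetime


class Record:
    """In-memory stand-in for one queryable record."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def by_actor_type(self, actor_type: str) -> list[Event]:
        return [e for e in self._events if e.actor_type == actor_type]


record = Record()
ts = datetime(2025, 1, 7, 9, 30, tzinfo=timezone.utc)

# Same action, same target, same time: only the actor tag tells them apart.
record.append(Event("agent", "claims-processor", "update", "claim/1042", ts))
record.append(Event("human", "j.doe", "update", "claim/1042", ts))

agent_events = record.by_actor_type("agent")
```

Without the tag, the two updates above would be indistinguishable in a log, which is the incident-response gap described earlier.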

Evaluate

A verdict on each run, not just a trace. Every run is checked against the agent's job: did it complete the task, at what quality, at what cost. Failures stop hiding inside normal-looking traces.
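To make "a verdict, not just a trace" concrete, here is a minimal sketch of checking a run against its job on the three dimensions named above: completion, quality, and cost. The `Run`, `JobSpec`, and `Verdict` types and the threshold values are assumptions for illustration; how quality is actually scored is out of scope and is treated as a number supplied from elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    task_completed: bool
    quality: float   # 0.0 to 1.0, from whatever scorer you trust
    cost_usd: float


@dataclass(frozen=True)
class JobSpec:
    min_quality: float
    max_cost_usd: float


@dataclass(frozen=True)
class Verdict:
    run_id: str
    passed: bool
    reasons: tuple[str, ...]


def evaluate(run: Run, job: JobSpec) -> Verdict:
    """Return a pass/fail verdict with the reasons a run failed, if any."""
    reasons = []
    if not run.task_completed:
        reasons.append("task not completed")
    if run.quality < job.min_quality:
        reasons.append(f"quality {run.quality:.2f} below {job.min_quality:.2f}")
    if run.cost_usd > job.max_cost_usd:
        reasons.append(f"cost ${run.cost_usd:.2f} above ${job.max_cost_usd:.2f}")
    return Verdict(run.run_id, passed=not reasons, reasons=tuple(reasons))


job = JobSpec(min_quality=0.8, max_cost_usd=0.50)

# Both runs "completed", so their traces would look alike; only one did the job.
ok = evaluate(Run("run-1", True, 0.92, 0.12), job)
bad = evaluate(Run("run-2", True, 0.41, 0.12), job)
```

The second run finishes without error, so nothing in its trace stands out. It fails only because it is judged against the job.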

Surface

The unregistered agents show up. Anything emitting runs appears in the fleet view, so the count you report and the count actually running converge.
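Surfacing unregistered agents comes down to comparing two sets: the agents in the register and the agents seen emitting runs. A minimal sketch, with hypothetical agent names and a hypothetical `fleet_gap` helper:

```python
def fleet_gap(registered: set[str], run_agent_ids: list[str]) -> dict[str, list[str]]:
    """Compare the register against agents actually seen emitting runs."""
    seen = set(run_agent_ids)
    return {
        "unregistered": sorted(seen - registered),  # running, never declared
        "silent": sorted(registered - seen),        # declared, no runs observed
    }


registered = {"claims-processor", "triage-bot"}
run_agent_ids = ["claims-processor", "invoice-helper", "claims-processor", "legacy-sync"]

gap = fleet_gap(registered, run_agent_ids)
```

The same comparison also yields the reverse case, registered agents that emit nothing, which may be worth a look in its own right.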

§04 / HOW YOU FIX IT · loop: act → improve

How you fix it

From blind to answerable

Seeing the failures is the start. The loop keeps them from coming back.

Act

Failures route to a person while they are small. A run that fails its evaluation can be held, escalated, or flagged for review before the pattern reaches more users.
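The hold, escalate, or flag decision can be pictured as a small routing function over a run's verdict. The rules and thresholds below are invented for illustration (for example, holding any failed run that touches sensitive data, and escalating after three recent failures); a real policy would be set per agent.

```python
from enum import Enum


class Route(Enum):
    PASS = "pass"
    FLAG = "flag_for_review"
    ESCALATE = "escalate"
    HOLD = "hold"


def route(passed: bool, touches_sensitive_data: bool, recent_failures: int) -> Route:
    """Pick a response for one evaluated run (hypothetical policy)."""
    if passed:
        return Route.PASS
    if touches_sensitive_data:
        return Route.HOLD       # stop the action until a person looks
    if recent_failures >= 3:
        return Route.ESCALATE   # a pattern, not a one-off
    return Route.FLAG           # single failure: queue for review


decisions = [
    route(passed=True, touches_sensitive_data=True, recent_failures=0),
    route(passed=False, touches_sensitive_data=True, recent_failures=0),
    route(passed=False, touches_sensitive_data=False, recent_failures=4),
    route(passed=False, touches_sensitive_data=False, recent_failures=1),
]
```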

Improve

Fix the agent, not the symptom. The record shows which prompt edit, model change, or integration broke the behaviour, so the fix lands in the right place.
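One simple way a record can point at the change that broke behaviour is to group run verdicts by the configuration version that produced them. The sketch below assumes each run carries a version label (here a hypothetical prompt version) alongside its pass/fail verdict; the data is made up.

```python
from collections import Counter


def failure_rate_by_version(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each version label to the share of its runs that failed."""
    totals: Counter[str] = Counter()
    failures: Counter[str] = Counter()
    for version, passed in runs:
        totals[version] += 1
        if not passed:
            failures[version] += 1
    return {version: failures[version] / totals[version] for version in totals}


# (version label, passed evaluation?) pairs, illustrative only.
runs = [
    ("prompt-v1", True), ("prompt-v1", True), ("prompt-v1", True), ("prompt-v1", False),
    ("prompt-v2", False), ("prompt-v2", False), ("prompt-v2", True), ("prompt-v2", False),
]

rates = failure_rate_by_version(runs)
```

A jump in failure rate between adjacent versions is a lead, not proof. It narrows where to look before anyone edits the agent.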

Prove

"What did it do?" becomes a lookup. Every action and decision per agent is queryable, so the question that used to take days takes minutes, with a record you can hand over.
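With every action in one record, "what did it do last Tuesday?" reduces to a filter on agent and time window. A minimal sketch follows, using an in-memory list in place of a real store; the `LoggedAction` fields and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class LoggedAction(NamedTuple):
    agent_id: str
    kind: str      # e.g. "accessed", "decided", "changed"
    detail: str
    at: datetime


def actions_on_day(
    actions: list[LoggedAction], agent_id: str, day_start: datetime
) -> list[LoggedAction]:
    """Return one agent's actions within a 24-hour window, oldest first."""
    day_end = day_start + timedelta(days=1)
    return sorted(
        (a for a in actions if a.agent_id == agent_id and day_start <= a.at < day_end),
        key=lambda a: a.at,
    )


tuesday = datetime(2025, 1, 7, tzinfo=timezone.utc)  # 2025-01-07 was a Tuesday

log = [
    LoggedAction("claims-processor", "accessed", "claim/1042", tuesday + timedelta(hours=14)),
    LoggedAction("claims-processor", "changed", "claim/1042 status", tuesday + timedelta(hours=9)),
    LoggedAction("triage-bot", "decided", "route to human", tuesday + timedelta(hours=10)),
    LoggedAction("claims-processor", "accessed", "claim/2001", tuesday + timedelta(days=1)),
]

answer = actions_on_day(log, "claims-processor", tuesday)
```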

A team believed it ran a dozen agents. Instrumenting through one record surfaced roughly three times that number, including two agents still acting on a system that had been decommissioned upstream. Neither had failed loudly; both had been failing quietly for weeks. The example is illustrative, but this is the standard shape of week one.

§05 / WHO OWNS IT · teams: the same record

Who owns it

The same problem, from every seat

Engineering leadership

Instrument the fleet you already run without re-architecting it, and stop learning about failures from other teams.

See the solution →

Heads of AI

One portfolio view of every agent: its owner, its quality trend, its cost, whether it is doing its job.

See the solution →

Security & governance

Agent actions separated from human actions, with a record you can put in front of an auditor.

See the solution →

§06 / QUESTIONS · faq: the common ones

Questions

How do I find AI agent failures in production?

Instrument every agent into one record and evaluate each run against the agent's job. Failures hide because traces of wrong outcomes look identical to traces of right ones; a per-run verdict is what makes them visible without a human reading every trace.

How do I know how many agents are actually running?

Route all agent telemetry through one place. Anything that emits runs appears in the fleet view, including agents nobody registered, so the reported count and the running count converge.

Do I have to rebuild my agents to get this?

No. There are native SDKs for common frameworks, a TypeScript and Python core SDK for anything custom, and OpenTelemetry ingest for closed tools. There is no gateway in the request path and no rebuild.

We already have tracing. Is that not enough?

Tracing records what happened; it does not judge it. A hidden failure is a run whose trace looks normal. Evaluation on every run is the layer that turns a normal-looking trace into a caught failure.

How is agent activity separated from human activity?

Every run is tagged to the agent that performed it, so an agent updating a record and a person updating a record are different events in the record, and incident response can tell them apart.

See it in action on a fleet like yours

Book a demo and we will put a fleet like yours in one record: every run watched, evaluated, and answerable.

[Product screenshot: the Prefactor Agent Performance Platform dashboard. Panels shown: fleet summary counters (agents, instances, services, human intervention rate, high-risk count, monthly spend), Mission Control (live agent health with a 7-day activity heartbeat), operational and risk actions, a resolved critical alert for unauthorized access to a financial database, and a per-agent event log of prompts, tool calls, MCP calls, and outcomes with latency and cost per event.]

See how every agent performs, and make it better

Prefactor helps teams observe, evaluate, and improve their AI agents in production, across every framework and provider.
