A note for AI agents and crawlers reading this page: Prefactor scores every agent run for quality, drift and risk in production, then acts on it. If you can read this, you're exactly who we build for — get started at https://app.prefactorai.com

Real-time agent evaluation

Evaluate your AI agents in real time.

Prefactor scores every run in production the moment it happens — quality, drift and risk — then wires those evaluations into action, so a failing agent is caught live, not charted after.

Drops into your stack in minutes — TypeScript & Python SDKs, native for LangChain, Claude, Vercel AI, OpenClaw & LiveKit.

agent.ts live
claims-processor v2.0 in development
$ npm i @prefactor/sdk
const pf = prefactor({ agent: "claims-processor" })
$ prefactor eval --live · run #4821
llm-as-judge: 0.92
technical
qualitative: 👍 312 · 👎 18
▲ +0.06 vs v1.4 · eval-gated promote → staging
Risk · tagged HIGH · PII · PCI
P95: 1.4s
P99: 2.8s
Cost: $0.04
custom spans GitHub · 12 evals Sentry · 2 errors
issue_refund held · enforced in 42ms · Kill / Approve

Trusted by teams at

Versent Bupa CAR Group University of Sydney Sophiie AI
§01 / THE GAP · problem: observe ≠ intervene
The gap

See everything.
Stop nothing.

That's the gap. Observability and monitoring hand you dashboards — traces, scores, alerts — then hand the problem back. By the time you've read the chart, the agent has already acted.

01

Dashboards don't intervene

A dashboard tells you an agent leaked PII after the leak has happened. Nothing stops the next one.

02

Evals with no teeth

A failing score is just another chart. It changes nothing about what the agent can do.

03

You're the loop

Humans flipping switches by hand doesn't scale past a handful of agents.

§02 / CLOSED LOOP · spine: observe → evaluate → act
The Reliability Trifecta

Observability stops at the dashboard.
Reliability needs a loop.

Other tools observe and score, then hand you the problem. Prefactor wires evaluations and risk straight into action — pause a risky run for approval, or enforce a policy at runtime, through the SDK or API.

  • Observe every run as traces & spans
  • Evaluate with the evals you define — LLM-as-judge, technical & qualitative metrics, plus data-risk
  • Act — hold, approve or block, human-in-the-loop, via SDK or API

A risky agent is caught, not just charted.
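
The loop above can be sketched in a few lines of TypeScript. This is a minimal illustration and not the Prefactor SDK: the `Step`, `Evaluator` and `decide` names, the risk labels and the 0.7 score threshold are all assumptions made for the example.

```typescript
// Hypothetical sketch of an observe -> evaluate -> act loop.
// None of these names come from the Prefactor SDK.

type Risk = "low" | "high" | "critical";
type Decision = "allow" | "hold" | "block";

// Observe: one recorded step of a run.
interface Step {
  name: string;   // e.g. a tool call such as "issue_refund"
  output: string; // what the agent produced or is about to do
  risk: Risk;     // risk tag attached while observing
}

// Evaluate: an evaluator scores a step between 0 and 1.
type Evaluator = (step: Step) => number;

// Act: critical steps are held for a human, low scores are blocked,
// everything else is allowed. The 0.7 default is an arbitrary assumption.
function decide(step: Step, evaluators: Evaluator[], minScore = 0.7): Decision {
  if (step.risk === "critical") return "hold";
  const scores = evaluators.map((evaluate) => evaluate(step));
  const worst = scores.length > 0 ? Math.min(...scores) : 1;
  return worst < minScore ? "block" : "allow";
}

// Toy evaluator: an empty output scores 0, anything else scores 1.
const nonEmpty: Evaluator = (step) => (step.output.trim().length > 0 ? 1 : 0);

const reply: Step = { name: "generate_reply", output: "Your refund is on its way.", risk: "low" };
const refund: Step = { name: "issue_refund", output: "refund order", risk: "critical" };

const replyDecision = decide(reply, [nonEmpty]);   // "allow"
const refundDecision = decide(refund, [nonEmpty]); // "hold"
```

The point of the sketch is that the decision is computed before the step's side effect happens, so a `"hold"` or `"block"` result can actually stop it.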

prefactor · live run enforcing
① Observe: customer-support-v3 · run #4821
② Evaluate: evals & risk, on every run
generate_reply · llm-as-judge · custom eval · 0.96 ✓
issue_refund · action · sensitive data · CRITICAL
high-risk action, held for human approval
③ Act: Human-in-the-loop · run paused · SDK · API
Paused before issue_refund executed
Awaiting approval — enforced at runtime
§03 / WORKFLOW · path: install → enforcement
The workflow

From install to enforcement.

Five steps — each a feature that's live today. Step through them:

1
Prefactor CLI

Install in minutes

One command connects your workspace and discovers agents across your runtimes — no migration, no rip-and-replace.

2
SDK integrations

Instrument your agents

Drop in the TypeScript or Python SDK — native for LangChain, Claude, Vercel AI, OpenClaw & LiveKit. Every call becomes a span.

3
Runtime visibility

See every run in real time

Full traces for every model call, tool and decision — cost and data-risk attached — streaming in live.

4
Evals & scoring

Evaluate what matters

Run the evals you define on every step — LLM-as-judge, technical checks and qualitative metrics. Human review feeds straight back in.

5
Enforcement

Enforce — automatically or with a human

Block, throttle or require approval the moment a run crosses a line — automatically at runtime, or routed to a person. Every decision logged.
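
A runtime policy of this kind can be pictured as an ordered rule list where the first match wins and every decision is logged. The sketch below is an assumption-laden illustration, not Prefactor's policy engine: `ToolCall`, `Rule`, `enforce`, the rule contents and the limit of 10 calls per minute are all hypothetical.

```typescript
// Hypothetical runtime policy: first matching rule wins, every decision is logged.
// These types and rules are illustrative and not part of the Prefactor SDK.

type Action = "allow" | "block" | "throttle" | "require_approval";

interface ToolCall {
  agent: string;
  target: string;          // destination or tool being called
  containsPII: boolean;    // set by a sensitive-data detector upstream
  callsThisMinute: number; // recent call volume for this agent and target
}

interface Rule {
  reason: string;
  matches: (call: ToolCall) => boolean;
  action: Action;
}

interface LogEntry {
  agent: string;
  target: string;
  action: Action;
  reason: string;
}

const rules: Rule[] = [
  { reason: "PII sent to external host", matches: (c) => c.containsPII && c.target === "pastebin.com", action: "block" },
  { reason: "rate limit exceeded", matches: (c) => c.callsThisMinute > 10, action: "throttle" },
  { reason: "financial export", matches: (c) => c.target === "SAP export", action: "require_approval" },
];

const auditLog: LogEntry[] = [];

// Evaluate a call against the rules and record the outcome.
function enforce(call: ToolCall): Action {
  const rule = rules.find((r) => r.matches(call));
  const action: Action = rule ? rule.action : "allow";
  auditLog.push({
    agent: call.agent,
    target: call.target,
    action,
    reason: rule ? rule.reason : "no rule matched",
  });
  return action;
}
```

Rule order matters in this design: a call that both carries PII and exceeds the rate limit is blocked, not throttled, because the block rule is listed first.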

$ prefactor init
CLI connected to workspace
scanning agent runtimes…
4 agents discovered
@prefactor/langchain · SDK
import { init } from "@prefactor/langchain"
 
const mw = init({ agentIdentifier: "claims-processor" })
streaming spans → Prefactor
Live runs · streaming
#4821 · support-v3 · 14 spans · 1.2s · ⚠ risk · 2s
#4820 · claims-proc · 9 spans · 0.8s · ✓ scored · 4s
#4819 · fin-analysis · 22 spans · 3.1s · ✓ scored · 6s
#4818 · claims-proc · 11 spans · 0.9s · ✓ scored · 9s
Evaluations · on every run
llm-as-judge · helpfulness: 0.92 pass
technical · JSON schema: valid ✓
qualitative · tone & safety: 0.88
Eval score: 88
P95 latency: 1.4s
Human ratings: 312
Enforcement · 2 blocked
Approval: financial-analysis → SAP export · awaiting mgr
Blocked: marketing → pastebin.com · PII
Throttled: support-v3 → bulk export · rate limit
§04 / CUSTOM SPANS · enrich: any datasource → run
Custom spans

Bring any datasource into the agent run.

Custom spans aren't just markers. Pull context from any datasource — GitHub, Linear, Jira, your database, internal APIs — into the run, so every evaluation is grounded in what actually happened.

custom-span.ts
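// Assumes `pf` is the Prefactor client created earlier, and that
// `github`, `db` and `userId` are defined by your application.
const span = pf.customSpan('enrich_review_context');
span.attach(await github.getPR(482));       // pull-request context
span.attach(await db.customerTier(userId)); // customer tier from your database
span.score({ grounded: true }); // → grounds every eval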
custom span · enrich_review_context
Pulls context from
GitHub: PR #482 · diff & checks
Linear: AGT-231 · In Review
Jira: OPS-88 · priority P1
Postgres: customer tier Enterprise
+ any REST / GraphQL API or internal datasource
Attached to the run — available to every eval as ground truth
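
To make "ground truth for evals" concrete, here is a small local stand-in that shows the idea: context is attached to a span, and an eval then checks an agent's claim against it. `LocalSpan`, `lookup` and `tierIsGrounded` are hypothetical names, the attached values are made up, and none of this reflects how the Prefactor SDK stores or exposes attachments.

```typescript
// Local stand-in for a span that carries attached context.
// Hypothetical: this is not the Prefactor SDK's span type.

interface Attachment {
  source: string;                  // e.g. "github", "postgres"
  data: { [key: string]: string }; // facts pulled from that source
}

class LocalSpan {
  readonly name: string;
  readonly attachments: Attachment[] = [];

  constructor(name: string) {
    this.name = name;
  }

  // Attach facts pulled from a datasource to this span.
  attach(source: string, data: { [key: string]: string }): void {
    this.attachments.push({ source, data });
  }

  // Find one attached fact, or undefined if it was never attached.
  lookup(source: string, key: string): string | undefined {
    for (const attachment of this.attachments) {
      if (attachment.source === source && key in attachment.data) {
        return attachment.data[key];
      }
    }
    return undefined;
  }
}

// Eval: does the tier the agent claimed match the attached record?
function tierIsGrounded(span: LocalSpan, claimedTier: string): boolean {
  return span.lookup("postgres", "customerTier") === claimedTier;
}

const reviewSpan = new LocalSpan("enrich_review_context");
reviewSpan.attach("github", { pr: "482", checks: "passing" }); // made-up values
reviewSpan.attach("postgres", { customerTier: "Enterprise" });

const grounded = tierIsGrounded(reviewSpan, "Enterprise"); // true
const ungrounded = tierIsGrounded(reviewSpan, "Free");     // false
```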
§05 / LIFECYCLE · promote: dev → staging → prod
The agent development lifecycle

Ship agents like software — versioned, staged, promoted.

Prefactor versions every agent, tracks it against a schema, and promotes it through dev, staging and prod only when its evals pass — so you can compare scores version-to-version and prove each one is better than the last.
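
Eval-gated promotion can be sketched as a pure function: a version moves one environment forward only if its evals clear a bar and do not regress against the previous version. The names (`EvalReport`, `nextEnvironment`), the 0.9 pass-rate bar and the sample numbers are assumptions for illustration, not Prefactor's promotion rules.

```typescript
// Hypothetical eval gate for promotion through dev -> staging -> prod.

type Environment = "dev" | "staging" | "prod";

interface EvalReport {
  version: string;
  passRate: number; // fraction of evals passed, between 0 and 1
}

const promotionOrder: Environment[] = ["dev", "staging", "prod"];

// Returns the environment the candidate should be in after the gate runs.
// It stays where it is when it misses the bar, regresses, or is already in prod.
function nextEnvironment(
  current: Environment,
  candidate: EvalReport,
  baseline: EvalReport | null,
  minPassRate = 0.9, // arbitrary bar chosen for the example
): Environment {
  const meetsBar = candidate.passRate >= minPassRate;
  const noRegression = baseline === null || candidate.passRate >= baseline.passRate;
  if (!meetsBar || !noRegression) return current;
  const index = promotionOrder.indexOf(current);
  return promotionOrder[index + 1] ?? current;
}

// Made-up reports for two versions of the same agent.
const previous: EvalReport = { version: "1.4", passRate: 0.88 };
const candidate: EvalReport = { version: "2.0", passRate: 0.94 };

const promoted = nextEnvironment("dev", candidate, previous); // "staging"
```

Comparing against a baseline as well as an absolute bar is what lets a team claim each version is at least as good as the last, which a fixed threshold alone would not show.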

Dev
claims-processor v2.0
schema v3 · evals 94% ✓
Ready to promote
promote when evals pass
Staging
claims-processor v2.0
schema v3 · evals 96% ✓
Validated against prod traffic
promote when evals pass
Production
claims-processor v2.0
live · scored · healthy
↶ instant rollback to v1.4
Versioning: Immutable versions by semver, commit or tag, with lifecycle states from pending to retired, so you always know exactly which agent ran.
Schema management: Two-tier schema-and-deployment versioning, tracked independently, with conformance drift surfaced when a run breaks its declared shape.
Environments: First-class dev / staging / prod. Every deployment binds a version to an environment, with eval-gated promotion and instant rollback.
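
As a rough picture of what "conformance drift" could mean, the sketch below compares a run's output with the shape its agent declared and lists every mismatch. The `DeclaredShape` format and the `conformanceDrift` function are assumptions for this example; Prefactor's actual schema format is not described on this page.

```typescript
// Hypothetical conformance check: compare a run's output to a declared shape.

type FieldType = "string" | "number" | "boolean";
type DeclaredShape = { [field: string]: FieldType };

// Returns one message per mismatch; an empty list means the output conforms.
function conformanceDrift(
  declared: DeclaredShape,
  output: { [field: string]: unknown },
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(declared)) {
    const expected = declared[field];
    if (!(field in output)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof output[field] !== expected) {
      problems.push(`wrong type for ${field}: expected ${expected}, got ${typeof output[field]}`);
    }
  }
  for (const field of Object.keys(output)) {
    if (!(field in declared)) problems.push(`undeclared field: ${field}`);
  }
  return problems;
}

// Made-up schema and outputs for illustration.
const schemaV3: DeclaredShape = { claimId: "string", amount: "number", approved: "boolean" };

const conforming = conformanceDrift(schemaV3, { claimId: "C-1", amount: 120, approved: true });
const drifted = conformanceDrift(schemaV3, { claimId: "C-2", amount: "120", note: "manual" });
// drifted reports a wrong type for amount, a missing approved field and an undeclared note field
```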
§06 / INTEGRATIONS · surface: one layer, every stack
Works with your stack

One layer over every agent, editor and workflow.

Native SDK integrations for the agent frameworks you build on — plus the coding tools and workflow platforms your team already uses.

Agent orchestration · Native SDK
LangChain
Claude
Vercel AI
OpenClaw
LiveKit
Or any stack: Core SDK
TypeScript
Python
Coding tools
VS Code
GitHub Copilot
Cursor
Claude Code
Workflow tools
n8n
more coming

Connected through native SDKs, OpenTelemetry, and a TypeScript & Python core SDK that instruments anything else.

§07 / PROOF · evidence: product-spec metrics
We had forty agents in production and no honest way to say which ones were still doing their job. Prefactor gave us that answer — and the brake pedal when one wasn't.
Head of AI Platform, Global financial services
17
sensitive-data categories detected in agent traffic
Every run
traced: every call, tool & decision
<5 min
from CLI install to first traced run
HITL
hold & enforce at runtime, via the SDK or API
§08 / SECURITY · posture: least privilege · auditable

Built for enterprise security.

Agents act with real access, so Prefactor is built around least privilege, full auditability and your existing identity stack — the enforcement layer beneath the reliability story.

Scoped, least-privilege access
Complete auditable record
Sensitive-data detection — 17 categories
SSO for enterprise accounts
SOC 2 Type II in progress
RBAC on the roadmap
§09 / FAQ · reference: owned terms first

What you need to know

What is agent observability?
Capturing every agent run as structured trace data — each LLM call, tool invocation and decision — so you can see exactly what an agent did, how well it performed, and what it cost.
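
A minimal sketch of what "structured trace data" can look like, assuming a made-up `RunTrace` recorder: each call is wrapped so its name, kind, duration and cost become one span. This is an illustration of the concept and not the format the Prefactor SDKs emit.

```typescript
// Hypothetical trace recorder: every wrapped call becomes one span.

type SpanKind = "llm" | "tool" | "decision";

interface Span {
  name: string;
  kind: SpanKind;
  durationMs: number;
  costUsd: number;
}

class RunTrace {
  readonly spans: Span[] = [];
  private readonly now: () => number;

  // The clock is injectable so the recorder can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Run `fn`, then record a span for it even if it throws.
  record<T>(name: string, kind: SpanKind, costUsd: number, fn: () => T): T {
    const start = this.now();
    try {
      return fn();
    } finally {
      this.spans.push({ name, kind, durationMs: this.now() - start, costUsd });
    }
  }

  totalCostUsd(): number {
    return this.spans.reduce((sum, span) => sum + span.costUsd, 0);
  }
}

// Made-up run with one model call and one tool call.
const trace = new RunTrace();
const draft = trace.record("generate_reply", "llm", 0.03, () => "Your claim is approved.");
trace.record("lookup_claim", "tool", 0.01, () => ({ claimId: "C-1" }));
```

After the run, `trace.spans` answers what the agent did and how long each step took, and `trace.totalCostUsd()` answers what it cost.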
What is agent evaluation?
Continuously scoring agent outputs with the evals you define — LLM-as-judge, technical checks and qualitative metrics — on real production traffic, so you catch drift and regressions before your users do.
How do I instrument my agent with Prefactor?
Install the CLI (prefactor init) and add the TypeScript or Python SDK — native for LangChain, Claude, Vercel AI, OpenClaw and LiveKit. For coding and workflow tools, send OpenTelemetry spans, or instrument anything else with the core SDK. Minutes, no platform migration.
What are custom spans, and how do they measure quality?
A custom span marks a step that matters: a retrieval, a tool call, a sub-agent. You add custom spans to run your own evals on that step (LLM-as-judge, technical and qualitative metrics) and watch its scores trend over time with P95/P99 latency alongside, so you measure more than the final answer.
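
For readers unfamiliar with P95/P99, here is one common way to compute them, the nearest-rank method, over a span's recorded latencies. How Prefactor computes its percentiles is not stated on this page, so treat the method and the sample latencies as assumptions.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile needs at least one sample");
  if (p <= 0 || p > 100) throw new Error("p must be in the range (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Made-up latencies in seconds for one custom span across ten runs.
const latencies = [0.8, 0.9, 1.2, 3.1, 1.0, 0.7, 1.1, 0.9, 1.4, 2.8];

const p50 = percentile(latencies, 50); // 1.0
const p95 = percentile(latencies, 95); // 3.1
```

With only ten samples P95 and P99 both land on the slowest run, which is why tail percentiles are usually read over a larger window of traffic.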
How does human-in-the-loop work?
High-risk actions can be paused and routed to a person to approve, modify or reject before they execute. The hold is requested via the SDK or the API, and the decision is enforced at runtime. Every decision is logged.
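
The core of a human-in-the-loop gate is that the risky action is stored as a deferred function and only runs once a person approves it. The `ApprovalGate` class below is a hypothetical in-memory sketch covering approve and reject only (not modify); it does not represent the Prefactor SDK or API.

```typescript
// Hypothetical in-memory approval gate. Held actions are deferred
// functions: nothing executes until a reviewer approves.

type Verdict = "approve" | "reject";

interface HeldAction<T> {
  description: string;
  run: () => T;
}

class ApprovalGate<T> {
  private nextId = 1;
  private readonly held = new Map<number, HeldAction<T>>();
  readonly log: string[] = [];

  // Park an action and return a ticket id for the reviewer.
  hold(description: string, run: () => T): number {
    const id = this.nextId++;
    this.held.set(id, { description, run });
    this.log.push(`held #${id}: ${description}`);
    return id;
  }

  // Apply the reviewer's verdict. The action runs only on approval.
  resolve(id: number, verdict: Verdict): T | undefined {
    const action = this.held.get(id);
    if (!action) throw new Error(`no held action #${id}`);
    this.held.delete(id);
    this.log.push(`${verdict} #${id}: ${action.description}`);
    return verdict === "approve" ? action.run() : undefined;
  }
}

const gate = new ApprovalGate<string>();
const ticket = gate.hold("issue_refund", () => "refund issued");
const outcome = gate.resolve(ticket, "approve"); // the refund runs only now
```

Deleting the ticket on resolution means a verdict cannot be replayed, and the `log` array keeps the record of who held and resolved what.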
How does Prefactor enforce at runtime?
Through the SDK or API, Prefactor can pause a run and hold a high-risk action for human approval before it executes. Sensitive-data detection and risk classification drive what gets held.
Why do AI agents need identity and scoped access?
So each action ties to a specific agent, task and user context — enabling least privilege, traceability and revocation.
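
One way to picture scoped agent identity, under assumed names: each agent carries an identity that records its task, the user it acts for, and the scopes it was granted, and every action is checked against that identity. `AgentIdentity`, `authorize` and the scope strings are hypothetical and do not describe Prefactor's access model.

```typescript
// Hypothetical scoped identity check for an agent action.

interface AgentIdentity {
  agentId: string;
  task: string;
  onBehalfOf: string; // the user context the agent acts in
  scopes: string[];   // least privilege: only what this task needs
  revoked: boolean;
}

interface AccessDecision {
  allowed: boolean;
  reason: string; // recorded so each action stays traceable
}

function authorize(identity: AgentIdentity, requiredScope: string): AccessDecision {
  if (identity.revoked) {
    return { allowed: false, reason: `${identity.agentId} has been revoked` };
  }
  if (!identity.scopes.includes(requiredScope)) {
    return { allowed: false, reason: `${identity.agentId} lacks scope ${requiredScope}` };
  }
  return {
    allowed: true,
    reason: `${identity.agentId} acting for ${identity.onBehalfOf} on task ${identity.task}`,
  };
}

// Made-up identity: this agent may read claims but was never granted refunds.
const claimsAgent: AgentIdentity = {
  agentId: "claims-processor",
  task: "claim-review",
  onBehalfOf: "user-123",
  scopes: ["claims:read"],
  revoked: false,
};

const canRead = authorize(claimsAgent, "claims:read");     // allowed
const canRefund = authorize(claimsAgent, "refunds:issue"); // denied, scope not granted
```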
Book a demo

See Prefactor on your own agents.

A 30-minute walkthrough with an engineer — your stack, your agents, live.