One call, one connected story — not three disconnected logs
A voice agent's real lifecycle spans more than the call itself: a conversational layer, a background agent that acts, and a workflow that runs after the call ends.
Built on LiveKit + Ultravox + ElevenLabs — a real design-partner deployment, anonymised.
- p95/p99 tracked per span type — latency, cost, and quality together
- Risk tracked as a trend, per agent layer, not just per call
- Business-logic-level spans capture intent, not just tool calls
Voice agents often split into a low-risk conversational layer and a higher-risk background agent that performs actions — plus a workflow that runs after the call hangs up. Prefactor gives each layer its own risk profile tracked over time, captures intent-level spans instead of raw tool calls, runs your own quality definitions as native evals, and links the whole chain — call, background agent, after-call workflow — back to one conversation.
The problem
A voice agent has two tiers: one that talks to the customer (read-only, so it can do little damage) and one that acts on their behalf in the background (mutating, gated behind approval). Quality metrics are specific to the team (clarifying-question iterations, voice glitches, latency, hallucination) and have no formal weighting between them today. They are cross-referenced manually against error monitoring and customer feedback about once a week. The call, the background agent, and whatever runs after the call ends get checked as three separate things instead of one connected story.
How it works in Prefactor
p95 and p99 are tracked per span type for cost and quality as well as latency, so a degrading slice of calls shows up as a number before it shows up as a pattern of complaints.
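To make the per-span-type idea concrete, here is a minimal Python sketch of tail percentiles grouped by span type. It is not Prefactor's implementation: the record fields (`type`, `latency_ms`, `cost_usd`), the sample values, and the nearest-rank percentile method are all assumptions chosen for illustration.

```python
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, computed with integers to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def tail_by_span_type(spans, metric):
    """Return {span_type: {"p95": ..., "p99": ...}} for one metric."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["type"]].append(span[metric])
    return {
        span_type: {"p95": percentile(vals, 95), "p99": percentile(vals, 99)}
        for span_type, vals in grouped.items()
    }

# Hypothetical span records; field names and values are illustrative only.
spans = [
    {"type": "tts", "latency_ms": 180, "cost_usd": 0.002},
    {"type": "tts", "latency_ms": 950, "cost_usd": 0.004},
    {"type": "llm", "latency_ms": 420, "cost_usd": 0.011},
    {"type": "llm", "latency_ms": 390, "cost_usd": 0.009},
]

# The same grouping works for any numeric metric, not only latency.
for metric in ("latency_ms", "cost_usd"):
    print(metric, tail_by_span_type(spans, metric))
```

Grouping before taking the percentile is the point: a slow text-to-speech tail stays visible even when the overall call latency looks healthy.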
Risk is tracked as a trend, per layer: the conversational agent scores low-risk (read-only); the background agent scores higher (it mutates data) — and both are tracked over time, so drift toward riskier behaviour in either layer shows up before an individual call looks alarming.
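One simple way to think about "risk as a trend" is to compare a layer's recent scores against its own earlier baseline. The sketch below assumes risk scores between 0 and 1 and a fixed window size. Both are illustrative assumptions, and the layer names and numbers are made up rather than taken from the deployment.

```python
from statistics import mean

def risk_drift(scores, window=3):
    """Mean of the last `window` scores minus the mean of all earlier scores.

    Returns 0.0 when there is not enough history to form a baseline.
    """
    if len(scores) <= window:
        return 0.0
    baseline = mean(scores[:-window])
    recent = mean(scores[-window:])
    return recent - baseline

# Hypothetical per-layer risk scores over time (oldest first).
history = {
    "conversational": [0.10, 0.10, 0.12, 0.11, 0.10, 0.12],
    "background": [0.40, 0.42, 0.41, 0.55, 0.58, 0.60],
}

for layer, scores in history.items():
    print(layer, round(risk_drift(scores), 3))
```

A positive drift in either layer signals movement toward riskier behaviour even when no single score crosses an absolute threshold.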
Business-logic-level spans capture what the agent was actually trying to do, which matters when the pipeline underneath changes but the intent doesn't.
Your quality definitions — iteration count, TTS glitch rate, latency, hallucination — plug in as native evals: Prefactor captures the run and attaches the result; you bring the scoring logic via API. The same pattern applies to LLM-as-judge.
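The scoring logic a team brings could be as small as one function that turns a captured run into named scores. The sketch below covers only that team-side logic. The `CallRun` fields and the `score_run` name are hypothetical, and the call that submits results to Prefactor is deliberately left out because its API is not described here.

```python
from dataclasses import dataclass

@dataclass
class CallRun:
    """Hypothetical summary of one captured call; all field names are assumptions."""
    clarifying_questions: int
    tts_glitches: int
    tts_segments: int
    latency_ms: float
    hallucination_flagged: bool

def score_run(run: CallRun) -> dict:
    """Turn one run into the team's own quality scores."""
    glitch_rate = run.tts_glitches / run.tts_segments if run.tts_segments else 0.0
    return {
        "iteration_count": run.clarifying_questions,
        "tts_glitch_rate": glitch_rate,
        "latency_ms": run.latency_ms,
        "hallucination": run.hallucination_flagged,
    }

run = CallRun(
    clarifying_questions=2,
    tts_glitches=1,
    tts_segments=4,
    latency_ms=850.0,
    hallucination_flagged=False,
)
print(score_run(run))
```

An LLM-as-judge eval would fit the same shape: the function body calls a model instead of doing arithmetic, and still returns named scores for the run.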
One phone call, its background agent's actions, and the async after-call workflow — even across separate services — stay connected to the same instance.
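Linking across services usually comes down to every service carrying the same identifier. The sketch below assumes each event records a shared `conversation_id` and a timestamp `ts`. Those field names, the service names, and the events themselves are illustrative assumptions, not a description of how Prefactor stores data.

```python
from collections import defaultdict

def stitch(events):
    """Group events by conversation and order each group by timestamp."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event["conversation_id"]].append(event)
    for group in timelines.values():
        group.sort(key=lambda event: event["ts"])
    return dict(timelines)

# Hypothetical events emitted by three separate services.
events = [
    {"conversation_id": "c-1", "ts": 30, "service": "after_call_workflow", "name": "send_summary"},
    {"conversation_id": "c-1", "ts": 10, "service": "voice", "name": "call_started"},
    {"conversation_id": "c-1", "ts": 20, "service": "background_agent", "name": "update_booking"},
    {"conversation_id": "c-2", "ts": 15, "service": "voice", "name": "call_started"},
]

for conversation_id, timeline in stitch(events).items():
    print(conversation_id, [event["service"] for event in timeline])
```

Because the after-call workflow runs asynchronously, the identifier has to be passed along explicitly (for example in a queue message or job payload) rather than inferred from timing.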
Proactive alerts fire on a real pattern, like a spike in negative feedback, rather than a single thumbs-down.
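The difference between a pattern and a single thumbs-down can be expressed as two conditions: enough negative signals in the recent window, and a rate well above the usual baseline. The thresholds below (`min_negatives`, `factor`) are illustrative assumptions, not Prefactor defaults.

```python
def should_alert(recent, baseline_rate, min_negatives=3, factor=2.0):
    """Decide whether recent feedback looks like a spike.

    recent: list of booleans, True meaning negative feedback.
    baseline_rate: the usual share of negative feedback, between 0 and 1.
    """
    if not recent:
        return False
    negatives = sum(recent)
    rate = negatives / len(recent)
    # Require both an absolute count and a relative jump over the baseline.
    return negatives >= min_negatives and rate >= factor * baseline_rate

one_thumbs_down = [True] + [False] * 9
spike = [True] * 4 + [False] * 6

print(should_alert(one_thumbs_down, baseline_rate=0.1))
print(should_alert(spike, baseline_rate=0.1))
```

The absolute-count condition is what keeps one unhappy caller in a quiet hour from paging anyone.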
Frequently asked questions
Do we need to change our voice pipeline to use this?
How is quality actually scored — do we have to accept a generic metric?
Can we link a call to something that happens after it ends?
See it on your own agents
Book a demo and we'll walk through this scenario on a fleet like yours, with real frameworks and real traces.