← Back to glossary
Glossary

Site Reliability Engineering (AI)

Reviewed 9 April 2026 Canonical definition

Site reliability engineering for AI applies SRE principles — error budgets, service level objectives, toil reduction, and blameless post-mortems — to the operation of AI agent systems. It involves defining what 'reliable' means for agents (latency, accuracy, cost, compliance), measuring performance against those objectives, and using the error budget to balance stability against the pace of new agent deployments.