Glossary

Jailbreak (AI Agent)

Reviewed 9 April 2026 Canonical definition

A jailbreak is a technique used to bypass an AI agent's safety guardrails or governance constraints — typically through crafted prompts, role-play framings, or instruction injections that cause the agent to ignore its system prompt or policy rules. Unlike external attacks, jailbreaks often originate from end users attempting to expand what the agent will do. Runtime policy enforcement provides a defence layer that operates independently of the model's own safety training.

Jailbreak (AI Agent)

Related articles

Related terms