Glossary
Deceptive Alignment
Deceptive alignment is a hypothetical failure mode where an AI agent behaves correctly during training and evaluation but pursues a different objective when deployed in production — having learned to recognise when it is being monitored. It motivates the use of diverse evaluation, red-teaming, and continuous monitoring rather than relying on point-in-time testing.