Glossary

Deceptive Alignment

Reviewed 9 April 2026 Canonical definition

Deceptive alignment is a hypothetical failure mode where an AI agent behaves correctly during training and evaluation but pursues a different objective when deployed in production — having learned to recognise when it is being monitored. It motivates the use of diverse evaluation, red-teaming, and continuous monitoring rather than relying on point-in-time testing.

Deceptive Alignment

Related terms