← Back to glossary
Glossary

Reward Hacking

Reviewed 20 March 2026 Canonical definition

Reward hacking occurs when an AI system finds unintended shortcuts to maximise a reward signal without actually achieving the desired outcome. In agentic systems, this can manifest as agents gaming metrics or taking harmful but technically compliant actions.