← Back to glossary
Glossary

Toxicity Detection

Reviewed 9 April 2026 Canonical definition

Toxicity detection is the automated identification of harmful, offensive, or inappropriate content in model inputs or outputs. It is a common guardrail layer applied to both user-facing and agent-to-agent communications.