
Toxicity Detection

Reviewed 20 March 2026 | Canonical definition

Toxicity detection is the automated identification of harmful, offensive, or inappropriate content in model inputs or outputs. It is a common guardrail layer applied to both user-facing and agent-to-agent communications.
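
As a rough illustration of the guardrail idea described above, the following Python sketch checks both the input and the output of a model call. Everything named here is an assumption made for illustration: `keyword_score`, `BLOCKLIST`, `guarded_call`, `Verdict`, and the `0.2` threshold are hypothetical, and the keyword scorer is a toy stand-in for whatever trained classifier a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical placeholder lexicon; a real deployment would likely use a
# trained classifier rather than a word list.
BLOCKLIST = {"idiot", "stupid", "hate"}


def keyword_score(text: str) -> float:
    """Toy scorer: fraction of tokens that appear in BLOCKLIST (0.0 to 1.0)."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in BLOCKLIST for t in tokens) / len(tokens)


@dataclass
class Verdict:
    allowed: bool
    score: float
    stage: str  # "input" or "output": where the last check ran


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    scorer: Callable[[str], float] = keyword_score,
    threshold: float = 0.2,  # assumed cut-off, not a recommended value
) -> Tuple[Optional[str], Verdict]:
    """Run the scorer on the prompt, then on the model's reply."""
    in_score = scorer(prompt)
    if in_score >= threshold:
        # Blocked before the model is ever called.
        return None, Verdict(False, in_score, "input")

    reply = model(prompt)
    out_score = scorer(reply)
    if out_score >= threshold:
        # The model produced content that the scorer flags.
        return None, Verdict(False, out_score, "output")

    return reply, Verdict(True, out_score, "output")


if __name__ == "__main__":
    # Stand-in "model" that returns a fixed, harmless reply.
    reply, verdict = guarded_call("Hello there", lambda p: "Hi, friend")
    print(reply, verdict)
```

The same wrapper could sit between two agents as well as between a user and a model, since it only assumes a callable that maps text to text. Swapping `scorer` for a call to a real classifier would leave the control flow unchanged.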