Toxicity Detection

The Toxicity Detection provider identifies harmful language including hate speech, threats, insults, and other toxic content. It uses multi-label classification to flag one or more toxicity categories simultaneously.

What it detects

Category	Examples
Hate speech	Slurs, dehumanization, group-targeted hostility
Threats	Direct or implied threats of violence
Insults	Personal attacks, name-calling, demeaning language
Obscenity	Gratuitously offensive language intended to shock
Severe toxicity	Content combining multiple toxic categories at high intensity

A single input can trigger multiple categories. For example, a message may be both hateful and threatening, in which case both labels appear on the same violation.

Configuration

{
  "policy_type": "toxicity",
  "mode": "blocking",
  "config": {
    "threshold": 0.5
  }
}

Parameter	Type	Default	Description
`threshold`	float	`0.5`	Confidence threshold (0–1). Lower values catch more content but may increase false positives.
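
To make the effect of `threshold` concrete, here is a minimal Python sketch of how a multi-label threshold could be applied. It is an illustration only: the per-category score dictionary, the `flagged_labels` helper, the label names other than `hate_speech`, and the inclusive (`>=`) comparison are all assumptions, not the provider's documented behavior.

```python
# Illustrative sketch only: the score format and the >= comparison are
# assumptions, not the provider's documented internals.

def flagged_labels(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return every category whose score meets or exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-category confidence scores for one input.
scores = {"hate_speech": 0.92, "threats": 0.41, "insults": 0.63, "obscenity": 0.08}

default_flags = flagged_labels(scores)       # uses the default threshold of 0.5
strict_flags = flagged_labels(scores, 0.4)   # a lower threshold flags more categories

print(default_flags)
print(strict_flags)
```

With these made-up scores, lowering the threshold from 0.5 to 0.4 adds `threats` to the flagged set, which is the trade-off described in the table: more content is caught, at the cost of more borderline matches.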

Example violation

{
  "allowed": false,
  "violations": [
    {
      "policy_type": "toxicity",
      "severity": "high",
      "description": "Hate speech detected targeting ethnic group",
      "labels": ["hate_speech", "severe_toxicity"],
      "confidence": 0.92
    }
  ]
}
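
A client consuming this response might handle it along the following lines. This is a sketch that assumes only the fields shown in the example above; `summarize_violations` is a hypothetical helper, and how the response is actually fetched is outside its scope.

```python
import json

# The example violation from this page, as a JSON string.
response_text = """
{
  "allowed": false,
  "violations": [
    {
      "policy_type": "toxicity",
      "severity": "high",
      "description": "Hate speech detected targeting ethnic group",
      "labels": ["hate_speech", "severe_toxicity"],
      "confidence": 0.92
    }
  ]
}
"""


def summarize_violations(response: dict) -> list[str]:
    """Build one human-readable line per toxicity violation (hypothetical helper)."""
    lines = []
    for violation in response.get("violations", []):
        if violation.get("policy_type") != "toxicity":
            continue  # ignore violations raised by other policy types
        labels = ", ".join(violation.get("labels", []))
        lines.append(
            f"{violation['severity']} ({violation['confidence']:.2f}): {labels}"
        )
    return lines


response = json.loads(response_text)
summary = summarize_violations(response)

if not response["allowed"]:
    for line in summary:
        print("blocked:", line)
```

Reading `labels` as a list rather than a single value matters here, because one violation can carry several categories at once.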

Best practices

Start with the default threshold of 0.5 and adjust based on your false-positive rate.
Use mode: "warning" during initial rollout to monitor detections without blocking users.
Combine with the Sensitive Topics provider for comprehensive content safety coverage.
Lower the threshold for customer-facing applications where brand safety is critical, accepting that this may increase false positives.
Review flagged content in the dashboard to calibrate thresholds for your specific use case.
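
The rollout and calibration advice above can be sketched in Python as follows. The policy fields and the `"warning"` and `"blocking"` modes come from this page; the `build_policy` and `false_positive_share` helpers, the reviewed-sample format, and the sample data are hypothetical and stand in for whatever review export you have available.

```python
import json


def build_policy(mode: str, threshold: float = 0.5) -> dict:
    """Build a toxicity policy in the shape shown under Configuration."""
    if mode not in ("warning", "blocking"):
        raise ValueError("mode must be 'warning' or 'blocking'")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return {"policy_type": "toxicity", "mode": mode, "config": {"threshold": threshold}}


def false_positive_share(reviewed: list[tuple[float, bool]], threshold: float) -> float:
    """Share of flagged items that reviewers judged not toxic.

    `reviewed` holds (confidence, reviewer_says_toxic) pairs, a made-up
    format for manually reviewed detections.
    """
    flagged = [is_toxic for confidence, is_toxic in reviewed if confidence >= threshold]
    if not flagged:
        return 0.0
    return flagged.count(False) / len(flagged)


# Hypothetical review results gathered while running in warning mode.
reviewed = [(0.95, True), (0.72, True), (0.61, False),
            (0.55, False), (0.48, True), (0.30, False)]

rollout_policy = build_policy("warning")         # monitor without blocking users
share_at_default = false_positive_share(reviewed, 0.5)
share_at_stricter = false_positive_share(reviewed, 0.7)

print(json.dumps(rollout_policy, indent=2))
print(share_at_default, share_at_stricter)
```

In this invented sample, raising the threshold to 0.7 removes the false positives but also stops flagging the genuinely toxic item scored at 0.48 and anything else below 0.7, so the right value depends on which kind of error is more costly for your application.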
