Toxicity Detection
Detect toxic, hateful, and threatening content in LLM inputs and outputs.
The Toxicity Detection provider identifies harmful language including hate speech, threats, insults, and other toxic content. It uses multi-label classification to flag one or more toxicity categories simultaneously.
What it detects
| Category | Examples |
|---|---|
| Hate speech | Slurs, dehumanization, group-targeted hostility |
| Threats | Direct or implied threats of violence |
| Insults | Personal attacks, name-calling, demeaning language |
| Obscenity | Gratuitously offensive language intended to shock |
| Severe toxicity | Content combining multiple toxic categories at high intensity |
A single input can trigger multiple categories — for example, a message may be both hateful and threatening.
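To make the multi-label behavior concrete, here is a minimal sketch of how per-category scores could be turned into a set of flagged categories. The score values, the `flag_categories` helper, the label names other than `hate_speech`, and the use of `>=` for the comparison are all assumptions for illustration; the provider's internal classifier and score format are not documented here.

```python
from typing import Dict, List


def flag_categories(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every category whose score meets the threshold, highest score first."""
    flagged = [label for label, score in scores.items() if score >= threshold]
    return sorted(flagged, key=lambda label: scores[label], reverse=True)


# Hypothetical per-category scores for a single message (not real provider output).
scores = {
    "hate_speech": 0.92,
    "threat": 0.61,
    "insult": 0.34,
    "obscenity": 0.08,
}

print(flag_categories(scores))       # default threshold: two categories flagged
print(flag_categories(scores, 0.3))  # lower threshold: "insult" is flagged as well
```

Each category is compared against the threshold independently, so one message can be flagged as both hateful and threatening.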
Configuration
    {
      "policy_type": "toxicity",
      "mode": "blocking",
      "config": {
        "threshold": 0.5
      }
    }

| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | float | 0.5 | Confidence threshold (0–1). Lower values catch more content but may increase false positives. |
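If you generate policies programmatically, a small builder can catch invalid values before they are submitted. The sketch below is hypothetical: `build_toxicity_policy` is not part of any documented SDK, and it assumes that `"blocking"` and `"warning"` (the two modes named on this page) are the only accepted modes.

```python
import json

# Assumption: only the two modes mentioned on this page are valid.
ALLOWED_MODES = {"blocking", "warning"}


def build_toxicity_policy(threshold: float = 0.5, mode: str = "blocking") -> dict:
    """Build a toxicity policy dict in the shape shown above, validating inputs."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    return {
        "policy_type": "toxicity",
        "mode": mode,
        "config": {"threshold": threshold},
    }


# Example: a non-blocking policy for an initial rollout.
rollout_policy = build_toxicity_policy(threshold=0.5, mode="warning")
print(json.dumps(rollout_policy, indent=2))
```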
Example violation
    {
      "allowed": false,
      "violations": [
        {
          "policy_type": "toxicity",
          "severity": "high",
          "description": "Hate speech detected targeting ethnic group",
          "labels": ["hate_speech", "severe_toxicity"],
          "confidence": 0.92
        }
      ]
    }

Best practices
- Start with the default threshold of `0.5` and adjust based on your false-positive rate.
- Use `mode: "warning"` during initial rollout to monitor detections without blocking users.
- Combine with the Sensitive Topics provider for comprehensive content safety coverage.
- Lower the threshold for customer-facing applications where brand safety is critical.
- Review flagged content in the dashboard to calibrate thresholds for your specific use case.
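When calibrating thresholds, it can help to log each violation in a compact form. The following sketch parses a response shaped like the one under Example violation and summarizes its toxicity violations. The `summarize_violations` helper is hypothetical, and the code assumes the response arrives as a JSON string with the fields shown in that example; how the response is actually delivered depends on your integration.

```python
import json
from typing import List, Tuple


def summarize_violations(response_text: str) -> Tuple[bool, List[str]]:
    """Return the `allowed` flag and one summary line per toxicity violation."""
    response = json.loads(response_text)
    summaries = []
    for violation in response.get("violations", []):
        if violation.get("policy_type") != "toxicity":
            continue  # ignore violations reported by other policies
        labels = ", ".join(violation.get("labels", []))
        summaries.append(
            f"[{labels}] severity={violation.get('severity')} "
            f"confidence={violation.get('confidence'):.2f}"
        )
    return response.get("allowed", True), summaries


# Sample input copied from the example violation on this page.
sample = """
{
  "allowed": false,
  "violations": [
    {
      "policy_type": "toxicity",
      "severity": "high",
      "description": "Hate speech detected targeting ethnic group",
      "labels": ["hate_speech", "severe_toxicity"],
      "confidence": 0.92
    }
  ]
}
"""

allowed, lines = summarize_violations(sample)
print(allowed)
for line in lines:
    print(line)
```

Recording the confidence next to the labels makes it easier to see which detections sit close to the configured threshold.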