Nyraxis AI

Toxicity Detection

Detect toxic, hateful, and threatening content in LLM inputs and outputs.

The Toxicity Detection provider identifies harmful language, including hate speech, threats, insults, and other toxic content. It uses multi-label classification, so it can flag more than one toxicity category for the same input.

What it detects

Category | Examples
Hate speech | Slurs, dehumanization, group-targeted hostility
Threats | Direct or implied threats of violence
Insults | Personal attacks, name-calling, demeaning language
Obscenity | Gratuitously offensive language intended to shock
Severe toxicity | Content combining multiple toxic categories at high intensity

A single input can trigger multiple categories — for example, a message may be both hateful and threatening.
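
To illustrate how multi-label flagging can surface several categories at once, here is a minimal Python sketch. It assumes the classifier produces one independent score per category and that a category is flagged when its score meets the threshold; the provider's internals are not documented here, so treat this as a mental model only. The function name flagged_labels, the sample scores, and every label name other than hate_speech and severe_toxicity are hypothetical.

```python
from typing import Dict, List


def flagged_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every category whose score meets the threshold, sorted by name.

    Assumption: each category is scored independently, so zero, one,
    or several labels can be returned for the same input.
    """
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-category scores for a single message.
example_scores = {
    "hate_speech": 0.92,
    "threats": 0.71,
    "insults": 0.30,
    "obscenity": 0.12,
}

labels = flagged_labels(example_scores)
print(labels)  # ['hate_speech', 'threats']
```

With the default threshold of 0.5, the sample message is flagged as both hateful and threatening, while the two low-scoring categories are ignored.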

Configuration

{
  "policy_type": "toxicity",
  "mode": "blocking",
  "config": {
    "threshold": 0.5
  }
}

Parameter | Type | Default | Description
threshold | float | 0.5 | Confidence threshold (0–1). Lower values catch more content but may increase false positives.
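
If you generate policies programmatically, a small helper can reject out-of-range thresholds before the policy is submitted. The sketch below only builds the JSON structure shown above. The function name build_toxicity_policy is hypothetical, and how the resulting policy is sent to Nyraxis AI is deliberately left out, since that call is not described on this page.

```python
import json


def build_toxicity_policy(threshold: float = 0.5, mode: str = "blocking") -> dict:
    """Build a toxicity policy dict in the shape shown in the configuration example.

    The threshold must lie in the documented 0-1 range. The mode value is
    passed through unchecked, because this page names "blocking" and
    "warning" without saying whether the list is exhaustive.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    return {
        "policy_type": "toxicity",
        "mode": mode,
        "config": {"threshold": threshold},
    }


# A stricter threshold in warning mode, for example during an initial rollout.
policy = build_toxicity_policy(threshold=0.4, mode="warning")
print(json.dumps(policy, indent=2))
```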

Example violation

{
  "allowed": false,
  "violations": [
    {
      "policy_type": "toxicity",
      "severity": "high",
      "description": "Hate speech detected targeting ethnic group",
      "labels": ["hate_speech", "severe_toxicity"],
      "confidence": 0.92
    }
  ]
}
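
A response like the one above can be reduced to short log lines on the client side. The following Python sketch assumes every violation carries the fields shown in the example (policy_type, severity, labels, confidence). The helper name summarize_violations is hypothetical and not part of any Nyraxis AI SDK.

```python
import json
from typing import List


def summarize_violations(response: dict, policy_type: str = "toxicity") -> List[str]:
    """Return one summary line per violation of the given policy type."""
    summaries = []
    for violation in response.get("violations", []):
        if violation.get("policy_type") != policy_type:
            continue  # skip violations raised by other providers
        labels = ", ".join(violation["labels"])
        summaries.append(
            f"{violation['severity']}: {labels} "
            f"(confidence {violation['confidence']:.2f})"
        )
    return summaries


# The example violation from this page, parsed from its JSON form.
response = json.loads("""
{
  "allowed": false,
  "violations": [
    {
      "policy_type": "toxicity",
      "severity": "high",
      "description": "Hate speech detected targeting ethnic group",
      "labels": ["hate_speech", "severe_toxicity"],
      "confidence": 0.92
    }
  ]
}
""")

summaries = summarize_violations(response)
print(summaries)  # ['high: hate_speech, severe_toxicity (confidence 0.92)']
```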

Best practices

  • Start with the default threshold of 0.5 and adjust based on your false-positive rate.
  • Use mode: "warning" during initial rollout to monitor detections without blocking users.
  • Combine with the Sensitive Topics provider for comprehensive content safety coverage.
  • Lower the threshold for customer-facing applications where brand safety is critical.
  • Review flagged content in the dashboard to calibrate thresholds for your specific use case.
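
One way to make threshold calibration concrete is to compare candidate thresholds against a set of detections that a human has already reviewed. The Python sketch below assumes you can collect (confidence, reviewer verdict) pairs for flagged content; the sample data is invented for illustration and the function name benign_share_of_flagged is hypothetical. It reports the share of flagged items that reviewers judged benign, which is a practical stand-in for the false-positive rate when only flagged content has been reviewed.

```python
from typing import List, Tuple

# (confidence, reviewer_says_toxic) pairs. Hypothetical sample data.
Reviewed = List[Tuple[float, bool]]


def benign_share_of_flagged(reviewed: Reviewed, threshold: float) -> float:
    """Share of items at or above the threshold that reviewers judged benign.

    Returns 0.0 when nothing would be flagged at this threshold.
    """
    verdicts = [is_toxic for confidence, is_toxic in reviewed if confidence >= threshold]
    if not verdicts:
        return 0.0
    return verdicts.count(False) / len(verdicts)


reviewed_sample: Reviewed = [
    (0.95, True),
    (0.81, True),
    (0.66, False),
    (0.58, True),
    (0.52, False),
    (0.41, False),
]

for candidate in (0.5, 0.6, 0.7):
    share = benign_share_of_flagged(reviewed_sample, candidate)
    print(f"threshold {candidate}: {share:.0%} of flagged items were benign")
```

On this sample, raising the threshold from 0.5 to 0.7 removes all benign detections but also stops flagging the genuinely toxic item scored at 0.58, which is the trade-off to weigh for your use case.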
