Jailbreak Detection

Jailbreak Detection identifies attempts to circumvent your LLM's safety guardrails. It catches techniques like persona hijacking, safety bypass prompts, and refusal suppression that aim to make the model produce harmful or policy-violating content.

What it detects

Persona hijacking ("DAN" and similar unrestricted personas)
Safety bypass attempts (fictional framing to avoid restrictions)
Refusal suppression ("never say you can't")
Roleplay-based jailbreaks
Multi-turn escalation patterns
Hypothetical framing attacks

Configuration

{
  "policy_type": "jailbreak",
  "mode": "blocking",
  "config": {
    "threshold": 0.80
  }
}
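
How the threshold interacts with a detection's confidence score can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `should_block` helper is hypothetical, and it assumes that a detection counts as a violation when its confidence is at or above the threshold and that only "blocking" mode rejects the request.

```python
import json

# Policy copied from the configuration above.
POLICY = json.loads("""
{
  "policy_type": "jailbreak",
  "mode": "blocking",
  "config": {"threshold": 0.80}
}
""")


def should_block(confidence: float, policy: dict) -> bool:
    """Return True if a detection at this confidence would be blocked.

    Assumptions (not confirmed by the documentation): a detection is a
    violation when confidence >= threshold, and only "blocking" mode
    rejects the request.
    """
    threshold = policy["config"]["threshold"]
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be in [0, 1], got {threshold}")
    return policy["mode"] == "blocking" and confidence >= threshold


print(should_block(0.92, POLICY))  # True: 0.92 is above the 0.80 threshold
print(should_block(0.65, POLICY))  # False: 0.65 is below the threshold
```

Under these assumptions, lowering the threshold blocks more borderline prompts, and raising it lets more of them through.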

Example violation

{
  "policy_type": "jailbreak",
  "severity": "high",
  "description": "Persona hijacking attempt detected",
  "details": {
    "attack_type": "persona_hijacking",
    "confidence": 0.92,
    "pattern": "unrestricted_persona"
  }
}
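
A violation in this shape can be turned into a one-line log entry. The sketch below uses only the fields shown in the example above; the `summarize_violation` helper is a hypothetical name, and how your application actually receives the violation payload is left out.

```python
import json

# The example violation from above, as it might arrive in a JSON payload.
RAW_VIOLATION = """
{
  "policy_type": "jailbreak",
  "severity": "high",
  "description": "Persona hijacking attempt detected",
  "details": {
    "attack_type": "persona_hijacking",
    "confidence": 0.92,
    "pattern": "unrestricted_persona"
  }
}
"""


def summarize_violation(raw: str) -> str:
    """Build a compact, log-friendly summary of one violation."""
    violation = json.loads(raw)
    # "details" is treated as optional so a sparse payload does not crash.
    details = violation.get("details", {})
    return (
        f"[{violation['severity'].upper()}] {violation['policy_type']}: "
        f"{violation['description']} "
        f"(attack_type={details.get('attack_type')}, "
        f"confidence={details.get('confidence')})"
    )


print(summarize_violation(RAW_VIOLATION))
```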

Best practices

Deploy alongside prompt injection detection for layered defense
Monitor flagged attempts to identify emerging jailbreak techniques
Start with a threshold of 0.80 to balance security with conversational flexibility
Review false positives from creative writing use cases and adjust the threshold accordingly
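
One way to monitor flagged attempts is to tally violations by attack type and watch for categories that start to grow. The following is a minimal sketch under stated assumptions: the violations are already collected as dictionaries in the shape of the example violation, the `attack_type_counts` helper is hypothetical, and the sample data (including the "refusal_suppression" value) is invented for illustration.

```python
from collections import Counter


def attack_type_counts(violations: list[dict]) -> Counter:
    """Count jailbreak violations per attack type."""
    return Counter(
        v["details"]["attack_type"]
        for v in violations
        if v.get("policy_type") == "jailbreak"
    )


# Hypothetical sample data, not real detection output.
sample = [
    {"policy_type": "jailbreak",
     "details": {"attack_type": "persona_hijacking", "confidence": 0.92}},
    {"policy_type": "jailbreak",
     "details": {"attack_type": "refusal_suppression", "confidence": 0.85}},
    {"policy_type": "jailbreak",
     "details": {"attack_type": "persona_hijacking", "confidence": 0.88}},
]

counts = attack_type_counts(sample)
for attack_type, n in counts.most_common():
    print(f"{attack_type}: {n}")
```

Comparing these counts across time windows gives a rough signal of which techniques are becoming more common.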
