Jailbreak Detection
Detects persona hijacking, safety bypass, and refusal suppression attempts.
Jailbreak Detection identifies attempts to circumvent your LLM's safety guardrails. It catches techniques like persona hijacking, safety bypass prompts, and refusal suppression that aim to make the model produce harmful or policy-violating content.
What it detects
- Persona hijacking ("DAN" and similar unrestricted personas)
- Safety bypass attempts (fictional framing to avoid restrictions)
- Refusal suppression ("never say you can't")
- Roleplay-based jailbreaks
- Multi-turn escalation patterns
- Hypothetical framing attacks
Configuration
{
  "policy_type": "jailbreak",
  "mode": "blocking",
  "config": {
    "threshold": 0.80
  }
}
Example violation
{
  "policy_type": "jailbreak",
  "severity": "high",
  "description": "Persona hijacking attempt detected",
  "details": {
    "attack_type": "persona_hijacking",
    "confidence": 0.92,
    "pattern": "unrestricted_persona"
  }
}
Best practices
- Deploy alongside prompt injection detection for layered defense
- Monitor flagged attempts to identify emerging jailbreak techniques
- Set the threshold to 0.80 to balance security with conversational flexibility
- Review false positives from creative writing use cases and adjust the threshold accordingly
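The monitoring and false-positive review practices above could be supported by a small triage script. The following is a minimal sketch, not part of the product: it assumes violations are available as a list of dictionaries shaped like the example violation, and that `details.confidence` is the score compared against `threshold`. The function name `summarize_violations` and the `REVIEW_MARGIN` band are hypothetical and chosen only for illustration.

```python
from collections import Counter

THRESHOLD = 0.80      # value from the configuration example
REVIEW_MARGIN = 0.05  # hypothetical: how far above the threshold counts as "borderline"


def summarize_violations(violations, threshold=THRESHOLD, margin=REVIEW_MARGIN):
    """Tally jailbreak violations by attack type and collect borderline ones.

    Assumes each violation is a dict shaped like the example violation,
    with "policy_type" and a "details" dict holding "attack_type" and
    "confidence". Returns (Counter of attack types, list of borderline items).
    """
    jailbreaks = [v for v in violations if v.get("policy_type") == "jailbreak"]

    # Count attack types so that new or growing techniques stand out.
    by_attack_type = Counter(
        v.get("details", {}).get("attack_type", "unknown") for v in jailbreaks
    )

    # Violations that scored just above the threshold are the most likely
    # false positives, so they are set aside for manual review.
    borderline = []
    for v in jailbreaks:
        confidence = v.get("details", {}).get("confidence")
        if confidence is not None and threshold <= confidence < threshold + margin:
            borderline.append(v)

    return by_attack_type, borderline


# Usage with the example violation from this page.
example = {
    "policy_type": "jailbreak",
    "severity": "high",
    "description": "Persona hijacking attempt detected",
    "details": {
        "attack_type": "persona_hijacking",
        "confidence": 0.92,
        "pattern": "unrestricted_persona",
    },
}
counts, to_review = summarize_violations([example])
print(counts)
print(len(to_review))
```

If many creative writing requests land in the borderline list, that is a signal to revisit the threshold. How violations are actually exported or queried depends on your deployment and is not covered here.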