Detects instruction override and system prompt extraction attempts using multi-layer analysis.

Prompt Injection

Prompt Injection detection identifies attempts to override your system instructions or extract your system prompt. It combines heuristic pattern matching with ML classification to catch both known attack templates and novel injection techniques before they reach your LLM.

What it detects

Instruction override attempts ("ignore previous instructions")
System prompt extraction ("repeat your system prompt")
Role reassignment attacks ("you are now a different AI")
Delimiter injection (fake system message boundaries)
Indirect injection via embedded content
Encoded or obfuscated injection payloads
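
To make the heuristic layer concrete, here is a minimal Python sketch of pattern matching for a few of the categories above, including a base64 decoding pass for encoded payloads. The pattern set, the category names, and the functions `decode_base64_candidates` and `heuristic_matches` are illustrative assumptions, not the product's actual rules, and the ML classification layer is not shown.

```python
import base64
import binascii
import re

# Illustrative patterns only (assumed, not the product's rule set).
PATTERNS = {
    "instruction_override": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}\b(previous|prior|above)\b"
        r".{0,20}\binstructions?\b",
        re.IGNORECASE,
    ),
    "system_prompt_extraction": re.compile(
        r"\b(repeat|reveal|print|show)\b.{0,40}\bsystem prompt\b",
        re.IGNORECASE,
    ),
    "role_reassignment": re.compile(r"\byou are now\b", re.IGNORECASE),
    "delimiter_injection": re.compile(
        r"<\|?(system|im_start)\|?>", re.IGNORECASE
    ),
}


def decode_base64_candidates(text):
    """Return decoded UTF-8 text for tokens that parse as valid base64."""
    decoded = []
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            raw = base64.b64decode(token, validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not base64, or not text: skip it
    return decoded


def heuristic_matches(text):
    """Return the sorted names of patterns found in the text or its decodings."""
    found = set()
    for candidate in [text] + decode_base64_candidates(text):
        for name, pattern in PATTERNS.items():
            if pattern.search(candidate):
                found.add(name)
    return sorted(found)
```

Fixed patterns like these only cover known templates, which is why the page describes pairing them with a classifier for novel techniques.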

Configuration

{
  "policy_type": "prompt_injection",
  "mode": "blocking",
  "config": {
    "detection_mode": "thorough",
    "threshold": 0.80
  }
}
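
Before submitting a policy like the one above, it can help to sanity-check it locally. The following Python sketch does that under stated assumptions: the allowed `detection_mode` values (`thorough`, `fast`) are taken from the best practices on this page, and the 0 to 1 range for `threshold` is inferred, not taken from a published schema. The function name `validate_policy` is hypothetical.

```python
import json

# Assumed constraints, inferred from this page rather than an official schema.
KNOWN_DETECTION_MODES = {"thorough", "fast"}


def validate_policy(raw):
    """Parse a policy JSON string and return a list of problems (empty if none)."""
    policy = json.loads(raw)
    problems = []

    if policy.get("policy_type") != "prompt_injection":
        problems.append("policy_type must be 'prompt_injection'")

    config = policy.get("config", {})
    if config.get("detection_mode") not in KNOWN_DETECTION_MODES:
        problems.append("detection_mode should be 'thorough' or 'fast'")

    threshold = config.get("threshold")
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        problems.append("threshold should be a number between 0 and 1")

    return problems
```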

Example violation

{
  "policy_type": "prompt_injection",
  "severity": "high",
  "description": "Instruction override attempt detected in user input",
  "details": {
    "attack_type": "instruction_override",
    "confidence": 0.94
  }
}
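
As a rough illustration of consuming a violation like this one, the Python sketch below parses the payload and builds a one-line log message. It assumes the violation arrives as a JSON string with the fields shown above; how violations are actually delivered is not covered on this page, and `summarize_violation` is a hypothetical helper.

```python
import json


def summarize_violation(raw):
    """Turn a violation JSON string into a single log-friendly line."""
    violation = json.loads(raw)
    details = violation.get("details", {})
    return "[{severity}] {policy}: {attack} (confidence {confidence:.2f})".format(
        severity=violation.get("severity", "unknown").upper(),
        policy=violation.get("policy_type", "unknown"),
        attack=details.get("attack_type", "unknown"),
        confidence=float(details.get("confidence", 0.0)),
    )


example = """
{
  "policy_type": "prompt_injection",
  "severity": "high",
  "description": "Instruction override attempt detected in user input",
  "details": {"attack_type": "instruction_override", "confidence": 0.94}
}
"""
print(summarize_violation(example))
# prints: [HIGH] prompt_injection: instruction_override (confidence 0.94)
```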

Best practices

Use thorough mode for production systems handling untrusted user input
Use fast mode for low-latency applications where speed is critical
Set a lower threshold (0.70) during initial deployment to catch more attempts, then tune it upward
Combine with jailbreak detection for comprehensive prompt-level protection
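
The threshold advice above can be explored with a small Python sketch that counts how many inputs would be flagged at different thresholds. It assumes an input is flagged when its confidence is at or above the threshold, which this page does not state explicitly. The scores are made up for illustration and are not real detection results.

```python
def flagged_count(confidences, threshold):
    """Count scores at or above the threshold (the comparison rule is an assumption)."""
    return sum(1 for score in confidences if score >= threshold)


# Hypothetical confidence scores, invented purely for illustration.
sample_scores = [0.55, 0.72, 0.78, 0.83, 0.94]

for threshold in (0.70, 0.80, 0.90):
    print(threshold, flagged_count(sample_scores, threshold))
```

With these sample values, a threshold of 0.70 flags four inputs while 0.80 flags two, which illustrates why a lower starting threshold surfaces more candidates for review before you tighten it.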
