Prompt Injection
Detects instruction override and system prompt extraction attempts using multi-layer analysis.
Prompt injection detection identifies attempts to override your system instructions or extract your system prompt. It combines heuristic pattern matching with ML classification to catch both known attack templates and novel injection techniques before they reach your LLM.
What it detects
- Instruction override attempts ("ignore previous instructions")
- System prompt extraction ("repeat your system prompt")
- Role reassignment attacks ("you are now a different AI")
- Delimiter injection (fake system message boundaries)
- Indirect injection via embedded content
- Encoded or obfuscated injection payloads
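To make the categories above concrete, here is a minimal sketch of what a heuristic layer could look like. It is illustrative only: the pattern set, the `heuristic_scan` and `decode_base64_candidates` names, and the base64 handling are assumptions made for this example, not the product's actual rules, and a real deployment would pair such patterns with the ML classifier.

```python
import base64
import re

# Hypothetical patterns, one per category listed above.
# These are illustrative and NOT the product's actual rule set.
HEURISTIC_PATTERNS = {
    "instruction_override": re.compile(
        r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.I
    ),
    "system_prompt_extraction": re.compile(
        r"(repeat|reveal|print)\s+your\s+system\s+prompt", re.I
    ),
    "role_reassignment": re.compile(r"you\s+are\s+now\s+", re.I),
    "delimiter_injection": re.compile(
        r"<\|?(system|im_start)\|?>|\[/?INST\]", re.I
    ),
}


def decode_base64_candidates(text):
    """Return the decoded form of any long base64-looking tokens in text."""
    decoded = []
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except ValueError:
            # Covers invalid base64 (binascii.Error) and non-UTF-8 bytes
            # (UnicodeDecodeError); both are ValueError subclasses.
            continue
    return decoded


def heuristic_scan(text):
    """Return the names of all categories whose pattern matches the input
    or any base64-decoded fragment of it."""
    candidates = [text] + decode_base64_candidates(text)
    return [
        name
        for name, pattern in HEURISTIC_PATTERNS.items()
        if any(pattern.search(candidate) for candidate in candidates)
    ]
```

Fixed patterns like these only cover known phrasings, which is why a pattern layer alone would miss paraphrased or novel attacks.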
Configuration
{
  "policy_type": "prompt_injection",
  "mode": "blocking",
  "config": {
    "detection_mode": "thorough",
    "threshold": 0.80
  }
}
Example violation
{
  "policy_type": "prompt_injection",
  "severity": "high",
  "description": "Instruction override attempt detected in user input",
  "details": {
    "attack_type": "instruction_override",
    "confidence": 0.94,
    "detection_layer": "ml_classifier"
  }
}
Best practices
- Use "thorough" mode for production systems handling untrusted user input
- Use "fast" mode for low-latency applications where speed is critical
- Set threshold lower (0.70) during initial deployment to catch more attempts, then tune upward
- Combine with jailbreak detection for comprehensive prompt-level protection
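The threshold advice can be sketched in a few lines of Python. This example assumes a detection counts as a violation when its confidence is greater than or equal to the configured threshold; that comparison rule, the `is_flagged` and `flagged_count` helpers, and the sample confidence scores are all hypothetical and serve only to show how moving the threshold changes what gets flagged.

```python
import json

# Field names mirror the configuration example; how the service applies
# them internally is an assumption made for this sketch.
policy = json.loads("""
{
  "policy_type": "prompt_injection",
  "mode": "blocking",
  "config": {"detection_mode": "thorough", "threshold": 0.80}
}
""")


def is_flagged(confidence, threshold):
    """Assumed rule: flag when confidence meets or exceeds the threshold."""
    return confidence >= threshold


def flagged_count(confidences, threshold):
    """Count how many detections would be flagged at the given threshold."""
    return sum(is_flagged(c, threshold) for c in confidences)


# Made-up confidence scores standing in for logged detections.
sample_confidences = [0.55, 0.72, 0.78, 0.83, 0.94]

# Compare the initial-deployment threshold with the configured one.
for threshold in (0.70, policy["config"]["threshold"]):
    print(threshold, flagged_count(sample_confidences, threshold))
```

With these sample values, a threshold of 0.70 flags four of the five inputs and 0.80 flags two. Running the same comparison over real logged confidences is one way to judge how far to tune the threshold upward.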