Injection detection
Two-layer detection — synchronous regex blocking and async ML enrichment — covering OWASP Agentic Top 10 threat #1.
What is prompt injection?
Attacker text in the agent's context that hijacks its behaviour.
Prompt injection is the OWASP Agentic Top 10 threat #1. An attacker embeds instructions inside content the agent reads — user messages, retrieved documents, tool outputs, web pages — and those instructions override the agent's intended goals.
Two forms:
- Direct: the attacker controls the user-turn input. This is the simplest form, and the most common in agentic pipelines where end users can type freely.
- Indirect: malicious instructions are embedded in tool output or retrieved documents (web pages, emails, knowledge-base chunks). The agent reads the content as data, but the model treats parts of it as instructions.
Two-layer detection architecture
Regex engine
30+ patterns across 6 attack categories, compiled into a single-pass scanner. Runs inline on every gateway request before the provider is called.
If the accumulated score is 70 or higher, the request is blocked immediately and a 403 is returned to the agent. The provider never sees it.
ML model
protectai/deberta-v3-base-prompt-injection-v2 — a DeBERTa model fine-tuned on injection datasets. Runs after the request is forwarded so it adds zero latency to the client.
If the model returns label: INJECTION with confidence > 0.85, the logged action is retroactively updated to blocked and an incident is created.
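The ordering of the two layers can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `Layers`, `handleRequest`, and the three injected functions are hypothetical names, and only the threshold of 70 and the 403 response come from the text above.

```typescript
// Hypothetical types; none of these names come from the product's API.
type GatewayResult = { status: number; body: string };

interface Layers {
  regexScore: (text: string) => number;            // synchronous layer, score 0-100
  callProvider: (text: string) => Promise<string>; // upstream model call
  enrichWithMl: (text: string) => Promise<void>;   // asynchronous ML layer
}

const REGEX_BLOCK_THRESHOLD = 70; // from the text: a score of 70 or more blocks

async function handleRequest(text: string, layers: Layers): Promise<GatewayResult> {
  // Layer 1 runs inline, before the provider is contacted.
  if (layers.regexScore(text) >= REGEX_BLOCK_THRESHOLD) {
    return { status: 403, body: "blocked by injection detection" };
  }
  // Forward the request first...
  const pending = layers.callProvider(text);
  // ...then start layer 2 without awaiting it, so the client never waits on the model.
  void layers.enrichWithMl(text).catch(() => {
    /* assumption: enrichment failures are logged and do not affect the response */
  });
  return { status: 200, body: await pending };
}
```

Passing the layers in as functions keeps the sketch independent of any particular scanner or inference client.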
Attack categories
| Category | Examples |
|---|---|
| instruction_override | "ignore previous instructions", goal pivots, context resets |
| jailbreak | DAN, "developer mode", "no restrictions" |
| system_override | XML tag injection, <\|im_start\|> tokens, admin-prefix claims |
| data_exfil | credential extraction, "send to attacker" |
| encoded | base64, hex, ROT13, Unicode homoglyphs, zero-width chars |
| role_play | identity substitution, persona attacks |
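The text does not describe how the encoded category is detected. One common approach, shown here purely as an assumption, is to normalize input before pattern matching so that zero-width characters and compatibility forms cannot split or disguise a keyword. `normalizeForScan` is a hypothetical helper name.

```typescript
// Zero-width and invisible characters commonly used to split keywords.
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/g;

// Assumption: input is normalized once, then handed to the pattern scanner.
function normalizeForScan(text: string): string {
  return text
    .normalize("NFKC")        // fold compatibility forms, e.g. fullwidth letters
    .replace(ZERO_WIDTH, ""); // drop zero-width characters
}
```

NFKC folds fullwidth and other compatibility characters to their plain forms, but it does not map cross-script look-alikes (for example a Cyrillic "о" standing in for a Latin "o"); those would need a separate confusables table.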
Scoring model
Pattern scores accumulate (capped at 100) to produce a per-request verdict.
Each pattern match adds its assigned weight to the request score. Multiple patterns on the same request stack — a message with a goal-pivot phrase and a base64-encoded payload scores higher than either alone.
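As a rough illustration of the accumulation, the sketch below sums weights for every matching pattern, caps the total at 100, and compares it with the block threshold of 70. The three patterns and their weights are invented for the example, and the simple loop stands in for the compiled single-pass scanner.

```typescript
// Hypothetical patterns and weights, for illustration only.
interface Pattern { category: string; regex: RegExp; weight: number }

const PATTERNS: Pattern[] = [
  { category: "instruction_override", regex: /ignore\s+(all\s+)?previous\s+instructions/i, weight: 50 },
  { category: "jailbreak", regex: /developer\s+mode/i, weight: 40 },
  { category: "encoded", regex: /[A-Za-z0-9+\/]{40,}={0,2}/, weight: 30 }, // long base64-like run
];

const SCORE_CAP = 100; // from the text
const BLOCK_AT = 70;   // from the text

function scoreRequest(text: string): { score: number; matched: string[]; blocked: boolean } {
  let score = 0;
  const matched: string[] = [];
  for (const p of PATTERNS) {
    if (p.regex.test(text)) {
      score += p.weight; // matches stack
      matched.push(p.category);
    }
  }
  score = Math.min(score, SCORE_CAP);
  return { score, matched, blocked: score >= BLOCK_AT };
}
```

With these example weights, a single override phrase scores 50 and passes, while the same phrase combined with a "developer mode" request scores 90 and is blocked.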
Session quarantine
A malicious verdict quarantines the session for 30 minutes.
When a request is classified as malicious, all subsequent requests sharing the same session_id are automatically flagged for human review for the next 30 minutes. This prevents an attacker from submitting a slightly rephrased variant immediately after the initial block.
You can clear a quarantine early from the dashboard Incidents panel or via the REST API.
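The quarantine window can be pictured as a per-session expiry timestamp. The following is a minimal in-memory sketch under that assumption; `QuarantineRegistry` and its methods are hypothetical names, and a real deployment would presumably keep this state in shared storage.

```typescript
const QUARANTINE_MS = 30 * 60 * 1000; // 30 minutes, per the text

class QuarantineRegistry {
  // session_id -> expiry time in epoch milliseconds
  private expiry: Record<string, number> = {};

  // Record a malicious verdict for this session.
  quarantine(sessionId: string, now: number = Date.now()): void {
    this.expiry[sessionId] = now + QUARANTINE_MS;
  }

  // True while the session's quarantine window is still open.
  needsReview(sessionId: string, now: number = Date.now()): boolean {
    const until = this.expiry[sessionId];
    if (until === undefined) return false;
    if (now >= until) {
      delete this.expiry[sessionId]; // window elapsed, forget the session
      return false;
    }
    return true;
  }

  // Early release, e.g. triggered from a dashboard or an API call.
  clear(sessionId: string): void {
    delete this.expiry[sessionId];
  }
}
```

Accepting `now` as a parameter makes the window easy to test without waiting 30 minutes.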
ML enrichment pipeline
What happens after the gateway logs a request.
1. Async classifyInjection(text) call dispatched to the HuggingFace inference endpoint.
2. If label === 'INJECTION' and score > 0.85: the stored action's decision field is updated to blocked and a new incident is created with severity high.
3. The fields _ml_injection_score and _ml_model are written to the action's metadata object and visible in the dashboard action detail view.
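The update applied once the classifier responds might look like the sketch below. The `StoredAction` and `Incident` shapes and the `applyMlResult` name are hypothetical; the field names, the label, the 0.85 threshold, and the model identifier come from the text. Two details are assumptions: the metadata is written for every classified request, and `_ml_injection_score` stores the probability of injection rather than the raw score of whichever label was returned.

```typescript
interface Classification { label: string; score: number }
interface StoredAction { id: string; decision: string; metadata: Record<string, unknown> }
interface Incident { actionId: string; severity: "high" }

const ML_MODEL = "protectai/deberta-v3-base-prompt-injection-v2";
const ML_THRESHOLD = 0.85;

function applyMlResult(action: StoredAction, result: Classification, incidents: Incident[]): void {
  const isInjection = result.label === "INJECTION";
  // Assumption: for a binary classifier, a non-injection label with score s
  // implies an injection probability of 1 - s.
  const injectionScore = isInjection ? result.score : 1 - result.score;

  // Step 3: record the score and model on the action's metadata.
  action.metadata["_ml_injection_score"] = injectionScore;
  action.metadata["_ml_model"] = ML_MODEL;

  // Step 2: strict comparison, so exactly 0.85 does not trigger a block.
  if (isInjection && result.score > ML_THRESHOLD) {
    action.decision = "blocked";
    incidents.push({ actionId: action.id, severity: "high" });
  }
}
```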
What it doesn't catch
Be aware of these blind spots.
Injection detection is defence-in-depth, not a guarantee. Pair it with least-privilege tool design, human-in-the-loop review for high-risk actions, and output DLP (coming in Phase 3).
Tuning with policies
Add deterministic rules on top of the ML layer.
The Policy DSL lets you write exact-match or regex rules that block specific phrases independently of the ML score. Useful when you know your threat surface (e.g., your agent always receives user input in messages[0].content).
Example — block override attempts
{
  "name": "Block override attempts",
  "action": "block",
  "match": { "action_type": "model_call" },
  "where": {
    "inputs.messages.0.content": {
      "regex": "ignore.{0,20}(instructions|rules)"
    }
  }
}

This rule fires before the ML layer and blocks the request immediately, regardless of the injection score. Stack multiple policies to cover different vectors in your specific pipeline.
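To make the matching semantics concrete, here is one way a rule like the one above could be evaluated. This is a sketch under assumptions, not the Policy DSL's implementation: `Policy`, `getPath`, and `policyMatches` are hypothetical, and the dotted key is assumed to be a path into the request where numeric segments index arrays.

```typescript
interface Policy {
  name: string;
  action: string;
  match: Record<string, string>;
  where: Record<string, { regex: string }>;
}

// Resolve a dotted path such as "inputs.messages.0.content".
function getPath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key]; // "0" also indexes arrays
  }
  return current;
}

function policyMatches(policy: Policy, request: Record<string, unknown>): boolean {
  // Every "match" field must equal the request's top-level field.
  for (const field of Object.keys(policy.match)) {
    if (request[field] !== policy.match[field]) return false;
  }
  // Every "where" path must resolve to a string that satisfies its regex.
  for (const path of Object.keys(policy.where)) {
    const value = getPath(request, path);
    if (typeof value !== "string") return false;
    if (!new RegExp(policy.where[path].regex).test(value)) return false;
  }
  return true;
}
```

Under plain JavaScript regex semantics the example pattern is case-sensitive, so "Ignore all instructions" with a capital "I" would not match; whether the DSL applies case-insensitive flags is not stated in the text.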