Injection detection

Two-layer detection — synchronous regex blocking and async ML enrichment — covering OWASP Agentic Top 10 threat #1.

What is prompt injection?

Attacker text in the agent's context that hijacks its behaviour.

Prompt injection is the OWASP Agentic Top 10 threat #1. An attacker embeds instructions inside content the agent reads — user messages, retrieved documents, tool outputs, web pages — and those instructions override the agent's intended goals.

Two forms:

▸Direct — attacker controls the user-turn input directly. Simplest form; also the most common in agentic pipelines where end-users can free-type.
▸Indirect — malicious instructions are embedded in tool output or retrieved documents (web pages, emails, knowledge-base chunks). The agent reads the content as data but the model treats parts of it as instructions.

Two-layer detection architecture

Layer 1synchronous · <1ms

Regex engine

30+ patterns across 6 attack categories, compiled into a single-pass scanner. Runs inline on every gateway request before the provider is called.

If the accumulated score reaches ≥ 70, the request is blocked immediately and a 403 is returned to the agent. The provider never sees it.

Layer 2async · ~300ms

ML model

protectai/deberta-v3-base-prompt-injection-v2 — a DeBERTa model fine-tuned on injection datasets. Runs after the request is forwarded so it adds zero latency to the client.

If the model returns label: INJECTION with confidence > 0.85, the logged action is retroactively updated to blocked and an incident is created.

Attack categories

Category	Examples
instruction_override	"ignore previous instructions", goal pivots, context resets
jailbreak	DAN, "developer mode", "no restrictions"
system_override	XML tag injection, <\|im_start\|> tokens, admin-prefix claims
data_exfil	credential extraction, "send to attacker"
encoded	base64, hex, ROT13, Unicode homoglyphs, zero-width chars
role_play	identity substitution, persona attacks

Scoring model

Pattern scores accumulate (capped at 100) to produce a per-request verdict.

Each pattern match adds its assigned weight to the request score. Multiple patterns on the same request stack — a message with a goal-pivot phrase and a base64-encoded payload scores higher than either alone.

0 – 29cleanLogged only. Request proceeds normally.

30 – 69suspiciousLogged and ML scan queued. Request proceeds; incident created if ML confirms.

70+maliciousBlocked immediately on the gateway path. Provider never receives the request.

Session quarantine

A malicious verdict locks the session for 30 minutes.

When a request is classified as malicious, all subsequent requests sharing the same session_id are automatically flagged for human review for the next 30 minutes. This prevents an attacker from submitting a slightly rephrased variant immediately after the initial block.

You can clear a quarantine early from the dashboard Incidents panel or via the REST API.

ML enrichment pipeline

What happens after the gateway logs a request.

1Async classifyInjection(text) call dispatched to the HuggingFace inference endpoint.
2If label === 'INJECTION' and score > 0.85: the stored action's decision field is updated to blocked and a new incident is created with severity high.
3The fields _ml_injection_score and _ml_model are written to the action's metadata object and visible in the dashboard action detail view.

What it doesn't catch

Be aware of these blind spots.

▸Novel zero-day jailbreaks with no surface pattern — new attack strings that have never appeared in training data or pattern lists.

▸Low-resource language attacks — payloads written entirely in languages underrepresented in the ML model's training set.

▸Purely semantic attacks — carefully crafted benign-looking text that manipulates model behaviour without triggering any lexical or syntactic signal. No static scanner catches these reliably.

Injection detection is defence-in-depth, not a guarantee. Pair it with least-privilege tool design, human-in-the-loop for high-risk actions, and output DLP (coming Phase 3).

Tuning with policies

Add deterministic rules on top of the ML layer.

The Policy DSL lets you write exact-match or regex rules that block specific phrases independently of the ML score. Useful when you know your threat surface (e.g., your agent always receives user input in messages[0].content).

Example — block override attempts

{
  "name": "Block override attempts",
  "action": "block",
  "match": { "action_type": "model_call" },
  "where": {
    "inputs.messages.0.content": {
      "regex": "ignore.{0,20}(instructions|rules)"
    }
  }
}

This rule fires before the ML layer and blocks the request immediately, regardless of the injection score. Stack multiple policies to cover different vectors in your specific pipeline.

← Gateway Policy DSL →