Indirect prompt injection defense for AI agents using tool calls
Indirect prompt injection defense for AI agents using tool calls (MCP, CLI, or direct APIs). Detects and neutralizes attacks hidden in tool results (emails, documents, PRs, etc.) before they reach your LLM.
Python package: stackone-defender — aligned with @stackone/defender on npm.
Installation
pip
pip install stackone-defender
uv
uv add stackone-defender
Tier 2 (ONNX) — add extras:
pip install "stackone-defender[onnx]"
# or: uv add "stackone-defender[onnx]"
The ONNX model (~22MB) is bundled in the wheel — no extra downloads at runtime.
SFE preprocessor (optional) — add extras:
pip install "stackone-defender[sfe]"
# or: uv add "stackone-defender[sfe]"
The [sfe] extra installs fasttext-ng (provides the fasttext module). It requires NumPy 2.3+. PyPI may ship a wheel only for some platforms; otherwise pip/uv builds from source (needs a C++ toolchain).
Quick start
from stackone_defender import create_prompt_defense

# Tier 1 + Tier 2 are on by default. block_high_risk=True enables allow/block.
defense = create_prompt_defense(block_high_risk=True)

# Optional: preload ONNX to avoid first-call latency (requires [onnx] extra)
defense.warmup_tier2()

# tool_output is the raw result of your tool call; send_to_llm is your own function.
result = defense.defend_tool_result(tool_output, "gmail_get_message")
if not result.allowed:
    print(f"Blocked: risk={result.risk_level}, score={result.tier2_score}")
    print(f"Detections: {', '.join(result.detections)}")
else:
    send_to_llm(result.sanitized)
How it works
defend_tool_result() runs two tiers:
Tier 1 — Pattern detection (sync, ~1 ms)
- Unicode normalization: homoglyph resistance (e.g. Cyrillic а → ASCII a)
- Role stripping: SYSTEM:, ASSISTANT:, <system>, [INST], etc.
- Pattern removal: phrases like “ignore previous instructions”
- Encoding detection: suspicious Base64/URL-shaped payloads
- Boundary annotation: [UD-{id}]…[/UD-{id}] wrappers around untrusted spans
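To make the Tier 1 steps concrete, here is a minimal sketch of the same kind of pipeline using only the standard library. It is not the package's implementation: the homoglyph table, the regular expressions, the "[removed]" placeholder, and the detection names are all hypothetical and far smaller than a production rule set.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny table of Cyrillic look-alikes -> ASCII.
HOMOGLYPHS = str.maketrans({"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"})

# Hypothetical role markers: "SYSTEM:" / "ASSISTANT:" at line start, <system> tags, [INST] tags.
ROLE_MARKERS = re.compile(
    r"^\s*(?:system|assistant)\s*:\s*|</?system>|\[/?INST\]",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical injection phrase pattern.
INJECTION_PHRASES = re.compile(
    r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE
)


def sanitize(text: str, span_id: str) -> tuple[str, list[str]]:
    """Return (annotated text, detection names) for one untrusted string."""
    detections: list[str] = []

    # 1. Normalize: NFKC folds compatibility forms, the table folds look-alikes.
    text = unicodedata.normalize("NFKC", text).translate(HOMOGLYPHS)

    # 2. Strip role markers.
    text, count = ROLE_MARKERS.subn("", text)
    if count:
        detections.append("role_marker")

    # 3. Replace known injection phrases with a placeholder.
    text, count = INJECTION_PHRASES.subn("[removed]", text)
    if count:
        detections.append("injection_phrase")

    # 4. Wrap the result in boundary annotations.
    return f"[UD-{span_id}]{text}[/UD-{span_id}]", detections
```

Normalization runs first so that the later regex passes see look-alike characters in their ASCII form; a phrase written with a Cyrillic "о" is then caught by the same pattern as the plain ASCII version.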
Tier 2 — ML classification (ONNX)
Packed-chunk MiniLM classifier (int8 ONNX ~22 MB, bundled):
- Split text into sentences, pack to model-sized chunks, score chunks in batched ONNX calls
- Catches paraphrased or novel injections missed by regex
- Uses chunked batch inference to bound memory on large payloads
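The split, pack, and batch steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the sentence splitter is naive, chunk size is measured in characters rather than model tokens, score_batch is a hypothetical stand-in for the ONNX call, and taking the maximum chunk score as the overall score is an assumption.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_chunks(sentences: list[str], max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def score_text(
    text: str,
    score_batch: Callable[[list[str]], list[float]],  # hypothetical model call
    max_chars: int = 200,
    batch_size: int = 8,
) -> float:
    """Score chunks in fixed-size batches and return the highest chunk score."""
    chunks = pack_chunks(split_sentences(text), max_chars)
    scores: list[float] = []
    # Fixed-size batches keep peak memory bounded on large payloads.
    for start in range(0, len(chunks), batch_size):
        scores.extend(score_batch(chunks[start:start + batch_size]))
    return max(scores, default=0.0)
```

Packing whole sentences keeps each instruction-like phrase intact inside one chunk, and the fixed batch size means memory use depends on batch_size rather than on the size of the payload.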
Optional SFE preprocessor
- use_sfe=True enables a field-level FastText pass before Tier 1/Tier 2
- Drops metadata-like leaves (IDs, enum-like strings) and keeps user-facing content
- Fails open if the runtime/model is unavailable: payload continues unfiltered
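The drop-and-fail-open behavior described above could look roughly like the sketch below. The real preprocessor uses a FastText model; here a hypothetical regex heuristic (looks_like_metadata) stands in for it, and the function names are illustrative only.

```python
import re
from typing import Any, Callable

# Hypothetical stand-in for the classifier: long opaque tokens or ALL_CAPS enums.
ID_LIKE = re.compile(r"^[A-Za-z0-9_-]{16,}$|^[A-Z][A-Z0-9_]+$")


def looks_like_metadata(value: str) -> bool:
    return bool(ID_LIKE.match(value))


def filter_leaves(payload: Any, is_metadata: Callable[[str], bool]) -> Any:
    """Recursively drop string values in dicts that the classifier marks as metadata."""
    if isinstance(payload, dict):
        kept = {}
        for key, value in payload.items():
            if isinstance(value, str) and is_metadata(value):
                continue  # drop metadata-like leaf
            kept[key] = filter_leaves(value, is_metadata)
        return kept
    if isinstance(payload, list):
        return [filter_leaves(item, is_metadata) for item in payload]
    return payload


def filter_fail_open(payload: Any, is_metadata: Callable[[str], bool]) -> Any:
    """Fail open: if the classifier is unavailable, pass the payload through unfiltered."""
    try:
        return filter_leaves(payload, is_metadata)
    except Exception:
        return payload
```

Failing open is safe here only because the payload still goes through Tier 1 and Tier 2 afterwards; the preprocessor narrows what gets scanned, it is not itself the defense.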
Benchmarks (F1 @ threshold 0.5):
| Benchmark | F1 | Samples |
|---|---|---|
| Qualifire (in-distribution) | 0.8686 | ~1.5k |
| xxz224 (out-of-distribution) | 0.8834 | ~22.5k |
| jayavibhav (adversarial) | 0.9717 | ~1k |
| Average | 0.9079 | ~25k |
allowed vs risk_level
- Use allowed for gating when block_high_risk=True: False means do not pass sanitized to the model as-is.
- risk_level is diagnostic: it starts at default_risk_level (default "medium") and is only escalated by Tier 1 / Tier 2 signals, never reduced. Use it for logging, not as the sole block signal unless you implement your own policy.
| Level | Typical trigger |
|---|---|
| low | No strong signals |
| medium | Lighter pattern / sanitization signals |
| high / critical | Strong injection patterns, encoding signals, or high Tier 2 score |
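The "escalate, never reduce" rule can be illustrated with a small ordered-level helper. This is a sketch, not the package's logic: the function names are hypothetical, and deriving allowed from a high or critical level when block_high_risk=True is an assumption based on the parameter name.

```python
LEVELS = ["low", "medium", "high", "critical"]  # ordered from least to most severe


def escalate(current: str, signal: str) -> str:
    """Return the more severe of two levels; a weaker signal never lowers the level."""
    return max(current, signal, key=LEVELS.index)


def final_risk(default_level: str, signals: list[str]) -> str:
    """Start at the default level and fold in each tier's signal."""
    level = default_level
    for signal in signals:
        level = escalate(level, signal)
    return level


def is_allowed(risk_level: str, block_high_risk: bool) -> bool:
    """Assumed policy: block only when blocking is enabled and the level is high or critical."""
    return not (block_high_risk and LEVELS.index(risk_level) >= LEVELS.index("high"))
```

With the default of "medium", a payload that raises no signals still reports medium, which is why risk_level alone is a poor block signal.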
API
create_prompt_defense(**kwargs)
defense = create_prompt_defense(
    enable_tier1=True,
    enable_tier2=True,
    block_high_risk=False,
    default_risk_level="medium",
    tier2_fields=["subject", "body", "snippet"],  # optional: scope Tier 2 to these JSON keys
    use_sfe=True,  # optional: enable semantic field extractor preprocessing
    config={
        "tier2": {
            "high_risk_threshold": 0.8,
            "tier2_fields": None,  # or list[str]; constructor tier2_fields wins if set
        },
    },
)
defense.defend_tool_result(value, tool_name)
Runs Tier 1 sanitization on risky fields, then Tier 2 on extracted text (with optional field scoping). Synchronous — no await.
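from dataclasses import dataclass, field
from typing import Any, Literal

# Assumption for illustration: the package defines RiskLevel itself;
# a Literal over the documented levels stands in for it here.
RiskLevel = Literal["low", "medium", "high", "critical"]

@dataclass
class DefenseResult:
    allowed: bool
    risk_level: RiskLevel
    sanitized: Any
    detections: list[str]
    fields_sanitized: list[str]
    patterns_by_field: dict[str, list[str]]
    tier2_score: float | None = None
    tier2_skip_reason: str | None = None
    max_sentence: str | None = None
    fields_dropped: list[str] = field(default_factory=list)
    truncated_at_depth: bool | None = None
    latency_ms: float = 0.0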
defense.defend_tool_results(items)
results = defense.defend_tool_results([
    {"value": email_data, "tool_name": "gmail_get_message"},
    {"value": doc_data, "tool_name": "documents_get"},
    {"value": pr_data, "tool_name": "github_get_pull_request"},
])
for r in results:
    if not r.allowed:
        print("Blocked:", ", ".join(r.fields_sanitized))
defense.analyze(text)
Tier 1 only — useful for debugging pattern hits without full tool-result traversal.
Tier 2 warmup
defense = create_prompt_defense()
defense.warmup_tier2() # no-op if enable_tier2=False or ONNX extra missing
Integration example
from stackone_defender import create_prompt_defense
defense = create_prompt_defense(block_high_risk=True)
defense.warmup_tier2()
def run_tool_and_defend(raw_result: dict, tool_name: str):
    outcome = defense.defend_tool_result(raw_result, tool_name)
    if not outcome.allowed:
        return {"error": "Content blocked by safety filter", "risk_level": outcome.risk_level}
    return outcome.sanitized

# Example agent loop
sanitized = run_tool_and_defend(gmail_api.get_message(msg_id), "gmail_get_message")
Risky field detection
Only string values under configured “risky” keys are scanned and sanitized. RiskyFieldConfig provides global names/patterns plus tool_overrides (wildcard tool names → field list), same idea as the npm package.
| Tool pattern | Scanned fields |
|---|---|
| gmail_*, email_* | subject, body, snippet, content |
| documents_* | name, description, content, title |
| github_* | name, title, body, description, message |
| hris_* | name, notes, bio, description |
| ats_* | name, notes, description, summary |
| crm_* | name, description, notes, content |
Otherwise the default list applies: name, description, content, title, notes, summary, bio, body, text, message, comment, subject, plus suffix patterns like *_body, *_description, etc. Structural keys such as id, url, created_at are not treated as risky by default.
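The lookup described above can be approximated with wildcard matching from the standard library. The sketch below is not RiskyFieldConfig itself: the dictionary layout and the is_risky_key helper are hypothetical, only two of the suffix patterns are shown, and treating a matching override as a full replacement of the default list (suffix patterns included) is an assumption.

```python
from fnmatch import fnmatchcase

# Tool-name wildcard -> fields to scan (values taken from the table above).
TOOL_OVERRIDES = {
    "gmail_*": {"subject", "body", "snippet", "content"},
    "email_*": {"subject", "body", "snippet", "content"},
    "documents_*": {"name", "description", "content", "title"},
    "github_*": {"name", "title", "body", "description", "message"},
    "hris_*": {"name", "notes", "bio", "description"},
    "ats_*": {"name", "notes", "description", "summary"},
    "crm_*": {"name", "description", "notes", "content"},
}

DEFAULT_FIELDS = {
    "name", "description", "content", "title", "notes", "summary",
    "bio", "body", "text", "message", "comment", "subject",
}
DEFAULT_PATTERNS = ("*_body", "*_description")  # subset of the suffix patterns


def is_risky_key(key: str, tool_name: str) -> bool:
    """Decide whether a string value under `key` should be scanned for this tool."""
    for tool_pattern, fields in TOOL_OVERRIDES.items():
        if fnmatchcase(tool_name, tool_pattern):
            # Assumption: the first matching override replaces the defaults.
            return key in fields
    return key in DEFAULT_FIELDS or any(
        fnmatchcase(key, pattern) for pattern in DEFAULT_PATTERNS
    )
```

For example, is_risky_key("subject", "gmail_get_message") is true through the gmail_* override, while a key such as "comment_body" on a tool with no override is picked up by the *_body suffix pattern.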
Development
uv sync --group dev
uv run pytest
License
Apache-2.0 — see LICENSE.