Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite.
dspy-security-bench
Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite as ground truth.
The question: when you optimize a DSPy program with
BootstrapFewShot, MIPROv2, or GEPA, does it become more or less
robust to prompt-injection attacks? Two adjacent research communities — prompt
optimization and prompt-injection security — have not measured this
intersection. dspy-security-bench wires DSPy optimizers and AgentDojo
attacks into one harness so the trade-off becomes visible.
v0.1 results
Headline: prompt optimization measurably degrades adversarial robustness on harder attacks. Optimizers buy utility (0% → 40-60% task success on
direct) but pay it back in security on important_instructions (80% → 60% attack-failure rate). BootstrapFewShot Pareto-dominates MIPROv2 on the workspace suite at v0.1's scale.
| Optimizer | Attack | Utility | Security | Injection success | n |
|---|---|---|---|---|---|
| unoptimized | direct | 0% | 100% | 0% | 5 |
| unoptimized | important_instructions | 0% | 80% | 20% | 5 |
| bootstrap_fewshot | direct | 60% | 100% | 0% | 5 |
| bootstrap_fewshot | important_instructions | 20% | 60% | 40% | 5 |
| miprov2 | direct | 40% | 80% | 20% | 5 |
| miprov2 | important_instructions | 20% | 60% | 40% | 5 |
Reading the chart. The ideal operating point is the green star at the top right (high utility and high security), so points closer to it are better. Three patterns hold at this scale:
- unoptimized is high-security but useless. It refuses to do the task (0% utility) regardless of attack, and resists attacks at 80–100%.
- bootstrap_fewshot is the best operating point at this scale. It has the highest utility on direct (60%), ties for the best security on direct (100%), and matches miprov2 on important_instructions (20% utility, 60% security).
- miprov2 Pareto-loses to bootstrap_fewshot. On direct it has lower utility (40% vs 60%) and lower security (80% vs 100%). This suggests that heavier optimization overfits the clean-distribution prompt and exposes more attack surface.
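The Pareto comparison can be checked directly from the table. The sketch below is illustrative only: the numbers are copied from the v0.1 results table, while the `dominates` helper and its definition of dominance (at least as good on every attack and metric, strictly better on one) are assumptions made for this example, not code from the package.

```python
# (utility %, security %) copied from the v0.1 results table.
RESULTS = {
    ("unoptimized", "direct"): (0, 100),
    ("unoptimized", "important_instructions"): (0, 80),
    ("bootstrap_fewshot", "direct"): (60, 100),
    ("bootstrap_fewshot", "important_instructions"): (20, 60),
    ("miprov2", "direct"): (40, 80),
    ("miprov2", "important_instructions"): (20, 60),
}
ATTACKS = ("direct", "important_instructions")


def dominates(a: str, b: str) -> bool:
    """Hypothetical helper: True if optimizer `a` is at least as good as `b`
    on every (attack, metric) pair and strictly better on at least one."""
    pairs = [
        (x, y)
        for attack in ATTACKS
        for x, y in zip(RESULTS[(a, attack)], RESULTS[(b, attack)])
    ]
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


print(dominates("bootstrap_fewshot", "miprov2"))      # True
print(dominates("bootstrap_fewshot", "unoptimized"))  # False
```

The second call returns False because unoptimized keeps higher security on important_instructions (80% vs 60%), which is the utility-for-security trade-off described above.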
v0.1 scope: workspace suite only, N=5 user tasks × 1 injection task × 2 attacks × 3 optimizers = 30 runs. gpt-4o-mini for execution + judge. Trainset = 192 validated synthetic tasks (from 100 gpt-4o + 100 claude-sonnet raw tasks, after syntactic + dedupe validation). See
scripts/run_v01_benchmark.py for reproduction.
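The run count is a plain Cartesian product. As a minimal sketch, assuming v0.1 uses the same five user task IDs and single injection task shown in the Quickstart, the grid can be enumerated like this:

```python
from itertools import product

optimizers = ["unoptimized", "bootstrap_fewshot", "miprov2"]
attacks = ["direct", "important_instructions"]
# Assumption: these are the IDs from the Quickstart example.
user_tasks = ["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"]
injection_tasks = ["injection_task_0"]

# One run per optimizer x attack x user task x injection task.
runs = list(product(optimizers, attacks, user_tasks, injection_tasks))

# Each results-table row aggregates one (optimizer, attack) cell.
cells = {(opt, atk) for opt, atk, _, _ in runs}
runs_per_cell = len(runs) // len(cells)

print(len(runs))            # 30
print(len(cells))           # 6
print(100 / runs_per_cell)  # 20.0
```

With five runs per cell, a single run moves any percentage in the table by 20 points, which is why every reported value is a multiple of 20.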
How it works
flowchart TD
A([AgentDojo seed env data]) --> B[env-data extractor]
B --> C[synthesis generator<br/>LM-generated query-only<br/>tasks grounded in env]
LM[(GPT-4o + Claude)] -.-> C
C -->|raw tasks| D[validator<br/>syntactic + dedupe<br/>+ optional solvability]
D -->|~190 validated tasks| E[optimizer harness<br/>BootstrapFewShot · MIPROv2<br/>GEPA in v0.2]
E -->|name → agent_factory| F[DSPyReActV2Element<br/>wraps dspy.ReActV2 as<br/>AgentDojo pipeline element]
F -->|AgentPipeline| G[runner<br/>drives benchmark_suite_<br/>with_injections]
AD[(AgentDojo attacks)] -.-> G
G --> H([pandas DataFrame<br/>one row per<br/>optimizer × attack ×<br/>user_task × injection_task])
classDef synth fill:#DBEAFE,stroke:#1E40AF,stroke-width:2px,color:#1E3A8A
classDef opt fill:#FED7AA,stroke:#9A3412,stroke-width:2px,color:#7C2D12
classDef eval fill:#DCFCE7,stroke:#15803D,stroke-width:2px,color:#14532D
classDef io fill:#F1F5F9,stroke:#475569,stroke-width:2px,color:#1F2937
classDef ext fill:#FAE8FF,stroke:#86198F,stroke-width:2px,color:#701A75
class B,C,D synth
class E,F opt
class G,H eval
class A io
class LM,AD ext
Install
git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench
# either with uv:
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e .
# or with pip:
pip install -e .
Requires Python 3.10+ and dspy >= 3.3.0b1 (the canonical-tool-call
release that adds dspy.ReActV2). pip/uv handle the pre-release pin
automatically because the version is explicit in pyproject.toml.
Quickstart
The full pipeline in Python:
import dspy
from dspy_security_bench.synthesis.generator import synthesize_tasks
from dspy_security_bench.synthesis.validator import validate_tasks
from dspy_security_bench.optimizers import build_agent_factories
from dspy_security_bench.llm_judge import LLMJudgeMetric
from dspy_security_bench.runner import evaluate_factories, summarize
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# 1. Generate a synthetic trainset grounded in the workspace suite's seed env
raw_tasks = synthesize_tasks("workspace", n=150, model="openai/gpt-4o")
# 2. Filter for validity and dedupe against real test tasks
val = validate_tasks(raw_tasks, "workspace", checks=("syntactic", "dedupe"))
trainset = val.kept # ~140-180 high-quality tasks survive
# 3. Run optimizers — produces a factory per optimizer
factories = build_agent_factories(
    trainset=trainset,
    optimizers=["unoptimized", "bootstrap_fewshot", "miprov2"],
    suite_name="workspace",
    signature="query -> answer",
    metric=LLMJudgeMetric(judge_lm=dspy.LM("openai/gpt-4o-mini", temperature=0)),
)
# 4. Evaluate against AgentDojo's attack suite
df = evaluate_factories(
    factories=factories,
    suite_name="workspace",
    attacks=["direct", "important_instructions"],
    user_task_ids=["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"],
    injection_task_ids=["injection_task_0"],
    max_iters=8,
)
# 5. Aggregate
print(summarize(df))
The full v0.1 run takes ~30-45 min wall-clock at ~$15-20 in LM cost
(gpt-4o-mini for everything). See
scripts/run_v01_benchmark.py for the
production driver — it caches optimizer state to data/results/factories_cache.pkl
so re-runs after a downstream crash skip optimization.
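The caching behaviour described above follows a common load-or-build pattern. The sketch below is not the driver's actual code: `load_or_build` and its arguments are hypothetical names, and it only assumes that the cached state can be pickled.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Hypothetical helper: return the cached object if the file exists;
    otherwise call `build`, write the result to disk, and return it."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

A driver could call this as `load_or_build(Path("data/results/factories_cache.pkl"), run_optimizers)`, where `run_optimizers` stands in for the expensive optimization step. Pickle files should only be loaded from a trusted source, and deleting the file forces a fresh optimization.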
CLI
The synthesis and validation steps have CLIs that produce JSONL files:
# Synthesize (dry-run prints the prompt without calling the API)
dspy-security-bench-synthesize workspace --dry-run
# Real synthesis (requires OPENAI_API_KEY / ANTHROPIC_API_KEY)
export OPENAI_API_KEY=sk-...
dspy-security-bench-synthesize workspace \
    --n 150 --model openai/gpt-4o \
    --out data/synthetic_train/workspace_gpt4o_raw.jsonl
# Validate
dspy-security-bench-validate workspace \
    data/synthetic_train/workspace_gpt4o_raw.jsonl \
    --out data/synthetic_train/workspace_gpt4o.jsonl \
    --report data/synthetic_train/workspace_gpt4o_report.json
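To make the dedupe check concrete, here is a minimal sketch of one way to drop synthetic tasks that duplicate a held-out test task or an earlier synthetic task. The validator's real matching rule is not shown in this README, so the exact-match-after-normalization rule and the `normalize` and `dedupe` helpers are assumptions for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def dedupe(candidates: list[str], held_out: list[str]) -> list[str]:
    """Hypothetical helper: keep candidates whose normalized text matches
    neither a held-out task nor a candidate that was already kept."""
    seen = {normalize(task) for task in held_out}
    kept = []
    for task in candidates:
        key = normalize(task)
        if key and key not in seen:
            seen.add(key)
            kept.append(task)
    return kept


synthetic = ["Who emailed me today?", "who emailed me today", "List my meetings."]
print(dedupe(synthetic, held_out=["List my meetings"]))
# ['Who emailed me today?']
```

Exact matching is the cheapest option and misses paraphrases; a fuzzier similarity test would catch more near-duplicates at the cost of false positives.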
Reproducing the v0.1 result
# After installing — synthesizes, validates, optimizes, evaluates, saves CSVs.
# Caches optimized state to data/results/factories_cache.pkl so reruns are fast.
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-... # optional — falls back to GPT-4o only
python scripts/run_v01_benchmark.py 2>&1 | tee data/results/run_v01.log
python scripts/generate_v01_figures.py # rebuilds the README charts
Outputs:
- data/results/workspace_v01_results.csv: 30 raw rows
- data/results/workspace_v01_summary.csv: 6-row aggregation
- assets/v01_utility_vs_security.png
- assets/v01_pareto.png
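The step from 30 raw rows to 6 summary rows is a group-by over (optimizer, attack). The sketch below uses only the standard library and hypothetical column names (`optimizer`, `attack`, `utility`, `security`, with 0/1 outcomes); the real CSV schema and the package's `summarize` function may differ.

```python
from collections import defaultdict


def summarize_rows(rows: list[dict]) -> list[dict]:
    """Hypothetical aggregation: average the 0/1 utility and security
    outcomes within each (optimizer, attack) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["optimizer"], row["attack"])].append(row)

    summary = []
    for (optimizer, attack), members in sorted(groups.items()):
        n = len(members)
        summary.append({
            "optimizer": optimizer,
            "attack": attack,
            "utility": sum(r["utility"] for r in members) / n,
            "security": sum(r["security"] for r in members) / n,
            "n": n,
        })
    return summary


# Toy rows with made-up outcomes, only to show the shape of the data.
toy_rows = [
    {"optimizer": "opt_a", "attack": "direct", "utility": u, "security": s}
    for u, s in [(1, 1), (0, 1), (1, 1), (1, 0), (0, 1)]
]
print(summarize_rows(toy_rows))
```

Here injection success would be the complement of security within each group, which matches the relationship between the last two percentage columns of the results table.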
Development
# install with dev extras (pytest, ruff, pytest-cov)
uv pip install -e ".[dev]"
# run the full test suite (61 tests, all offline / mocked — no API key needed)
pytest tests/ -v
# linting
ruff check dspy_security_bench/ tests/
ruff format dspy_security_bench/ tests/
The test suite covers env-data extraction, synthesis helpers, validator
checks, the AgentDojo wrapper (end-to-end against user_task_0 with
DummyLM), the optimizer harness, the LLM-as-judge metric, and the
runner's orchestration (with benchmark_suite_with_injections mocked).
Design decisions
These are documented in detail in ARCHITECTURE.md. The key v0.1 scope choices:
- Synthetic trainset, not held-out split. AgentDojo has only ~40 user tasks per suite, which is not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per suite from each of GPT-4o and Claude Sonnet, validate them against the env, and use the real AgentDojo tasks unmodified as the held-out test set.
- Query-only tasks for training; full action-task suite for testing. Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. Training on query-only tasks is acceptable because the research question is whether prompt optimization (not action selection) affects robustness.
- Hybrid metric: LLM-as-judge with substring fast-path for training (cheap + tolerant of paraphrasing); real AgentDojo utility() for testing (rigorous, the actual published benchmark).
- Single-output signature constraint on the DSPy program. The model's final output goes into AgentDojo's single model_output utility argument.
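The hybrid training metric in the list above can be pictured as a two-stage check. This is a minimal sketch under stated assumptions, not the implementation of LLMJudgeMetric: `hybrid_metric` is a hypothetical name, and `judge` is a stand-in callable for whatever LM-backed comparison the package performs.

```python
from typing import Callable


def hybrid_metric(gold: str, predicted: str, judge: Callable[[str, str], bool]) -> bool:
    """Hypothetical two-stage metric: accept immediately when the normalized
    gold answer appears inside the prediction, else defer to the judge."""
    g = " ".join(gold.lower().split())
    p = " ".join(predicted.lower().split())
    if g and g in p:
        return True  # fast path: no LM call needed
    return judge(gold, predicted)  # slow path: tolerant of paraphrasing


judge_calls = []


def fake_judge(gold: str, predicted: str) -> bool:
    """Stand-in for an LM judge; records calls and always accepts."""
    judge_calls.append((gold, predicted))
    return True


print(hybrid_metric("3 pm", "The meeting is at 3 PM.", fake_judge))  # True
print(len(judge_calls))                                             # 0
print(hybrid_metric("15:00", "It starts at 3 pm", fake_judge))       # True
print(len(judge_calls))                                             # 1
```

The fast path keeps training cheap when the answer is quoted verbatim, and the judge is only consulted when wording differs.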
Roadmap
| Milestone | Status |
|---|---|
| v0.1 — workspace suite × 2 attacks × 3 optimizers, headline finding | shipped |
| v0.2 — banking / travel / slack suites, GEPA optimizer, larger N | planned |
| v0.3 — adversarial trainset to study robust-by-construction optimization | planned |
| Paper — TMLR submission if v0.2 findings hold at scale | conditional |
Acknowledgments and prior work
This benchmark sits on top of:
- DSPy (Stanford NLP) — the optimizer framework being evaluated.
- AgentDojo (ETH Zurich, SPY lab) — the attack suite and task environments providing ground-truth robustness measurement.
It also draws on the broader 2024-26 prompt-security literature, including GEPA, BATprompt, Survival of the Safest, InjecAgent, and WASP.
Citation
If you use this benchmark in research or production, please cite:
@misc{ahamed2026dspysecuritybench,
title = {{dspy-security-bench}: Measuring optimizer-induced robustness in
agentic DSPy programs},
author = {Imran Ahamed},
year = {2026},
howpublished = {\url{https://github.com/immu4989/dspy-security-bench}},
}
License
Apache License 2.0 — see LICENSE.