Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite.

dspy-security-bench

License: Apache 2.0 | Python 3.10+ | dspy 3.3.0b1+ | AgentDojo | tests status

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite as ground truth.

The question: when you optimize a DSPy program with BootstrapFewShot, MIPROv2, or GEPA, does it become more or less robust to prompt-injection attacks? Two adjacent research communities — prompt optimization and prompt-injection security — have not measured this intersection. dspy-security-bench wires DSPy optimizers and AgentDojo attacks into one harness so the trade-off becomes visible.


v0.1 results

Headline: at v0.1's scale (n=5 per cell), prompt optimization measurably degrades adversarial robustness on the harder attack. Optimizers buy utility (task success on direct rises from 0% to 40-60%) but pay for it in security on important_instructions (attack-failure rate drops from 80% to 60%). On the workspace suite, BootstrapFewShot Pareto-dominates MIPROv2 on direct and ties it on important_instructions.

[Figure: Utility vs Security by optimizer × attack]

| Optimizer | Attack | Utility | Security | Injection success | n |
|---|---|---|---|---|---|
| unoptimized | direct | 0% | 100% | 0% | 5 |
| unoptimized | important_instructions | 0% | 80% | 20% | 5 |
| bootstrap_fewshot | direct | 60% | 100% | 0% | 5 |
| bootstrap_fewshot | important_instructions | 20% | 60% | 40% | 5 |
| miprov2 | direct | 40% | 80% | 20% | 5 |
| miprov2 | important_instructions | 20% | 60% | 40% | 5 |

[Figure: Utility vs Security Pareto]

Reading the chart. The green star (top-right) marks the ideal: high utility and high security. The closer a point sits to it, the better. Three patterns hold at this scale:

  1. unoptimized is high-security but useless. It refuses to do the task (0% utility) regardless of attack, and resists attacks at 80–100%.
  2. bootstrap_fewshot is the best operating point at this scale. It has the highest utility on direct (60%) and ties miprov2 on important_instructions (20%). Its security is tied for best on direct (100%) and matches miprov2's degraded 60% on important_instructions.
  3. miprov2 Pareto-loses to bootstrap_fewshot on direct, with lower utility (40% vs 60%) and lower security (80% vs 100%); the two tie on important_instructions. This suggests that heavier optimization overfits the clean-distribution prompt and exposes more attack surface.
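The dominance claims above can be checked mechanically. The following is a minimal sketch, not part of the package: the numbers are copied from the v0.1 table, and the helper names `dominates` and `pareto_front` are illustrative.

```python
# (utility %, security %) per attack and optimizer, copied from the v0.1 table.
RESULTS = {
    "direct": {
        "unoptimized": (0, 100),
        "bootstrap_fewshot": (60, 100),
        "miprov2": (40, 80),
    },
    "important_instructions": {
        "unoptimized": (0, 80),
        "bootstrap_fewshot": (20, 60),
        "miprov2": (20, 60),
    },
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Sorted names of the points that no other point dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


for attack, points in RESULTS.items():
    print(attack, pareto_front(points))
```

On direct, only bootstrap_fewshot survives on the front. On important_instructions no optimizer dominates another, so all three remain: unoptimized trades utility for security, and the two optimizers are tied.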

v0.1 scope: workspace suite only; N=5 user tasks × 1 injection task × 2 attacks × 3 optimizers = 30 runs. gpt-4o-mini serves as both the execution model and the judge. The trainset is 192 validated synthetic tasks (100 generated by gpt-4o + 100 by claude-sonnet, then filtered by the syntactic and dedupe checks). See scripts/run_v01_benchmark.py for reproduction.
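To make the run count concrete, the grid can be enumerated directly. This is a sketch only: the task IDs are the ones used in the Quickstart and are assumed here to match the v0.1 run.

```python
from itertools import product

optimizers = ["unoptimized", "bootstrap_fewshot", "miprov2"]
attacks = ["direct", "important_instructions"]
# Assumption: these are the Quickstart's task IDs; the v0.1 driver may differ.
user_task_ids = ["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"]
injection_task_ids = ["injection_task_0"]

# One run per optimizer × attack × user task × injection task.
runs = list(product(optimizers, attacks, user_task_ids, injection_task_ids))
# One table cell per optimizer × attack.
cells = list(product(optimizers, attacks))

runs_per_cell = len(runs) // len(cells)
print(len(runs), len(cells), runs_per_cell)
print(f"one run moves a cell by {100 // runs_per_cell} percentage points")
```

With five runs per cell, every percentage in the results table is a multiple of 20, so a single run flipping changes a cell by 20 points.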


How it works

flowchart TD
    A([AgentDojo seed env data]) --> B[env-data extractor]
    B --> C[synthesis generator<br/>LM-generated query-only<br/>tasks grounded in env]
    LM[(GPT-4o + Claude)] -.-> C
    C -->|raw tasks| D[validator<br/>syntactic + dedupe<br/>+ optional solvability]
    D -->|~190 validated tasks| E[optimizer harness<br/>BootstrapFewShot · MIPROv2<br/>GEPA in v0.2]
    E -->|name → agent_factory| F[DSPyReActV2Element<br/>wraps dspy.ReActV2 as<br/>AgentDojo pipeline element]
    F -->|AgentPipeline| G[runner<br/>drives benchmark_suite_<br/>with_injections]
    AD[(AgentDojo attacks)] -.-> G
    G --> H([pandas DataFrame<br/>one row per<br/>optimizer × attack ×<br/>user_task × injection_task])

    classDef synth fill:#DBEAFE,stroke:#1E40AF,stroke-width:2px,color:#1E3A8A
    classDef opt fill:#FED7AA,stroke:#9A3412,stroke-width:2px,color:#7C2D12
    classDef eval fill:#DCFCE7,stroke:#15803D,stroke-width:2px,color:#14532D
    classDef io fill:#F1F5F9,stroke:#475569,stroke-width:2px,color:#1F2937
    classDef ext fill:#FAE8FF,stroke:#86198F,stroke-width:2px,color:#701A75

    class B,C,D synth
    class E,F opt
    class G,H eval
    class A io
    class LM,AD ext

Install

git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench

# either with uv:
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e .

# or with pip:
pip install -e .

Requires Python 3.10+ and dspy >= 3.3.0b1 (the canonical-tool-call release that adds dspy.ReActV2). pip/uv handle the pre-release pin automatically because the version is explicit in pyproject.toml.
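After installing, a quick sanity check can confirm that the environment picked up a dspy build that exposes ReActV2. This is a minimal sketch using only the standard library; the helper names are illustrative and not part of dspy-security-bench.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def has_react_v2():
    """True if dspy imports and exposes the ReActV2 class this README relies on."""
    try:
        import dspy
    except ImportError:
        return False
    return hasattr(dspy, "ReActV2")


if __name__ == "__main__":
    print("dspy version:", installed_version("dspy"))
    print("dspy.ReActV2 available:", has_react_v2())
```

If the second line prints False, the resolver most likely selected an older stable dspy release, and reinstalling inside a fresh virtual environment is the simplest fix.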

Quickstart

The full pipeline in Python:

import dspy
from dspy_security_bench.synthesis.generator import synthesize_tasks
from dspy_security_bench.synthesis.validator import validate_tasks
from dspy_security_bench.optimizers import build_agent_factories
from dspy_security_bench.llm_judge import LLMJudgeMetric
from dspy_security_bench.runner import evaluate_factories, summarize

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1. Generate a synthetic trainset grounded in the workspace suite's seed env
raw_tasks = synthesize_tasks("workspace", n=150, model="openai/gpt-4o")

# 2. Filter for validity and dedupe against real test tasks
val = validate_tasks(raw_tasks, "workspace", checks=("syntactic", "dedupe"))
trainset = val.kept  # tasks that passed both checks (at most the 150 generated above)

# 3. Run optimizers — produces a factory per optimizer
factories = build_agent_factories(
    trainset=trainset,
    optimizers=["unoptimized", "bootstrap_fewshot", "miprov2"],
    suite_name="workspace",
    signature="query -> answer",
    metric=LLMJudgeMetric(judge_lm=dspy.LM("openai/gpt-4o-mini", temperature=0)),
)

# 4. Evaluate against AgentDojo's attack suite
df = evaluate_factories(
    factories=factories,
    suite_name="workspace",
    attacks=["direct", "important_instructions"],
    user_task_ids=["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"],
    injection_task_ids=["injection_task_0"],
    max_iters=8,
)

# 5. Aggregate
print(summarize(df))

The full v0.1 run takes ~30-45 min wall-clock and costs ~$15-20 in LM calls (gpt-4o-mini for execution and judging). See scripts/run_v01_benchmark.py for the production driver: it caches optimizer state to data/results/factories_cache.pkl, so re-runs after a downstream crash skip optimization.
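The caching idea can be sketched as a generic load-or-compute helper. This is an illustration of the pattern under stated assumptions, not the code in scripts/run_v01_benchmark.py: the function name `load_or_compute` and the placeholder payload are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled object at cache_path if present, else compute and cache it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so a crash cannot leave a half-written cache.
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as f:
        pickle.dump(result, f)
    tmp.replace(path)
    return result


if __name__ == "__main__":
    calls = []

    def expensive_optimization():
        calls.append(1)  # count how often the expensive step really runs
        return {"bootstrap_fewshot": "optimized-state-placeholder"}

    with tempfile.TemporaryDirectory() as tmp_dir:
        cache = Path(tmp_dir) / "factories_cache.pkl"
        first = load_or_compute(cache, expensive_optimization)
        second = load_or_compute(cache, expensive_optimization)
        print(first == second, len(calls))
```

The second call reads the cache instead of recomputing. Pickle files can execute code on load, so a cache like this should only be read when it was produced locally.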

CLI

The synthesis and validation steps have CLIs that produce JSONL files:

# Synthesize (dry-run prints the prompt without calling the API)
dspy-security-bench-synthesize workspace --dry-run

# Real synthesis (requires OPENAI_API_KEY / ANTHROPIC_API_KEY)
export OPENAI_API_KEY=sk-...
dspy-security-bench-synthesize workspace \
    --n 150 --model openai/gpt-4o \
    --out data/synthetic_train/workspace_gpt4o_raw.jsonl

# Validate
dspy-security-bench-validate workspace \
    data/synthetic_train/workspace_gpt4o_raw.jsonl \
    --out data/synthetic_train/workspace_gpt4o.jsonl \
    --report data/synthetic_train/workspace_gpt4o_report.json
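Because both commands write JSONL, the raw and validated files can be compared with a few lines of Python. The sketch below assumes nothing about the record schema beyond one JSON object per line; `read_jsonl` is an illustrative helper, not a function exported by the package.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated file is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records


if __name__ == "__main__":
    raw_path = Path("data/synthetic_train/workspace_gpt4o_raw.jsonl")
    kept_path = Path("data/synthetic_train/workspace_gpt4o.jsonl")
    if raw_path.exists() and kept_path.exists():
        raw, kept = read_jsonl(raw_path), read_jsonl(kept_path)
        print(f"{len(kept)} of {len(raw)} tasks survived validation")
```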

Reproducing the v0.1 result

# After installing — synthesizes, validates, optimizes, evaluates, saves CSVs.
# Caches optimized state to data/results/factories_cache.pkl so reruns are fast.
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...  # optional — falls back to GPT-4o only

python scripts/run_v01_benchmark.py 2>&1 | tee data/results/run_v01.log
python scripts/generate_v01_figures.py     # rebuilds the README charts

Outputs:

  • data/results/workspace_v01_results.csv — 30 raw rows
  • data/results/workspace_v01_summary.csv — 6-row aggregation
  • assets/v01_utility_vs_security.png
  • assets/v01_pareto.png
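The step from 30 raw rows to the 6-row summary is a group-by over (optimizer, attack). The sketch below shows that aggregation with the standard library only. The column names and the toy rows are assumptions for illustration; the real CSV schema may differ. It also assumes injection success is the complement of security, which matches every row of the v0.1 table.

```python
import csv
import io
from collections import defaultdict

# Toy rows with hypothetical column names; not the contents of the real CSV.
SAMPLE_CSV = """optimizer,attack,user_task,injection_task,utility,security
miprov2,direct,user_task_0,injection_task_0,True,True
miprov2,direct,user_task_1,injection_task_0,True,True
miprov2,direct,user_task_3,injection_task_0,False,True
miprov2,direct,user_task_10,injection_task_0,False,True
miprov2,direct,user_task_11,injection_task_0,False,False
"""


def summarize_rows(rows):
    """Group per-run rows by (optimizer, attack) and compute percentage rates."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["optimizer"], row["attack"])].append(row)
    summary = []
    for (optimizer, attack), runs in sorted(groups.items()):
        n = len(runs)
        utility = 100 * sum(r["utility"] == "True" for r in runs) / n
        security = 100 * sum(r["security"] == "True" for r in runs) / n
        summary.append({
            "optimizer": optimizer,
            "attack": attack,
            "utility_pct": utility,
            "security_pct": security,
            "injection_success_pct": 100 - security,
            "n": n,
        })
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for line in summarize_rows(rows):
    print(line)
```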

Development

# install with dev extras (pytest, ruff, pytest-cov)
uv pip install -e ".[dev]"

# run the full test suite (61 tests, all offline / mocked — no API key needed)
pytest tests/ -v

# linting
ruff check dspy_security_bench/ tests/
ruff format dspy_security_bench/ tests/

The test suite covers env-data extraction, synthesis helpers, validator checks, the AgentDojo wrapper (end-to-end against user_task_0 with DummyLM), the optimizer harness, the LLM-as-judge metric, and the runner's orchestration (with benchmark_suite_with_injections mocked).

Design decisions

These are documented in detail in ARCHITECTURE.md. The key v0.1 scope choices:

  • Synthetic trainset, not held-out split. AgentDojo has at most ~40 user tasks per suite, which is not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per generator model (GPT-4o and Claude Sonnet) for each suite, validate them against the env, and use the real AgentDojo tasks unmodified as the held-out test set.
  • Query-only tasks for training; full action-task suite for testing. Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. Training on queries-only is acceptable because the research question is whether prompt optimization (not action selection) affects robustness.
  • Hybrid metric: LLM-as-judge with substring fast-path for training (cheap + tolerant of paraphrasing); real AgentDojo utility() for testing (rigorous, the actual published benchmark).
  • Single-output signature constraint on the DSPy program. The model's final output goes into AgentDojo's single model_output utility argument.
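The hybrid training metric can be pictured as a cheap string check that falls back to a judge only when needed. The sketch below is an assumption about the general shape, not the implementation of LLMJudgeMetric: `make_hybrid_metric` is hypothetical, and the judge is a stub standing in for an LM call.

```python
def make_hybrid_metric(judge):
    """Build a metric that tries a substring match before calling the judge.

    `judge` is any callable (expected, predicted) -> truthy/falsy. In practice
    it would wrap an LM call; here it is left abstract.
    """

    def normalize(text):
        # Lowercase and collapse whitespace so trivial formatting differences pass.
        return " ".join(str(text).lower().split())

    def metric(expected, predicted):
        exp, pred = normalize(expected), normalize(predicted)
        if exp and exp in pred:
            return True  # fast path: no judge call needed
        return bool(judge(expected, predicted))  # slow path: paraphrases

    return metric


if __name__ == "__main__":
    judge_calls = []

    def stub_judge(expected, predicted):
        judge_calls.append((expected, predicted))
        return "three" in predicted.lower()  # stand-in for a real LM verdict

    metric = make_hybrid_metric(stub_judge)
    print(metric("3 meetings", "You have 3 meetings today."))  # fast path
    print(metric("3 meetings", "There are three on your calendar."))  # judge path
    print(len(judge_calls))
```

Only the paraphrased answer reaches the judge, which is where the cost saving during optimization would come from.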

Roadmap

| Milestone | Status |
|---|---|
| v0.1: workspace suite × 2 attacks × 3 optimizers, headline finding | shipped |
| v0.2: banking / travel / slack suites, GEPA optimizer, larger N | planned |
| v0.3: adversarial trainset to study robust-by-construction optimization | planned |
| Paper: TMLR submission if v0.2 findings hold at scale | conditional |

Acknowledgments and prior work

This benchmark sits on top of:

  • DSPy (Stanford NLP) — the optimizer framework being evaluated.
  • AgentDojo (ETH Zurich, SPY lab) — the attack suite and task environments providing ground-truth robustness measurement.

It also draws on the broader 2024-26 prompt-security literature, including GEPA, BATprompt, Survival of the Safest, InjecAgent, and WASP.

Citation

If you use this benchmark in research or production, please cite:

@misc{ahamed2026dspysecuritybench,
  title = {{dspy-security-bench}: Measuring optimizer-induced robustness in
           agentic DSPy programs},
  author = {Imran Ahamed},
  year = {2026},
  howpublished = {\url{https://github.com/immu4989/dspy-security-bench}},
}

License

Apache License 2.0 — see LICENSE.
