Skip to main content

A reproducible, defensive red-team range for financial-services AI agents.

Project description

🛡️ FinAgent-RedRange

A reproducible, defensive red-team range for financial-services AI agents.

CI  PyPI  Python  Tests  Lint  Types  Code license  Docs license

Develop proof-of-concept exploits against a mock retail-banking agent, then prove that specific guardrails close each one — end to end, from POC through regression test.
Build the attack only to prove the defense.

🔒 Defensive research only. The single target is the bundled mock agent; all data is synthetic. Every exploit ships with the control that blocks it and a regression test that keeps it closed. See SECURITY.md.

📖 New here? Start with the guided walkthrough — a narrated tour of what the range does, how to run it, and what each output means.

At a glance

Scenarios 9 — prompt injection · data poisoning · excessive agency · system-prompt leakage · unsafe output · vector/embedding weakness · unbounded consumption · supply chain · multimodal injection
Coverage 10/10 OWASP LLM risks (9 dedicated POC+control scenarios — incl. a multimodal input surface — + LLM02 & LLM09 as impact tags) · both OWASP agentic schemes — Threats & Mitigations (T1/T2/T3/T4/T6) and the 2026 Top 10 for Agentic Applications (ASI01–04, 06, 09) · MITRE ATLAS · NIST AI RMF
Result every attack 🔴 exploited (controls off) → 🟢 blocked (controls on); mean AIRQ heuristic High → Medium
Extras permission-checked tool loop · sweep + adaptive-LLM autonomous attacker · semantic real-model oracle · md / json / html scorecard
Handouts ready-to-use exports for security teams — Sigma detection pack (measured precision) · SARIF 2.1.0 findings · GSN assurance case · regulatory crosswalk (NIST/ISO 42001/EU AI Act) · ATLAS Navigator coverage layer. See docs/HANDOUTS.md
Runs fully offline & deterministic — no API key · 97 tests green in CI (Python 3.11 / 3.12)
Try it pip install finagent-redrange && python -m finagent_redrange run (or pip install -e ".[dev]" from a clone)

FinAgent-RedRange scorecard — eight scenarios exploited with controls off and blocked with controls on, OWASP LLM and Agentic coverage matrices, and the autonomous-attacker result
The headline artifact: python -m finagent_redrange run regenerates this scorecard (md / json / html).


Threat model

flowchart LR
    U[User / Attacker] -->|prompt| GR_IN[Input guardrails]
    IMG[(Image input<br/>vision / OCR)] -->|extracted text| GR_IN
    GR_IN --> A[Banking agent<br/>LLM planner]
    DOCS[(Policy & knowledge<br/>RAG corpus)] -->|retrieved context| GR_RET[Retrieval guardrails<br/>allowlist · integrity · provenance]
    GR_RET --> A
    A -->|tool calls| GR_ACT[Action guardrails<br/>high-risk authorization]
    GR_ACT --> T{Tool layer<br/>+ permission checks}
    T --> BAL[get_balance]
    T --> XFER[transfer_funds]
    T --> KYC[lookup_kyc]
    T --> TXN[list_transactions]
    T --> TIX[create_support_ticket]
    A -->|draft answer| GR_OUT[Output guardrails<br/>PII · leak · link filters]
    GR_OUT -->|final answer| U

    %% attack surfaces (the nine scenarios — full OWASP LLM Top 10 + a multimodal input surface)
    INJ([Indirect prompt injection]):::atk -.poisons.-> DOCS
    POI([Data poisoning]):::atk -.corrupts.-> DOCS
    AGY([Excessive agency]):::atk -.coerces.-> T
    LEAK([System-prompt leakage]):::atk -.extracts.-> A
    OUT([Unsafe output handling]):::atk -.rides out via.-> GR_OUT
    VEC([Vector/embedding weakness]):::atk -.cross-tenant leak via.-> DOCS
    CON([Unbounded consumption]):::atk -.floods.-> T
    SUP([Supply chain]):::atk -.injects malicious tool into.-> T
    MM([Multimodal injection]):::atk -.hides in.-> IMG

    classDef atk fill:#fde,stroke:#c39,color:#000;

Modeled attack surfaces: indirect prompt injection via retrieved documents, data poisoning of the trusted knowledge store, excessive agency / tool misuse, system-prompt leakage, unsafe output handling, vector/embedding weakness (cross-session retrieval), unbounded consumption (tool-budget exhaustion), supply chain (malicious third-party tool), and multimodal injection (an instruction hidden in an uploaded image's OCR text) — full OWASP LLM Top 10 coverage plus a multimodal input surface. Surfaces and findings are mapped to OWASP LLM Top 10, both OWASP agentic schemes (Threats & Mitigations T1–T15 and the 2026 Top 10 for Agentic Applications ASI01–ASI10), MITRE ATLAS, and NIST AI RMF below.

Mitigation-validation results

The point of the range: each POC must land with controls off and fail with controls on. Run python -m finagent_redrange run to regenerate results/scorecard.{md,json,html}.

Scenario OWASP LLM Agentic (T&M · Top 10) ATLAS AIRQ (off→on) Controls off Controls on Validating control
Indirect prompt injection (cross-account PII) LLM01 · LLM02 T6 · ASI01 AML.T0051.001 High → Medium 🔴 exploited 🟢 blocked Output PII filter (+ provenance)
Data poisoning (fabricated policy) LLM04 · LLM09 T1 · ASI06 AML.T0070 High → Medium 🔴 exploited 🟢 blocked Source allowlist + integrity hash
Excessive agency (autonomous transfer) LLM06 · LLM01 T2 · T3 · ASI02 · ASI03 AML.T0053 High → Medium 🔴 exploited 🟢 blocked Action-authorization guardrail
System-prompt leakage LLM07 · LLM01 AML.T0056 Medium → Low 🔴 exploited 🟢 blocked Output system-prompt-leak detector
Unsafe output handling (malicious link) LLM05 · LLM02 ASI09 AML.T0052.000 Medium → Low 🔴 exploited 🟢 blocked Output link/markup sanitiser
Vector/embedding weakness (cross-session leak) LLM08 · LLM02 ASI03 AML.T0057 High → Medium 🔴 exploited 🟢 blocked Access-scoped retrieval
Unbounded consumption (tool-budget exhaustion) LLM10 T4 AML.T0034 Medium → Low 🔴 exploited 🟢 blocked Per-request tool-call budget
Supply chain (malicious third-party tool) LLM03 ASI04 AML.T0010.001 High → Medium 🔴 exploited 🟢 blocked Verified-publisher tool allowlist
Multimodal injection (image-borne instruction) LLM01 ASI01 AML.T0051 Medium → Low 🔴 exploited 🟢 blocked Multimodal input guardrail (OCR as data)

Regenerated on each run. Nine scenarios are dedicated POC+control pairs covering the full OWASP LLM Top 10 (LLM01/03/04/05/06/07/08/10 as primary risks, plus a multimodal input surface under LLM01), and all 10 once impact tags (LLM02, LLM09) are counted. The Agentic column carries both OWASP agentic schemes — the "Agentic AI — Threats and Mitigations" taxonomy (T1–T15) and the 2026 "Top 10 for Agentic Applications" (ASI01–ASI10); a cell is blank where no honest mapping exists in either. AIRQ (a heuristic defined for this project, not an external standard; AS = Attack Surface, BR = Blast Radius, DC = Defense Controls) is an illustrative analyst heuristic for prioritization, not a calibrated metric — the controls-on DC is the control's asserted strength, so "High → Medium" is the intended mitigation effect, not a measured residual-risk number. ATLAS rows are closest-fit (see the scorecard's Notes). Full matrix in results/scorecard.md.

Strategy-sweep attacker

python -m finagent_redrange auto turns an attacker loose on an objective ("extract the agent's hidden system prompt"). Two planners share one seam (attacker/engine.py): the default --planner sweep runs a deterministic fixed product of seed payloads × transforms (base64, role-play, crescendo) — offline and CI-friendly — while --planner llm (pair with --model claude) is an adaptive LLM planner that reasons about which seed + transform to try next given what's already been tried and whether it landed. With controls off it lands; with controls on it is defeated by layered defense — the base64-obfuscated probe slips past the input filter but the output canary detector catches the leak, and the direct phrasings are caught by the input filter. The headline defensive result: the control holds even as the attacker works through every strategy it has.

Quickstart

git clone https://github.com/emmanuelgjr/finagent-redrange.git && cd finagent-redrange
python -m venv .venv && source .venv/bin/activate   # Windows (PowerShell): .venv\Scripts\Activate.ps1
pip install -e ".[dev]"

# offline, deterministic (no API key needed) — uses the EchoClient
python -m finagent_redrange run             # all 8 scenarios, controls off then on -> scorecard
python -m finagent_redrange run --handouts  # + Sigma pack, SARIF, GSN assurance case (docs/HANDOUTS.md)
python -m finagent_redrange auto            # turn the autonomous attacker loose on an objective

# against a real model (full tool-execution loop with permission-checked tools)
cp .env.example .env             # add ANTHROPIC_API_KEY
pip install -e ".[anthropic]"    # real-model runs also need the Anthropic SDK
python -m finagent_redrange run --model claude --controls off
python -m finagent_redrange run --model claude --controls on   # mitigations enabled

pytest -q   # regression suite: with controls on, every known attack must stay blocked

Outputs land in results/ as scorecard.md (the table above), scorecard.json (machine-readable, CI-friendly), and scorecard.html (a standalone styled report for screen-sharing). Adding --handouts also writes results/sigma/ (Sigma detection rules + a labeled-replay precision report), results/findings.sarif (SARIF 2.1.0), results/assurance/ (a GSN control-effectiveness assurance case), results/compliance/ (a regulatory control crosswalk to NIST AI RMF / ISO 42001 / EU AI Act), and results/navigator/ (a MITRE ATLAS Navigator coverage layer). All are regenerated on each run; none are committed. See docs/HANDOUTS.md for what each is, what it provides per persona, and how its precision is validated.

Architecture

Package Responsibility
target/ The system under test — a mock banking agent: a plan→act→observe tool loop over permission-checked tools, with toggleable input / retrieval / action / output guardrails
attacker/ Red-team engine: scripted run_campaign + autonomous run_autonomous (composes seeds × transforms until an oracle fires)
scenarios/ One attack class per file (9): indirect prompt injection, data poisoning, excessive agency, system-prompt leakage, unsafe output handling, vector/embedding weakness, unbounded consumption, supply chain, multimodal injection — full OWASP LLM Top 10 coverage + a multimodal input surface
scoring/ Framework crosswalk (OWASP / ATLAS / NIST) + AIRQ risk scoring + scorecard renderer (md / json / html)
exports/ Handout exporters generated from Findings — Sigma detection pack + labeled-replay precision harness, SARIF 2.1.0 findings, GSN assurance case, regulatory crosswalk (NIST/ISO 42001/EU AI Act), ATLAS Navigator coverage layer (see docs/HANDOUTS.md)
llm/ Provider-agnostic client returning structured ModelResponse (text + tool calls); EchoClient runs offline for tests, AnthropicClient for real-model runs

Full design notes for contributors (human or agent) live in CLAUDE.md.

Why this design

  • POC-to-validation, not POC-alone. A finding isn't done until the control that blocks it is proven by a passing regression test. That's the loop a bank actually needs.
  • Framework-mapped by construction. Findings carry OWASP/ATLAS/NIST IDs and AIRQ sub-scores as structured fields, so they drop straight into governance and audit workflows.
  • Black/grey-box discipline. The attacker only touches the agent's public respond() surface — the same position a real adversary occupies.
  • Reproducible. One-command run and a deterministic offline mode; CI exercises the suite on Python 3.11 / 3.12.
  • Honest crosswalk, adversarially reviewed. Framework IDs were verified against the published standards (e.g. OWASP LLM05 2025 = Improper Output Handling; agentic threats use the OWASP T1–T15 scheme), and a multi-agent adversarial review hardened the oracles so each scenario is blocked by the control its scorecard names — not incidentally by another.

Roadmap

  • Autonomous attacker-agent loop ✅ shipped (attacker/run_autonomous).
  • LLM-driven attacker planner ✅ shipped — the planner is now a pluggable seam with two implementations: the deterministic SweepPlanner (offline default) and an adaptive LLMPlanner that reasons about the next seed + transform from the feedback of prior attempts (auto --planner llm --model claude).
  • Excessive agency, system-prompt leakage, unsafe output handling scenarios ✅ shipped.
  • Semantic oracles for real-model runs ✅ shipped (scenarios/judge.py: an adoption-vs-refutation judge — deterministic offline, a semantic LLM judge on --model claude — so a model that quotes a poisoned claim to refute it is scored as a refusal, not an exploit).
  • Fill the remaining OWASP gaps (LLM03 supply chain, LLM08 vector/embedding, LLM10 unbounded consumption) ✅ shipped — full OWASP LLM Top 10 coverage (8 dedicated POC+control scenarios).
  • CI regression gate ✅ shipped (ruff + mypy + pytest on Python 3.11/3.12).
  • Ready-to-use handout exports for security teams ✅ shipped — a Sigma detection pack with a labeled-replay precision gate (8 TP / 0 FP / 0 FN), a SARIF 2.1.0 findings run, a GSN control-effectiveness assurance case with zero-orphan-claim traceability, and an interpretive regulatory crosswalk (NIST AI RMF + GenAI Profile, ISO/IEC 42001, EU AI Act) with declared-vs-interpretive provenance labeling (exports/, run --handouts). See docs/HANDOUTS.md.
  • Multimodal attack surfaces ✅ shipped — a multimodal injection scenario: an instruction hidden in an uploaded image's OCR text, blocked by a multimodal input guardrail that treats extracted image text as untrusted data (target/agent.py gained an optional images= surface).
  • Publish to PyPI ✅ shipped — finagent-redrange on PyPI (pip install finagent-redrange), released via a secure OIDC Trusted-Publishing workflow (.github/workflows/publish.yml) — no token stored.
  • Seed the attacker from a larger real-world incident dataset ✅ shipped — the optional [incidents] extra (pip install "finagent-redrange[incidents]") seeds the autonomous attacker from the genai-incidents corpus (12k+ real GenAI/agentic incidents): SeedLibrary.from_genai_incidents() maps each incident to a scenario technique, orders by real-world severity, and records its provenance. Scope-safe: incidents supply the technique + prioritization + provenance only — every payload stays a synthetic mock-agent probe (no incident text is reproduced). Data credited under CC BY 4.0 (see NOTICE).

License & citation

Dual-licensed so the work stays usable while attribution stays required:

  • CodeApache License 2.0: permissive, with an explicit patent grant and attribution propagation via the NOTICE file.
  • Documentation & research (the docs/ directory, this README, and the generated scorecards) — Creative Commons Attribution 4.0 (CC BY 4.0): reuse freely, but credit the author by name and link back.

If you use this project, its harness, its framework crosswalk, or its findings, please cite it — see CITATION.cff (GitHub's "Cite this repository" button). © 2026 Emmanuel Guilherme Junior.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

finagent_redrange-0.6.0.tar.gz (287.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

finagent_redrange-0.6.0-py3-none-any.whl (101.4 kB view details)

Uploaded Python 3

File details

Details for the file finagent_redrange-0.6.0.tar.gz.

File metadata

  • Download URL: finagent_redrange-0.6.0.tar.gz
  • Upload date:
  • Size: 287.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for finagent_redrange-0.6.0.tar.gz
Algorithm Hash digest
SHA256 f1e443ed78fb864f5aae515b4d42a95330a8a574b9227be41118d1b9ab9a8324
MD5 b8619f3a221c582929e237cc351cc2ea
BLAKE2b-256 ff08cad9f7605186c37d5df05b3ec6697494945b6b9e89ac8849f258daa19065

See more details on using hashes here.

Provenance

The following attestation bundles were made for finagent_redrange-0.6.0.tar.gz:

Publisher: publish.yml on emmanuelgjr/finagent-redrange

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file finagent_redrange-0.6.0-py3-none-any.whl.

File metadata

File hashes

Hashes for finagent_redrange-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c7a2b6577f542ab56efcdc8eb5e504ed69332dfdbc6ed866fd1f0d60b31c8933
MD5 da76b7d7c59063571cc49d07cabba93f
BLAKE2b-256 c6e937f3c6cf9562cb507c4a7bc13f11ef0fa4cf0898fc856e624f3935e5fe91

See more details on using hashes here.

Provenance

The following attestation bundles were made for finagent_redrange-0.6.0-py3-none-any.whl:

Publisher: publish.yml on emmanuelgjr/finagent-redrange

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page