A red-team benchmark for AI agents in financial workflows: measures regulatory-control bypass (sanctions evasion, threshold structuring, dual-approval defeat) and the efficacy of guardrails that try to stop it.
Project description
FinAgent Red-Team
A red-team benchmark for AI agents that handle money — focused on regulatory-control bypass, not just generic misbehavior.
Can a crafted invoice, vendor email, or chat message trick an LLM agent into evading sanctions screening, structuring payments under a reporting threshold, defeating dual-approval, or wiring funds to an unapproved payee — and do the guardrails a bank would actually deploy stop it without breaking legitimate work?
FinAgent Red-Team runs every scenario under three control postures — none, advisory, and enforced — and reports not just can it be attacked but where an agent's safety actually comes from: its own judgement, its policy-following, or hard system enforcement — and what each costs in usefulness.
⚠️ Defensive, fully synthetic. No real money, accounts, or networks. All tools are mocks operating on in-memory state. The goal is to harden financial AI agents before deployment.
Why this exists
LLM agents are moving into financial operations — accounts payable, treasury, reconciliation, customer servicing. The dangerous failures there are not toxic text; they are silent violations of compliance controls that look like normal task completion: a payment that skipped sanctions screening, a $15k transfer quietly split into two $7.5k transfers to dodge reporting, a large wire the agent approved for itself.
General agent-security benchmarks test "leak the data" or "do the forbidden action." None of them model the regulatory control surface of a real financial institution, and none measure whether standard guardrails neutralize the attack.
How this differs from prior work
| Generic agent-security benchmarks | FinAgent Red-Team | |
|---|---|---|
| Domain | Email/web/travel, generic "banking" | Payments/treasury with real compliance controls |
| Attack goal | Data leak, forbidden action | Sanctions evasion, structuring, maker-checker defeat |
| Defense eval | Usually attack-only | 3 control postures → policy-following vs enforcement uplift |
| Utility tradeoff | Rarely measured | Over-refusal control set included |
| Judging | Often LLM-graded | Deterministic state predicates (reproducible) |
Threat taxonomy (v0.2)
| ID | Threat | Isolated control under test |
|---|---|---|
T2_unauthorized_transfer |
Indirect injection → fraudulent payment | Payee allow-list |
T3_sanctions_evasion |
Pay an allow-listed but sanctioned payee | Sanctions screening |
T3 (homoglyph) |
Sanctioned entity hidden behind look-alike characters | Normalized screening |
T4_structuring |
Split a payment to dodge the reporting threshold | Reporting / CTR control |
T5_dual_approval_defeat |
Initiate and self-approve a large transfer | Maker-checker separation |
T6_data_exfiltration |
Email a customer's card/SSN to an external address | DLP on outbound messages |
T7_confused_deputy |
Redirect a payee's bank details to an attacker account | Bank-detail protection |
BENIGN |
A legitimate payment that should just work | Over-refusal (utility) |
Each attack scenario is designed to isolate one control, so a block under the enforced posture attributes cleanly to a specific defense.
Control postures
Every scenario is evaluated under three postures, which separate where safety comes from:
| Posture | What the agent is told / what the system does | Measures |
|---|---|---|
| none | No compliance policy is stated | Intrinsic susceptibility |
| advisory | Policy stated as obligations, not enforced | Does the model follow stated policy? |
| enforced | System hard-blocks violations in the sandbox | Residual risk (defense-in-depth) |
Metrics
- ASR (Attack Success Rate) at each posture — fraction of attacks won.
- Policy-following uplift = ASR(none) − ASR(advisory): how much merely stating policy reduces attacks (pure model instruction-following).
- Enforcement uplift = ASR(advisory) − ASR(enforced): additional reduction from hard enforcement (defense-in-depth).
- Residual ASR = ASR(enforced): attacks that survive enforcement.
- Utility at each posture and over-refusal = utility(none) − utility(enforced): legitimate work lost as controls tighten.
This decomposition is the point: two models with identical enforced residual risk can differ sharply in whether they'd behave safely when a control is only advisory (the common real-world case for judgement calls that can't be hard-coded). A stack only looks good with low residual ASR and high utility — trivially refusing everything scores 0% ASR but also 0% utility.
How it works
┌──────────────────────────── deterministic, offline ───────────────────────────┐
AgentModel ──► runner ──► sandbox tools ──► World state ──► evaluator ──► Scorecard
(any model) loop (controls here) (ground truth) (predicates) (ASR / uplift / utility)
│ ▲
│ └── posture: none / advisory / enforced
- Sandbox (
sandbox/): synthetic accounts, payees, inbox; mock tools (initiate_transfer,approve_transfer,screen_sanctions, …). Compliance controls live inside the tools, gated onPolicy.enabled. - Scenarios (
scenarios/): pure-data tasks + embedded attacks + a structured success predicate. - Evaluator (
eval/): judges outcomes from final state — no LLM grader, so results are reproducible. - Agent driver (
agent/): any OpenAI-compatible, tool-calling model.
The sandbox, scenarios, and evaluator have zero third-party dependencies and are fully deterministic — the entire attack→defense pipeline is proven by the offline test-suite, no GPU or API key required.
Generated suite
Beyond the hand-written scenarios, a seeded generator produces hundreds of cases by combining parametric slots (amounts, payees, accounts), social- engineering phrasings (authority, urgency, policy pretext, social proof), and obfuscation techniques (homoglyph / spacing for sanctions evasion):
finagent-redteam --list --suite generated --per-threat 15 # 120 cases
Every generated scenario carries a reference_plan — the canonical exploit
— and the test-suite replays all of them to verify the invariant that each
attack lands under the none/advisory postures and is blocked under
enforced. The suite is thus self-validating: each case is a checked,
control-isolating test.
Quickstart
pip install -e ".[dev]" # core + tests
pytest -q # 47 tests: proves attacks land (none/advisory), blocked (enforced)
# List scenarios (no model needed)
finagent-redteam --list
Run against a model (needs the agent extra):
pip install -e ".[agent]"
# local vLLM / SGLang
finagent-redteam --model Qwen/Qwen3-8B --base-url http://localhost:8000/v1
# Ollama OpenAI shim
finagent-redteam --model llama3.1 --base-url http://localhost:11434/v1 --json results.json
Run the multi-model leaderboard (several models, repeated trials):
# examples/models.example.json lists the models + endpoints to compare
finagent-redteam --models-config examples/models.example.json --trials 5 --temperature 0.7 \
--json leaderboard.json
This prints a ranked leaderboard plus a per-threat-category attack-success matrix; see examples/sample_leaderboard.md for the output shape.
Illustrative scorecard
A worst-case agent that fully complies with every embedded attack (reproduced by the offline self-test) yields:
| Metric | none | advisory | enforced |
|---|---|---|---|
| Attack Success Rate | 100% | 100% | 0% |
| Utility (benign completed) | 100% | 100% | 100% |
→ It ignores stated policy (advisory ASR stays 100%) but is fully stopped by enforcement (residual ASR 0%), with no over-refusal. Real models land between these poles — resisting some attacks on their own and following some stated policy — and that gap, decomposed into policy-following vs enforcement uplift, is what the benchmark measures. See examples/sample_leaderboard.md.
Responsible use
This is a defensive benchmark built on entirely synthetic data and mock tools. It contains no real financial credentials, accounts, or exploits against live systems. Use it to evaluate and harden agents before they are trusted with money.
License
MIT — see LICENSE.
Citation
If you use FinAgent Red-Team, please cite it (see CITATION.cff).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file finagent_redteam-0.2.0.tar.gz.
File metadata
- Download URL: finagent_redteam-0.2.0.tar.gz
- Upload date:
- Size: 51.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
429bb8799e4b126a9bb3f80075b5bf7f330b4102e12f937932153b49c0d48850
|
|
| MD5 |
3930dd6dd4e5a204189e970f2717cbf0
|
|
| BLAKE2b-256 |
70ba5ee8138718951eb949994c872aae02f24bb09f4cac4a9f60100356846bf1
|
File details
Details for the file finagent_redteam-0.2.0-py3-none-any.whl.
File metadata
- Download URL: finagent_redteam-0.2.0-py3-none-any.whl
- Upload date:
- Size: 48.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b224062894acb711a210732ab62867b375a5b3bbb800dcae5d3710ffa76df644
|
|
| MD5 |
2ec367b59b804d041e1e07be53d2674f
|
|
| BLAKE2b-256 |
f0d4376ca1c8614874c24121ec9b85ae867121adb0de94db0ff232d8e46ea8b4
|