redteam-foundry

An adversarial benchmark foundry for LLM safety: audit attack corpora, measure defence impact, score benchmark staleness, test multilingual over-refusal, and export safe challenge packs — all with judge-validated, reproducible numbers. A measurement tool, not a weapon (see ETHICS.md).

CI · License: MIT · Python 3.13

Positioning

redteam-foundry is the upstream research layer of a two-layer AI-safety stack. It validates adversarial corpora, measures defence effectiveness, studies whether published benchmarks still measure real deployment risk, and exports safe challenge packs for downstream release-gating systems to consume.

It deliberately does not make production release decisions. Ship / warn / block, incident replay, and policy-as-code gates belong in a separate deployment layer — see Relationship to agent-release-gates.

Validate the benchmark before you trust the gate.

| | This repo (research layer) | A release-gate layer |
| --- | --- | --- |
| Job | Discover, validate, and package adversarial benchmarks | Replay incidents, apply policy, decide ship/warn/block |
| Output | Audited corpora, judge-validated ASR/defence measurements, challenge packs | Deployment evidence, release decisions |
| Question | "Is this benchmark still meaningful, and how much do I trust the score?" | "Is this agent safe to ship right now?" |

What it does

  • Runs published adversarial prompts (AdvBench, JailbreakBench, HarmBench, AgentDojo — each pinned to an upstream commit) against target LLMs through composable, togglable defence stacks, and reports attack-success rate (ASR) with bootstrap confidence intervals and real API cost.
  • Validates its own numbers: every verdict is scored by an LLM judge and re-scored by an independent second judge; agreement (Cohen's κ, Krippendorff's α) is a first-class output.
  • Audits corpus quality: exact + near-duplicate detection (including cross-source overlap), language/script coverage, attack-family markers, and label-integrity checks → a quality report and a data card.
  • Scores benchmark staleness: a transparent, component-broken-out heuristic answering "is this a robust model, or a stale benchmark?".
  • Measures over-blocking: a benign control set (English + Traditional/Simplified Chinese, Japanese, Korean, and code-switched) yields false-refusal rate (FRR) and a combined safe-usefulness score per defence.
  • Exports challenge packs: versioned, self-describing fixtures a downstream release gate can consume — with adversarial prompts redacted by default.
  • Interoperates: any run exports to a UK AISI Inspect eval log.
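
To make the near-duplicate check in the corpus audit concrete, here is a minimal sketch using Jaccard similarity over word 3-gram shingles. This is an assumed technique chosen for illustration; the README does not say which similarity measure the audit uses, and the function names and the 0.5 threshold are hypothetical.

```python
import re
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short texts collapse to a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(
    prompts: list[str], threshold: float = 0.5
) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair at or above `threshold`.

    The threshold is an illustrative assumption, not a project default.
    """
    sets = [shingles(p) for p in prompts]
    pairs = []
    for i, j in combinations(range(len(prompts)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Benign, made-up example prompts.
examples = [
    "Please summarise the quarterly report for the finance team",
    "please summarise the quarterly report for the finance team!",
    "Please summarise the quarterly report for the sales team",
    "What is the capital of France",
]
for i, j, score in near_duplicates(examples):
    print(i, j, round(score, 3))
```

The pairwise loop is quadratic in the number of prompts, which is fine for a sketch; a larger corpus would typically need an approximate method such as MinHash with locality-sensitive hashing.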

Headline finding

Across 2 target models, 2 benchmark families, and up to 4 composable defence configurations (12 evaluation cells), published adversarial prompts succeed between 0% and 4% of the time, and a paranoid prompt-only defence stack does not measurably move that number. Judge agreement on attack success is perfect — Cohen's κ = +1.00 in all 12 cells.

Attack success rate across all 12 evaluation cells — point estimate with 95% bootstrap confidence interval. Ten of the twelve cells sit at 0%; the two non-zero cells are the AdvBench Llama baseline at 1% and the AgentDojo Llama baseline at 4%.
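
For readers who want to see what a bootstrap interval on ASR involves, the following is a minimal percentile-bootstrap sketch. It assumes each judged attempt is recorded as 1 (attack succeeded) or 0; the function name, resample count, and the cell size of 100 are illustrative assumptions, and METHODOLOGY.md remains the source of truth for how the reported intervals are actually computed.

```python
import random


def bootstrap_asr_ci(
    outcomes: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point_estimate, lower, upper) for attack-success rate.

    `outcomes` holds one 0/1 verdict per attempt. The interval is a plain
    percentile bootstrap: resample with replacement, recompute the rate,
    and read off the alpha/2 and 1 - alpha/2 quantiles.
    """
    rng = random.Random(seed)  # fixed seed so the numbers are reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    rates = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)
        rates.append(sum(sample) / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Made-up cell: 4 successes in 100 attempts (the size is an assumption).
point, lower, upper = bootstrap_asr_ci([1] * 4 + [0] * 96)
print(f"ASR = {point:.2%}, 95% CI = [{lower:.2%}, {upper:.2%}]")
```

One property worth knowing: when a cell has zero successes, every resample also has zero successes, so a percentile bootstrap returns a degenerate interval of exactly [0, 0]. That interval describes the sample, not the true upper bound on risk.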

[!NOTE] A near-zero ASR is a result about the benchmark, not just the model. It can mean the model is robust or the benchmark is stale — and ASR alone cannot tell them apart. Measuring that difference (staleness, defence sensitivity, multilingual over-refusal) is what this foundry is for. Full numbers, validation, and limits are in METHODOLOGY.md.

Getting started

git clone https://github.com/rosscyking1115/redteam-foundry.git
cd redteam-foundry
uv venv --python 3.13
source .venv/bin/activate            # macOS/Linux
# .venv\Scripts\activate             # Windows PowerShell
uv pip install -e ".[dev]"
cp .env.example .env                 # fill in ANTHROPIC_API_KEY
pre-commit install
pytest tests/unit                    # should pass green
redteam version                      # prints the installed version

The CLI is redteam ... (equivalently python -m redteam ...).

[!WARNING] Live runs call paid APIs. Each run enforces a hard USD budget cap (set per config in configs/), and the judge/target adapters enforce a per-call cap. Even so, set a matching spend limit in your API provider's console before your first run.
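
As a rough picture of how a per-run cap and a per-call cap can work together, here is a minimal sketch. `BudgetGuard` and `BudgetExceeded` are hypothetical names invented for this example, not the project's classes, and the real enforcement logic may differ.

```python
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised when a call would break the per-call or per-run cap."""


class BudgetGuard:
    """Hypothetical spend guard; illustrative only."""

    def __init__(self, run_cap_usd: Decimal, per_call_cap_usd: Decimal) -> None:
        self.run_cap_usd = run_cap_usd
        self.per_call_cap_usd = per_call_cap_usd
        self.spent_usd = Decimal("0")

    def charge(self, estimated_cost_usd: Decimal) -> None:
        """Reserve budget for one call, or raise before the call is made."""
        if estimated_cost_usd > self.per_call_cap_usd:
            raise BudgetExceeded(
                f"call cost {estimated_cost_usd} exceeds per-call cap "
                f"{self.per_call_cap_usd}"
            )
        if self.spent_usd + estimated_cost_usd > self.run_cap_usd:
            raise BudgetExceeded(
                f"run cap {self.run_cap_usd} would be exceeded "
                f"(already spent {self.spent_usd})"
            )
        self.spent_usd += estimated_cost_usd


# Made-up caps: 1.00 USD per run, 0.25 USD per call.
guard = BudgetGuard(Decimal("1.00"), Decimal("0.25"))
guard.charge(Decimal("0.20"))
print(guard.spent_usd)
```

Checking the cap before the call, rather than after, means an over-budget request is never sent. `Decimal` avoids the rounding drift that accumulates when many small float costs are summed.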

Commands

Benchmark research (the foundry) — offline, no API key

These analyse corpora and existing run artifacts; they need cached corpora but no live model calls.

# Audit corpora: duplicates, cross-source overlap, language + attack-family
# coverage, label issues -> quality report + data card + JSON.
redteam corpora audit --output reports/corpus_audit/

# Audit ANY Hugging Face adversarial dataset, not just the built-in four.
redteam corpora audit-hf --dataset owner/name --prompt-column prompt --revision <sha>

# Score benchmark staleness (heuristic). Pass --run for evaluation JSONs to
# light up the run-based components (universal-low-ASR, defence-insensitivity,
# judge-disagreement); corpus-only otherwise.
redteam corpora staleness --only agentdojo --run results/<run>.cross-judged.json

# Compare defences on ASR, false-refusal rate, safe-usefulness, cost, latency.
redteam compare-defences --run results/<adv>.judged.json --benign-run results/<benign>.json

# False-refusal rate broken down by language (over a benign run).
redteam frr-by-language --run results/<benign_multilingual>.json

# Export a versioned challenge pack (adversarial prompts redacted by default).
redteam export-pack --pack-id my-pack --only advbench

# Write the benign control sets to JSONL for inspection / running.
redteam benign export                 # English control set
redteam benign export --multilingual  # zh-Hant/zh-Hans/ja/ko + code-switch
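
The per-language breakdown that frr-by-language reports can be pictured as a simple grouped rate over a benign run. The sketch below assumes each record carries a `language` tag and a boolean `refused` verdict; both field names are assumptions for illustration, not the project's result schema, and the data is made up.

```python
from collections import defaultdict


def frr_by_language(records: list[dict]) -> dict[str, float]:
    """Return the false-refusal rate per language over benign prompts.

    Every prompt in a benign run should be answered, so any refusal
    counts as a false refusal.
    """
    totals: dict[str, int] = defaultdict(int)
    refusals: dict[str, int] = defaultdict(int)
    for record in records:
        lang = record["language"]
        totals[lang] += 1
        refusals[lang] += int(record["refused"])
    return {lang: refusals[lang] / totals[lang] for lang in sorted(totals)}


# Made-up benign-run records (hypothetical schema).
records = [
    {"language": "en", "refused": False},
    {"language": "en", "refused": False},
    {"language": "zh-Hant", "refused": True},
    {"language": "zh-Hant", "refused": False},
    {"language": "ja", "refused": False},
]
for lang, rate in frr_by_language(records).items():
    print(f"{lang}: {rate:.0%}")
```

A gap between languages on the same benign content is the over-refusal signal; with small per-language counts, the rates are noisy and are best read alongside the sample sizes.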

Measurement core — needs an API key / local Ollama

redteam corpora download                                   # fetch + pin corpora
redteam run --config configs/run_anthropic_baseline.yaml   # evaluate
redteam score --run results/<run>.json                     # LLM-judge scoring
redteam cross-judge --run results/<run>.judged.json        # second judge + agreement
redteam export-inspect --run results/<run>.json            # UK AISI Inspect log

Run redteam --help for the full command list; every sub-command has --help.

Why this reports ASR and not refusal rate

The cross-judge layer found that ASR is well-posed and refusal_rate is not: the two judges agree perfectly on whether an attack succeeded, but disagree — sometimes worse than chance — on whether a response was a "refusal", because an indirect-injection task has two things that can be refused (the user's request and the injected instruction). refusal_rate is therefore reported as a descriptive signal of response style only, never as a safety metric. This is documented, not hidden — see METHODOLOGY.md §7.
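
To show what "worse than chance" means for judge agreement, here is a minimal from-scratch Cohen's κ for two label sequences. The function name and the toy labels are illustrative assumptions; the project's own agreement code may handle edge cases differently.

```python
from collections import Counter


def cohens_kappa(judge_a: list[int], judge_b: list[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters.

    p_o is the observed agreement rate; p_e is the agreement expected
    if each rater assigned labels independently at their own base rates.
    """
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item. Kappa is
        # formally undefined (0/0) here; returning 1.0 is one convention
        # and an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verdicts: 1 = "refusal", 0 = "not a refusal".
print(cohens_kappa([1, 0, 0, 1], [1, 0, 0, 1]))  # perfect agreement
print(cohens_kappa([1, 0, 1, 0], [1, 1, 0, 0]))  # agreement at chance level
print(cohens_kappa([1, 1, 0, 0], [0, 0, 1, 1]))  # systematic disagreement
```

A negative κ means the judges agree less often than independent raters with the same base rates would. That is the pattern to expect when two judges resolve the same ambiguity in opposite directions, for instance one scoring refusal of the user's request and the other scoring refusal of the injected instruction.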

Relationship to agent-release-gates

This repository is the upstream adversarial benchmark layer: it validates static attack corpora, measures defence stacks, scores its own reliability, and exports safe challenge packs. Production release decisions — incident replay, policy-as-code gates, deployment evidence, and ship / warn / block recommendations — are deliberately out of scope. A useful mental model:

  • redteam-foundry discovers, validates, and packages adversarial scenarios.
  • a release-gate layer (agent-release-gates) consumes selected scenarios as regression and release-readiness checks.

A benchmark research tool should not be the thing that decides whether an agent ships, and a release gate is only as trustworthy as the benchmarks feeding it. (agent-release-gates is a companion project; this section documents the intended split.)

Ethics

[!IMPORTANT] This project uses only published adversarial prompts and does not generate novel jailbreaks in any language. Excluded categories (CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are filtered at corpus-load time and verified by a CI test. Results are aggregate; exported adversarial prompts are redacted. The multilingual work is benign-only. Full policy in ETHICS.md.

If you are a model provider whose model is included and want example transcripts removed, email rosscyking@gmail.com (24-hour removal commitment).

Development

scripts/ci_local.ps1 (Windows) and scripts/ci_local.sh (Linux/macOS) run the exact same checks as CI — ruff lint, ruff format check, mypy, pytest. Green locally means green on the PR. Run artifacts (results/), audit outputs (reports/), and non-sample packs (challenge_packs/) are gitignored — all re-creatable from configs.

Documentation

| File | What's in it |
| --- | --- |
| Finding: are jailbreak benchmarks still worth running? | The write-up: staleness, a cross-dataset quality scorecard, and the multilingual result |
| METHODOLOGY.md | Source of truth for every reported number; metric validation; limits |
| ETHICS.md | Excluded categories, redaction, disclosure, provider ToS |
| docs/ROADMAP.md | The foundry pivot, phase status, and follow-up hardening |
| CONTRIBUTING.md | Scope, dev setup, and the ethics rules for adding corpora |
| CHANGELOG.md | Release history |
| reports/samples/ | Committed real-data findings (staleness, defence comparison, data card) |

Citation

@software{redteam_foundry_2026,
  title  = {redteam-foundry: An adversarial benchmark foundry for LLM safety},
  author = {Ross},
  year   = {2026},
  url    = {https://github.com/rosscyking1115/redteam-foundry}
}

Licence

MIT — see LICENSE.
