An adversarial benchmark foundry for LLM safety: audit attack corpora, score benchmark staleness, compare defences (ASR + false-refusal), test multilingual over-refusal, and export safe challenge packs.
redteam-foundry
An adversarial benchmark foundry for LLM safety: audit attack corpora, measure defence impact, score benchmark staleness, test multilingual over-refusal, and export safe challenge packs, all with judge-validated, reproducible numbers. A measurement tool, not a weapon (see ETHICS.md).
Positioning
redteam-foundry is the upstream research layer of a two-layer AI-safety
stack. It validates adversarial corpora, measures defence effectiveness, studies
whether published benchmarks still measure real deployment risk, and exports safe
challenge packs for downstream release-gating systems to consume.
It deliberately does not make production release decisions. Ship / warn / block, incident replay, and policy-as-code gates belong in a separate deployment layer — see Relationship to agent-release-gates.
Validate the benchmark before you trust the gate.
| | This repo (research layer) | A release-gate layer |
|---|---|---|
| Job | Discover, validate, and package adversarial benchmarks | Replay incidents, apply policy, decide ship/warn/block |
| Output | Audited corpora, judge-validated ASR/defence measurements, challenge packs | Deployment evidence, release decisions |
| Question | "Is this benchmark still meaningful, and how much do I trust the score?" | "Is this agent safe to ship right now?" |
What it does
- Runs published adversarial prompts (AdvBench, JailbreakBench, HarmBench, AgentDojo — each pinned to an upstream commit) against target LLMs through composable, togglable defence stacks, and reports attack-success rate (ASR) with bootstrap confidence intervals and real API cost.
- Validates its own numbers: every verdict is scored by an LLM judge and re-scored by an independent second judge; agreement (Cohen's κ, Krippendorff's α) is a first-class output.
- Audits corpus quality: exact + near-duplicate detection (including cross-source overlap), language/script coverage, attack-family markers, and label-integrity checks → a quality report and a data card.
- Scores benchmark staleness: a transparent, component-broken-out heuristic answering "is this a robust model, or a stale benchmark?".
- Measures over-blocking: a benign control set (English + Traditional/Simplified Chinese, Japanese, Korean, and code-switched) yields false-refusal rate (FRR) and a combined safe-usefulness score per defence.
- Exports challenge packs: versioned, self-describing fixtures a downstream release gate can consume — with adversarial prompts redacted by default.
- Interoperates: any run exports to a UK AISI Inspect eval log.
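The near-duplicate check in the corpus audit can be pictured with a small sketch. This is not the project's implementation: the word-trigram shingling, the Jaccard similarity, the 0.8 threshold, and the function names `shingles`, `jaccard`, and `near_duplicates` are all assumptions chosen for illustration, and the sample prompts are benign placeholders.

```python
import re
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of a lowercased text with punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Short texts become a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(
    prompts: list[str], threshold: float = 0.8
) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair at or above the threshold.

    Pairwise comparison is O(n^2); it is fine for a sketch but a large
    corpus would need something like MinHash/LSH instead.
    """
    sets = [shingles(p) for p in prompts]
    pairs = []
    for i, j in combinations(range(len(prompts)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    sample = [
        "Please summarise the attached report in three bullet points.",
        "please summarise the attached report in three bullet points",
        "Translate this sentence into Korean.",
    ]
    for i, j, score in near_duplicates(sample):
        print(f"prompts {i} and {j} overlap with Jaccard {score:.2f}")
```

Word-level shingling relies on whitespace-separated tokens, so text in scripts written without spaces (Chinese, Japanese) would need character n-grams or a tokenizer instead.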
Headline finding
Across 2 target models, 2 benchmark families, and up to 4 composable defence configurations (12 evaluation cells), published adversarial prompts succeed between 0% and 4% of the time, and a paranoid prompt-only defence stack does not measurably move that number. Judge agreement on attack success is perfect — Cohen's κ = +1.00 in all 12 cells.
[!NOTE] A near-zero ASR is a result about the benchmark, not just the model. It can mean the model is robust or the benchmark is stale, and ASR alone cannot tell them apart. Measuring that difference (staleness, defence sensitivity, multilingual over-refusal) is what this foundry is for. Full numbers, validation, and limits are in METHODOLOGY.md.
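To make "ASR with bootstrap confidence intervals" concrete, here is a minimal percentile-bootstrap sketch. It assumes each evaluated prompt reduces to a 0/1 outcome (1 = attack judged successful); the function name `bootstrap_asr_ci`, the resample count, and the seed are illustrative choices, and the project's actual procedure is the one described in METHODOLOGY.md.

```python
import random


def bootstrap_asr_ci(
    outcomes: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point_estimate, lower, upper) for the attack-success rate.

    Uses a plain percentile bootstrap: resample the outcomes with
    replacement, recompute the rate each time, and read off the
    alpha/2 and 1 - alpha/2 quantiles of the resampled rates.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical cell: 2 successful attacks out of 50 prompts.
    cell = [1] * 2 + [0] * 48
    point, lower, upper = bootstrap_asr_ci(cell)
    print(f"ASR = {point:.2%}, 95% CI [{lower:.2%}, {upper:.2%}]")
```

One property worth knowing for near-zero results: when no attack succeeds, every resample is also all zeros, so a percentile bootstrap collapses to the interval [0, 0]. A zero-success cell may therefore need a different interval, such as a Wilson interval, to express the remaining uncertainty.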
Getting started
git clone https://github.com/rosscyking1115/redteam-foundry.git
cd redteam-foundry
uv venv --python 3.13
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
uv pip install -e ".[dev]"
cp .env.example .env # fill in ANTHROPIC_API_KEY
pre-commit install
pytest tests/unit # should pass green
redteam version # prints the installed version
The CLI is redteam ... (equivalently python -m redteam ...).
[!WARNING] Live runs call paid APIs. Each run enforces a hard USD budget cap (set per config in configs/), and the judge/target adapters enforce a per-call cap, but set a matching budget cap in your API provider's console before your first run anyway.
Commands
Benchmark research (the foundry) — offline, no API key
These analyse corpora and existing run artifacts; they need cached corpora but no live model calls.
# Audit corpora: duplicates, cross-source overlap, language + attack-family
# coverage, label issues -> quality report + data card + JSON.
redteam corpora audit --output reports/corpus_audit/
# Audit ANY Hugging Face adversarial dataset, not just the built-in four.
redteam corpora audit-hf --dataset owner/name --prompt-column prompt --revision <sha>
# Score benchmark staleness (heuristic). Pass --run for evaluation JSONs to
# light up the run-based components (universal-low-ASR, defence-insensitivity,
# judge-disagreement); corpus-only otherwise.
redteam corpora staleness --only agentdojo --run results/<run>.cross-judged.json
# Compare defences on ASR, false-refusal rate, safe-usefulness, cost, latency.
redteam compare-defences --run results/<adv>.judged.json --benign-run results/<benign>.json
# False-refusal rate broken down by language (over a benign run).
redteam frr-by-language --run results/<benign_multilingual>.json
# Export a versioned challenge pack (adversarial prompts redacted by default).
redteam export-pack --pack-id my-pack --only advbench
# Write the benign control sets to JSONL for inspection / running.
redteam benign export # English control set
redteam benign export --multilingual # zh-Hant/zh-Hans/ja/ko + code-switch
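As a rough picture of what a per-language false-refusal breakdown computes, the sketch below groups benign-run records by language and divides refusals by totals. The record shape (`language` and `refused` keys) and the function name `frr_by_language` are hypothetical; the real run JSON schema is not shown here and may differ.

```python
from collections import defaultdict


def frr_by_language(records: list[dict]) -> dict[str, float]:
    """False-refusal rate per language over a benign control run.

    Every prompt in a benign set should be answered, so any refusal
    counts as a false refusal.
    """
    totals: dict[str, int] = defaultdict(int)
    refused: dict[str, int] = defaultdict(int)
    for record in records:
        lang = record["language"]          # hypothetical field name
        totals[lang] += 1
        refused[lang] += bool(record["refused"])  # hypothetical field name
    return {lang: refused[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    # Made-up records purely to exercise the function.
    demo = [
        {"language": "en", "refused": False},
        {"language": "en", "refused": True},
        {"language": "ja", "refused": False},
        {"language": "ko", "refused": False},
    ]
    for lang, rate in sorted(frr_by_language(demo).items()):
        print(f"{lang}: FRR = {rate:.0%}")
```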
Measurement core — needs an API key / local Ollama
redteam corpora download # fetch + pin corpora
redteam run --config configs/run_anthropic_baseline.yaml # evaluate
redteam score --run results/<run>.json # LLM-judge scoring
redteam cross-judge --run results/<run>.judged.json # second judge + agreement
redteam export-inspect --run results/<run>.json # UK AISI Inspect log
Run redteam --help for the full command list; every sub-command has --help.
Why this reports ASR and not refusal rate
The cross-judge layer found that ASR is well-posed and refusal_rate is not:
the two judges agree perfectly on whether an attack succeeded, but disagree —
sometimes worse than chance — on whether a response was a "refusal", because an
indirect-injection task has two things that can be refused (the user's request
and the injected instruction). refusal_rate is therefore reported as a
descriptive signal of response style only, never as a safety metric. This is
documented, not hidden — see METHODOLOGY.md §7.
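"Worse than chance" has a precise meaning for Cohen's κ: the statistic is negative when two judges agree less often than their label frequencies alone would predict. The sketch below is a textbook implementation for two label sequences, not the project's code, and the name `cohens_kappa` is an illustrative choice.

```python
import math
from collections import Counter


def cohens_kappa(judge_a: list, judge_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each rater's label
    frequencies. Returns NaN when p_e == 1 (both raters used a single
    identical label for every item), where kappa is undefined.
    """
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(judge_a)
    p_o = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return math.nan
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Made-up verdicts: 1 = "yes", 0 = "no".
    print(cohens_kappa([1, 0, 0, 0], [1, 0, 0, 0]))  # full agreement
    print(cohens_kappa([1, 1, 0, 0], [0, 0, 1, 1]))  # systematic disagreement
```

In the first call the judges match on every item, so κ is 1; in the second they disagree on every item while using each label equally often, so κ is -1, the extreme case of agreement below chance.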
Relationship to agent-release-gates
This repository is the upstream adversarial benchmark layer: it validates static attack corpora, measures defence stacks, scores its own reliability, and exports safe challenge packs. Production release decisions — incident replay, policy-as-code gates, deployment evidence, and ship / warn / block recommendations — are deliberately out of scope. A useful mental model:
- redteam-foundry discovers, validates, and packages adversarial scenarios.
- A release-gate layer (agent-release-gates) consumes selected scenarios as regression and release-readiness checks.
A benchmark research tool should not be the thing that decides whether an agent
ships, and a release gate is only as trustworthy as the benchmarks feeding it.
(agent-release-gates is a companion project; this section documents the
intended split.)
Ethics
[!IMPORTANT] This project uses only published adversarial prompts and does not generate novel jailbreaks in any language. Excluded categories (CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are filtered at corpus-load time and verified by a CI test. Results are aggregate; exported adversarial prompts are redacted. The multilingual work is benign-only. Full policy in ETHICS.md.
If you are a model provider whose model is included and want example transcripts removed, email rosscyking@gmail.com — 24-hour removal commitment.
Development
scripts/ci_local.ps1 (Windows) and scripts/ci_local.sh (Linux/macOS) run the
exact same checks as CI — ruff lint, ruff format check, mypy, pytest. Green
locally means green on the PR. Run artifacts (results/), audit outputs
(reports/), and non-sample packs (challenge_packs/) are gitignored — all
re-creatable from configs.
Documentation
| File | What's in it |
|---|---|
| Finding: are jailbreak benchmarks still worth running? | The write-up: staleness, a cross-dataset quality scorecard, and the multilingual result |
| METHODOLOGY.md | Source of truth for every reported number; metric validation; limits |
| ETHICS.md | Excluded categories, redaction, disclosure, provider ToS |
| docs/ROADMAP.md | The foundry pivot, phase status, and follow-up hardening |
| CONTRIBUTING.md | Scope, dev setup, and the ethics rules for adding corpora |
| CHANGELOG.md | Release history |
| reports/samples/ | Committed real-data findings (staleness, defence comparison, data card) |
Citation
@software{redteam_foundry_2026,
title = {redteam-foundry: An adversarial benchmark foundry for LLM safety},
author = {Ross},
year = {2026},
url = {https://github.com/rosscyking1115/redteam-foundry}
}
Licence
MIT — see LICENSE.