An adversarial benchmark foundry for LLM safety: audit attack corpora, score benchmark staleness, compare defences (ASR + false-refusal), test multilingual over-refusal, and export safe challenge packs.

redteam-foundry

An adversarial benchmark foundry for LLM safety: audit attack corpora, measure defence impact, score benchmark staleness, test multilingual over-refusal, and export safe challenge packs — all with judge-validated, reproducible numbers. A measurement tool, not a weapon (see ETHICS.md).

Positioning

redteam-foundry is the upstream research layer of a two-layer AI-safety stack. It validates adversarial corpora, measures defence effectiveness, studies whether published benchmarks still measure real deployment risk, and exports safe challenge packs for downstream release-gating systems to consume.

It deliberately does not make production release decisions. Ship / warn / block, incident replay, and policy-as-code gates belong in a separate deployment layer — see Relationship to agent-release-gates.

Validate the benchmark before you trust the gate.

| | This repo (research layer) | A release-gate layer |
|---|---|---|
| Job | Discover, validate, and package adversarial benchmarks | Replay incidents, apply policy, decide ship/warn/block |
| Output | Audited corpora, judge-validated ASR/defence measurements, challenge packs | Deployment evidence, release decisions |
| Question | "Is this benchmark still meaningful, and how much do I trust the score?" | "Is this agent safe to ship right now?" |

What it does

Runs published adversarial prompts (AdvBench, JailbreakBench, HarmBench, AgentDojo — each pinned to an upstream commit) against target LLMs through composable, togglable defence stacks, and reports attack-success rate (ASR) with bootstrap confidence intervals and real API cost.
Validates its own numbers: every verdict is scored by an LLM judge and re-scored by an independent second judge; agreement (Cohen's κ, Krippendorff's α) is a first-class output.
Audits corpus quality: exact + near-duplicate detection (including cross-source overlap), language/script coverage, attack-family markers, and label-integrity checks → a quality report and a data card.
Scores benchmark staleness: a transparent heuristic, reported component by component, that answers "is this a robust model, or a stale benchmark?".
Measures over-blocking: a benign control set (English, Traditional/Simplified Chinese, Japanese, Korean, and code-switched) yields false-refusal rate (FRR) and a combined safe-usefulness score per defence.
Exports challenge packs: versioned, self-describing fixtures a downstream release gate can consume — with adversarial prompts redacted by default.
Interoperates: any run exports to a UK AISI Inspect eval log.
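
As a rough illustration of the near-duplicate check mentioned in the corpus-audit item above, here is a minimal sketch using Jaccard similarity over word trigrams. The README does not say which similarity measure or threshold the tool uses, so the function names, the shingle size, and the threshold here are all assumptions, and the sample prompts are benign placeholders.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles for a prompt (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_near_duplicates(prompts, threshold=0.8, k=3):
    """Return (i, j, score) for every prompt pair at or above the threshold.

    Hypothetical helper: pairwise comparison is O(n^2), which is fine for a
    sketch but not for a large corpus.
    """
    sets = [shingles(p, k) for p in prompts]
    hits = []
    for i, j in combinations(range(len(prompts)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            hits.append((i, j, score))
    return hits


# Benign placeholder prompts, not taken from any corpus.
prompts = [
    "Please summarise the attached quarterly report in three bullet points",
    "Please summarise the attached quarterly report in five bullet points",
    "Translate this sentence into Japanese",
]
for i, j, score in find_near_duplicates(prompts, threshold=0.4):
    print(f"prompts {i} and {j} overlap: Jaccard={score:.2f}")
```

Cross-source overlap can be checked the same way by tagging each prompt with its source and keeping only pairs whose sources differ.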

Headline finding

Across 2 target models, 2 benchmark families, and up to 4 composable defence configurations (12 evaluation cells), published adversarial prompts succeed between 0% and 4% of the time, and a paranoid prompt-only defence stack does not measurably move that number. Judge agreement on attack success is perfect — Cohen's κ = +1.00 in all 12 cells.

Attack success rate across all 12 evaluation cells — point estimate with 95% bootstrap confidence interval. Ten of the twelve cells sit at 0%; the two non-zero cells are the AdvBench Llama baseline at 1% and the AgentDojo Llama baseline at 4%.
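
For readers unfamiliar with the interval quoted above, the following is a minimal sketch of a percentile bootstrap for ASR over binary attack-success verdicts. It is not the project's implementation: the function name, the resample count, and the 100-prompt cell size are assumptions chosen for illustration.

```python
import random


def asr_bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Return (asr, lower, upper) for a list of 0/1 attack-success verdicts."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample with replacement and recompute the success rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, rates[lo_idx], rates[hi_idx]


# Hypothetical cell: 4 successful attacks out of 100 prompts.
verdicts = [1] * 4 + [0] * 96
asr, lower, upper = asr_bootstrap_ci(verdicts)
print(f"ASR={asr:.2%}  95% CI=[{lower:.2%}, {upper:.2%}]")
```

Note that a cell with no successes at all produces a degenerate [0, 0] percentile interval, because every resample of all-zero verdicts is also all zeros. The interval is therefore uninformative exactly where ASR is 0%.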

[!NOTE] A near-zero ASR is a result about the benchmark, not just the model. It can mean the model is robust or the benchmark is stale — and ASR alone cannot tell them apart. Measuring that difference (staleness, defence sensitivity, multilingual over-refusal) is what this foundry is for. Full numbers, validation, and limits are in METHODOLOGY.md.

Getting started

git clone https://github.com/rosscyking1115/redteam-foundry.git
cd redteam-foundry
uv venv --python 3.13
source .venv/bin/activate            # macOS/Linux
# .venv\Scripts\activate             # Windows PowerShell
uv pip install -e ".[dev]"
cp .env.example .env                 # fill in ANTHROPIC_API_KEY
pre-commit install
pytest tests/unit                    # should pass green
redteam version                      # prints the installed version

The CLI is redteam ... (equivalently python -m redteam ...).

[!WARNING] Live runs call paid APIs. Each run enforces a hard USD budget cap (set per config in configs/), and the judge/target adapters enforce a per-call cap. Even so, set a matching budget cap in your API provider's console before your first run.
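
To make the two caps in the warning concrete, here is a minimal sketch of how a run-level cap and a per-call cap can be enforced together. The class and method names are hypothetical and are not the project's API; the real enforcement lives in the run configs and adapters.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would break the per-call or per-run cap."""


class BudgetGuard:
    """Hypothetical guard: reserve the estimated cost before each API call."""

    def __init__(self, run_cap_usd, per_call_cap_usd):
        self.run_cap_usd = run_cap_usd
        self.per_call_cap_usd = per_call_cap_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_cost_usd):
        # Reject a single call that is too expensive on its own.
        if estimated_cost_usd > self.per_call_cap_usd:
            raise BudgetExceeded("per-call cap exceeded")
        # Reject a call that would push the run total past the hard cap.
        if self.spent_usd + estimated_cost_usd > self.run_cap_usd:
            raise BudgetExceeded("run budget cap exceeded")
        self.spent_usd += estimated_cost_usd


guard = BudgetGuard(run_cap_usd=1.00, per_call_cap_usd=0.25)
guard.reserve(0.25)  # allowed: within both caps
try:
    guard.reserve(0.30)  # blocked: larger than the per-call cap
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

Checking before the call rather than after it is the point of the design: a cap that only fires once the money is spent is a report, not a limit.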

Commands

Benchmark research (the foundry) — offline, no API key

These analyse corpora and existing run artifacts; they need cached corpora but no live model calls.

# Audit corpora: duplicates, cross-source overlap, language + attack-family
# coverage, label issues -> quality report + data card + JSON.
redteam corpora audit --output reports/corpus_audit/

# Audit ANY Hugging Face adversarial dataset, not just the built-in four.
redteam corpora audit-hf --dataset owner/name --prompt-column prompt --revision <sha>

# Score benchmark staleness (heuristic). Pass --run for evaluation JSONs to
# light up the run-based components (universal-low-ASR, defence-insensitivity,
# judge-disagreement); corpus-only otherwise.
redteam corpora staleness --only agentdojo --run results/<run>.cross-judged.json

# Compare defences on ASR, false-refusal rate, safe-usefulness, cost, latency.
redteam compare-defences --run results/<adv>.judged.json --benign-run results/<benign>.json

# False-refusal rate broken down by language (over a benign run).
redteam frr-by-language --run results/<benign_multilingual>.json

# Export a versioned challenge pack (adversarial prompts redacted by default).
redteam export-pack --pack-id my-pack --only advbench

# Write the benign control sets to JSONL for inspection / running.
redteam benign export                 # English control set
redteam benign export --multilingual  # zh-Hant/zh-Hans/ja/ko + code-switch
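
The per-language false-refusal breakdown can be pictured as follows. This is a minimal sketch, assuming each benign-run record carries a language tag and a boolean refusal verdict; the field names `language` and `refused` and the sample records are hypothetical and do not reflect the tool's actual result schema.

```python
from collections import defaultdict


def frr_by_language(records):
    """Map each language tag to its false-refusal rate over benign prompts."""
    refused = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["language"]] += 1
        refused[rec["language"]] += int(rec["refused"])
    return {lang: refused[lang] / total[lang] for lang in sorted(total)}


# Hypothetical benign-run records; every prompt here is harmless, so any
# refusal counts as a false refusal.
benign_run = [
    {"language": "en", "refused": False},
    {"language": "en", "refused": False},
    {"language": "ja", "refused": True},
    {"language": "ja", "refused": False},
    {"language": "zh-Hant", "refused": False},
]
for lang, rate in frr_by_language(benign_run).items():
    print(f"{lang}: FRR={rate:.0%}")
```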

Measurement core — needs an API key / local Ollama

redteam corpora download                                   # fetch + pin corpora
redteam run --config configs/run_anthropic_baseline.yaml   # evaluate
redteam score --run results/<run>.json                     # LLM-judge scoring
redteam cross-judge --run results/<run>.judged.json        # second judge + agreement
redteam export-inspect --run results/<run>.json            # UK AISI Inspect log

Run redteam --help for the full command list; every sub-command has --help.

Why this reports ASR and not refusal rate

The cross-judge layer found that ASR is well-posed and refusal_rate is not: the two judges agree perfectly on whether an attack succeeded, but disagree — sometimes worse than chance — on whether a response was a "refusal", because an indirect-injection task has two things that can be refused (the user's request and the injected instruction). refusal_rate is therefore reported as a descriptive signal of response style only, never as a safety metric. This is documented, not hidden — see METHODOLOGY.md §7.
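
The phrase "worse than chance" corresponds to a negative Cohen's κ. Below is a minimal sketch of the statistic for two judges, written from the standard definition rather than taken from the project. One detail is an assumption: when both judges give every item the same single label, κ is formally 0/0, and this sketch returns 1.0 by convention. How the project handles that case is not stated here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the judges match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each judge labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Assumption: identical constant labels are treated as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two judges on the same four responses.
success_a = [1, 0, 0, 1]
success_b = [1, 0, 0, 1]
refusal_a = ["refusal", "comply", "refusal", "comply"]
refusal_b = ["comply", "refusal", "comply", "refusal"]
print(cohens_kappa(success_a, success_b))   # full agreement
print(cohens_kappa(refusal_a, refusal_b))   # systematic disagreement
```

In the second pair the judges never agree, so κ falls below zero: they would have matched more often by labelling at random.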

Relationship to agent-release-gates

This repository is the upstream adversarial benchmark layer: it validates static attack corpora, measures defence stacks, scores its own reliability, and exports safe challenge packs. Production release decisions — incident replay, policy-as-code gates, deployment evidence, and ship / warn / block recommendations — are deliberately out of scope. A useful mental model:

redteam-foundry discovers, validates, and packages adversarial scenarios.
a release-gate layer (agent-release-gates) consumes selected scenarios as regression and release-readiness checks.

A benchmark research tool should not be the thing that decides whether an agent ships, and a release gate is only as trustworthy as the benchmarks feeding it. (agent-release-gates is a companion project; this section documents the intended split.)

Ethics

[!IMPORTANT] This project uses only published adversarial prompts and does not generate novel jailbreaks in any language. Excluded categories (CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are filtered at corpus-load time and verified by a CI test. Results are aggregate; exported adversarial prompts are redacted. The multilingual work is benign-only. Full policy in ETHICS.md.

If you are a model provider whose model is included and want example transcripts removed, email rosscyking@gmail.com — 24-hour removal commitment.

Development

scripts/ci_local.ps1 (Windows) and scripts/ci_local.sh (Linux/macOS) run the exact same checks as CI — ruff lint, ruff format check, mypy, pytest. Green locally means green on the PR. Run artifacts (results/), audit outputs (reports/), and non-sample packs (challenge_packs/) are gitignored — all re-creatable from configs.

Documentation

| File | What's in it |
|---|---|
| Finding: are jailbreak benchmarks still worth running? | The write-up: staleness, a cross-dataset quality scorecard, and the multilingual result |
| `METHODOLOGY.md` | Source of truth for every reported number; metric validation; limits |
| `ETHICS.md` | Excluded categories, redaction, disclosure, provider ToS |
| `docs/ROADMAP.md` | The foundry pivot, phase status, and follow-up hardening |
| `CONTRIBUTING.md` | Scope, dev setup, and the ethics rules for adding corpora |
| `CHANGELOG.md` | Release history |
| `reports/samples/` | Committed real-data findings (staleness, defence comparison, data card) |

Citation

@software{redteam_foundry_2026,
  title  = {redteam-foundry: An adversarial benchmark foundry for LLM safety},
  author = {Ross},
  year   = {2026},
  url    = {https://github.com/rosscyking1115/redteam-foundry}
}

Licence

MIT — see LICENSE.

This version

0.2.0

Jul 2, 2026

Download files

Source Distribution

redteam_foundry-0.2.0.tar.gz (296.8 kB)

Uploaded Jul 2, 2026 Source

Built Distribution

redteam_foundry-0.2.0-py3-none-any.whl (104.6 kB)

Uploaded Jul 2, 2026 Python 3

File details

Details for the file redteam_foundry-0.2.0.tar.gz.

File metadata

Download URL: redteam_foundry-0.2.0.tar.gz
Upload date: Jul 2, 2026
Size: 296.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.24 (uv publish)

File hashes

Hashes for redteam_foundry-0.2.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | `515bd7f5e0e9303ae30b0931df5e13490921ac589104cd139eb80585b5a137b0` |
| MD5 | `345abe9450419026d5a726a150ae59f7` |
| BLAKE2b-256 | `935d0a130809db2821fc2afda0824db1d5af45f029c1a8df52dadf42216584d8` |

File details

Details for the file redteam_foundry-0.2.0-py3-none-any.whl.

File metadata

Download URL: redteam_foundry-0.2.0-py3-none-any.whl
Upload date: Jul 2, 2026
Size: 104.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.24 (uv publish)

File hashes

Hashes for redteam_foundry-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c9d0fa7f22a22e5d05379d594ea62c7f8917e7c6824596b31322132ce2f2542e` |
| MD5 | `53953dcb62eb0a49a9c2b23b5eefedde` |
| BLAKE2b-256 | `c2358af71d98c508517132f36468030c803861a7fb0fe7f7283465b6fdf27e09` |
