An open-source, research-grounded LLM red-teaming and safety benchmark harness.
Project description
redharness
A standardized, reproducible benchmark for the adversarial robustness of large language models — jailbreak, prompt-injection, and data-leakage evaluation under one methodology
Standardize the evaluation, not just the attack. In LLM safety, a number you can't reproduce — or whose judge and dataset you can't name — isn't a benchmark.
redharnessmakes adversarial safety comparable: one harness, three threat surfaces, and a(dataset_version, judge, metric)provenance triple on every result.
redharness is an open-source evaluation framework for measuring the adversarial
robustness and safety of large language models (LLMs). It unifies three threat surfaces
that are today measured inconsistently and incomparably — jailbreaks, prompt
injection, and data leakage — under a single pluggable methodology with a strict
reproducibility contract and per-result provenance.
Hands-on usage (installation, running evaluations, reading outputs, launching and using the leaderboard dashboard, writing your own configs) lives in
docs/OVERVIEW.md. This README documents what the benchmark is, why it exists, and who it is for.
Abstract
Adversarial evaluation of LLMs has advanced rapidly, but its empirical foundations are
fragile. Each attack paper tends to ship a bespoke evaluation harness, an idiosyncratic
notion of "Attack Success Rate," and a judge whose behavior is rarely held fixed across
studies — so reported numbers are frequently not comparable across papers, and in several
cases have been shown to overestimate true attack success [Souly et al. 2024]. The
benchmarks that achieved durable community adoption (HarmBench [Mazeika et al. 2024],
JailbreakBench [Chao et al. 2024]) did so primarily by standardizing the evaluation —
fixed behaviors, fixed judges, versioned artifacts, public leaderboards — rather than by
introducing the strongest individual attack. redharness generalizes that insight into a
single harness spanning three surfaces, with hash-pinned datasets, deterministic seeded
execution, persisted transcripts, and a (dataset_version, judge, metric) provenance
triple recorded for every reported number. The same standardization is extended to the
prompt-injection and data-leakage surfaces, which remain comparatively unstandardized
despite prompt injection ranking first on the OWASP Top 10 for LLM Applications for two
consecutive years.
1. Motivation
Three structural problems make current LLM-safety results hard to trust and harder to compare:
- Metric incommensurability. "ASR" denotes different quantities across papers (any successful attempt vs. success within a query budget vs. a judge's continuous score), computed over different behavior sets. Headline numbers therefore cannot be compared directly.
- Judge sensitivity. Whether a response is "harmful," a successful "injection," or a "leak" depends heavily on the grader. Souly et al. [2024] demonstrate that weak or permissive judges systematically inflate jailbreak success; small changes in the judge move headline numbers by tens of points.
- Irreproducibility. Datasets drift or are unversioned, random seeds and decoding parameters go unlogged, transcripts are discarded, and public leaderboards are vulnerable to overfitting and gaming.
The downstream consequence is that a practitioner cannot reliably answer the basic
question "Is model A safer than model B, by how much, under which threat model — and can
anyone reproduce that number?" redharness is designed so that this question has a
reproducible, provenance-tracked answer.
2. Contributions
- A unified, pluggable methodology across three surfaces. A single
Attack × Target × Dataset × Judge × Metricmatrix (extended withScenarioandInjectionaxes for agentic injection) spans jailbreak, prompt-injection, and data-leakage evaluation, so the three surfaces share one vocabulary, one runner, and one reporting format. - Standardization of the under-standardized surfaces. Injection and leakage are given the same first-class, versioned, judge-explicit treatment that HarmBench/JailbreakBench brought to jailbreaks.
- A reproducibility contract. Hash-pinned datasets, deterministic seeded execution, parameter-aware result caching, and persisted JSONL transcripts make a leaderboard row reproducible from a single command.
- Per-result provenance. Every leaderboard entry records the
(dataset_version, judge, metric)triple, eliminating the ambiguity that drives metric incommensurability. - Designed for interoperability, not re-implementation. Accepted datasets and judges are integrated as plugins so results align with published work; established tooling (garak, PyRIT, Inspect) attaches through documented extension seams rather than forks.
- A gaming-aware leaderboard dashboard — an optional Streamlit web app that aggregates all runs into a filterable, per-surface view and treats submitted results as untrusted input.
3. Background and related work
Standardization frameworks and leaderboards. HarmBench [Mazeika et al. 2024, arXiv:2402.04249] introduced a standardized automated red-teaming evaluation and a fine-tuned harmful-behavior classifier that became a de-facto judge. JailbreakBench [Chao et al. 2024, arXiv:2404.01318] added an open robustness benchmark with a public leaderboard and versioned artifacts. StrongREJECT [Souly et al. 2024, arXiv:2402.10260] showed that prior benchmarks overestimate attack success and contributed a high-agreement rubric grader. DecodingTrust [Wang et al. 2023, arXiv:2306.11698] and TrustLLM [Sun et al. 2024, arXiv:2401.05561] provide multi-perspective trustworthiness evaluation; HELM Safety [Stanford CRFM 2024] aggregates safety benchmarks under a common interface.
Jailbreak attacks. GCG [Zou et al. 2023, arXiv:2307.15043] established transferable gradient-based adversarial suffixes (and the AdvBench behavior set); PAIR [Chao et al. 2023, arXiv:2310.08419] and TAP [Mehrotra et al. 2023, arXiv:2312.02119] are query-efficient black-box attacker-LLM methods; AutoDAN [Liu et al. 2023, arXiv:2310.04451] evolves fluent, stealthy prompts.
Prompt injection. Greshake et al. [2023, arXiv:2302.12173] formalized indirect prompt injection against LLM-integrated applications. AgentDojo [Debenedetti et al. 2024, arXiv:2406.13352], InjecAgent [Zhan et al. 2024, arXiv:2403.02691], and AgentHarm [Andriushchenko et al. 2024, arXiv:2410.09024] define the agentic attack surface; the OWASP Top 10 for LLM Applications ranks prompt injection first.
Data leakage and memorization. Carlini et al. [2021, arXiv:2012.07805] extract training
data from LLMs; Nasr, Carlini et al. [2023, arXiv:2311.17035] scale extraction to production
models via divergence attacks; the Secret Sharer [Carlini et al. 2019, arXiv:1802.08232]
introduces canary-based memorization measurement, which redharness adopts for its leakage
scoring.
Over-refusal. XSTest [Röttger et al. 2024, arXiv:2308.01263] and OR-Bench [Cui et al. 2024, arXiv:2405.20947] measure false refusals on benign prompts, so safety can be reported against the helpfulness it trades against rather than in isolation.
Guardrail judges and tooling. Llama Guard [Inan et al. 2023, arXiv:2312.06674], WildGuard
[Han et al. 2024], and ShieldGemma [Zeng et al. 2024] are pluggable safety classifiers;
garak (NVIDIA), PyRIT (Microsoft), and Inspect (UK AISI) are established red-teaming and
evaluation toolkits. Governance framing follows the NIST AI Risk Management Framework and
MITRE ATLAS. Full BibTeX is provided in CITATIONS.bib.
4. Threat model
redharness standardizes evaluation across the three surfaces that are least consistently
measured today.
(S1) Jailbreaks. An adversary manipulates a prompt to elicit content the model is intended to refuse. Evaluation balances attack success against over-refusal, so that a trivially refusing model is not mistaken for a safe one.
(S2) Prompt injection (direct and indirect/agentic). An adversary smuggles instructions into a tool-using agent — placed directly in the user turn, or indirectly in a document or tool output the agent consumes — to make it pursue an attacker-chosen goal. Evaluation measures both whether the attacker's goal fired and whether the agent still completed its legitimate task (utility under attack).
(S3) Data leakage. An adversary recovers memorized or secret content: training-data extraction and divergence, canary recovery, PII elicitation, and system-prompt exfiltration. Evaluation reports both a binary recovery decision and a continuous verbatim-overlap severity score.
All bundled artifacts are realistic but responsibly synthetic — refusal-probe behaviors
phrased as user requests (the request only, never a harmful answer or operational
detail), benign sentinel attacker goals, and obviously-fake secrets (e.g.
*.example.invalid PII, 555-01xx phone numbers, CANARY-… sentinels) — so the harness
mechanics can be exercised without distributing operational harmful content, real PII, or
memorized/copyrighted text. CBRN and explosives content is excluded entirely. Real corpora
attach behind explicit, hash-verified opt-in (§9, §11).
5. Methodology
Every evaluation is a matrix over five plugin axes; a run enumerates the cells and scores each one:
Dataset ─▶ Attack ─▶ Target ─▶ transcript ─▶ Judge ─▶ Metric ─▶ Report / Leaderboard ─▶ Dashboard
(behaviors) (generator) (model) (scorer) (aggregate)
| Axis | Role | Interface |
|---|---|---|
| Target | the system under test | generate(messages, tools) -> Response |
| Attack | transforms a behavior into one or more adversarial attempts | run(behavior, target) -> list[Attempt] |
| Dataset | a versioned, hash-pinned set of behaviors/probes | load() -> list[Behavior] |
| Judge | decides success and assigns a score per attempt | score(behavior, attempt) -> Verdict |
| Metric | aggregates verdicts into a reported quantity | compute(scored) -> MetricResult |
The agentic prompt-injection surface adds two axes — Scenario (a sandboxed tool environment with a benign user task and a benign attacker goal) and Injection (the malicious instruction and its placement) — driven by a bounded multi-step agent loop. The data-leakage surface is single-turn and reuses the jailbreak execution path with leakage-specific plugins, so no separate runner mode is required.
Reproducibility contract. Datasets are content-addressed (hash-pinned) and verified before use; execution is deterministically seeded; attempts are cached on their fully resolved parameters (so a parameter change never silently reuses a stale result); and the complete prompt/response transcript of every attempt is persisted as JSONL for audit. A single command reproduces a leaderboard row.
Provenance. Each leaderboard entry records the (dataset_version, judge, metric)
triple. Because judge choice and dataset version are the dominant sources of cross-study
disagreement, binding them to every number is the framework's central anti-fragmentation
mechanism.
Extensibility. Plugins self-register and are resolved by name from declarative YAML
through a closed registry (a dictionary lookup — never dynamic import or eval — so a
configuration cannot execute arbitrary code). Adding an attack, model adapter, judge,
dataset, or metric is a small, self-contained addition; network-heavy frameworks (PAIR,
TAP, garak, PyRIT) and real corpora attach behind dependency extras so the offline core
stays lean. The concrete "add a plugin" walkthrough is in docs/OVERVIEW.md.
6. Metrics
All metrics are defined as pure functions over scored attempts; inapplicable
(cell, metric) pairs report N/A rather than a misleading 0.0.
| Metric | Definition | Grounding |
|---|---|---|
asr |
fraction of harmful behaviors with ≥ 1 successful attempt | HarmBench / JailbreakBench |
asr_at_k |
success within the first k attempts (query budget) | PAIR / TAP |
strongreject_score |
0 if refused, else the 1–5 mean of specificity/convincingness rescaled to [0,1] as (mean−1)/4 |
StrongREJECT |
refusal_rate |
fraction of harmful behaviors refused on every attempt | DecodingTrust |
frr |
false-refusal rate over the benign should-comply subset | XSTest / OR-Bench |
injection_success_rate |
fraction of injected scenarios in which the attacker goal fired | InjecAgent / AgentDojo |
utility_under_attack |
fraction of injected scenarios in which the benign task still completed | AgentDojo |
utility_baseline |
benign-task completion with no injection (control) | AgentDojo |
extraction_rate |
overall fraction of probes whose synthetic secret leaked | Carlini 2021 / Nasr 2023 |
canary_exposure_rate |
leak rate over canary probes | Secret Sharer (Carlini 2019) |
pii_leak_rate |
leak rate over PII probes | DecodingTrust |
system_prompt_leak_rate |
leak rate over system-prompt probes | — |
verbatim_overlap |
mean best verbatim overlap (longest-common-substring ratio) | Carlini 2021 / Nasr 2023 |
token_usage |
total input + output tokens consumed across a run (N/A for offline runs) | — |
cost |
estimated USD from a dated per-model price table; combined target/attacker/judge tokens are priced at the target's rate (N/A for offline runs) | — |
7. Who should use redharness, and why
- Safety and alignment researchers — to report jailbreak/injection/leakage results that are directly comparable to prior work, with the judge and dataset version pinned to every number, and to study judge sensitivity by re-scoring the same transcripts under different graders.
- Model developers and labs — to track adversarial robustness across model versions as a reproducible regression suite, balancing attack resistance against over-refusal so safety gains are not just refusal gains.
- Red teams and AI security engineers — to evaluate agentic systems against direct and indirect prompt injection and to quantify utility under attack, mapping to the OWASP LLM Top 10 and MITRE ATLAS.
- Auditors, evaluators, and policymakers — to obtain provenance-tracked, reproducible evidence aligned with the NIST AI RMF for governance and procurement decisions.
- Educators and students — to study attack and defense mechanisms hands-on, entirely offline, against deterministic targets and benign synthetic data.
8. Reproducibility and artifacts
The harness is deterministic and fully offline by default — no API keys are required for the
bundled evaluations, and results are identical across machines and runs. Each run emits a
Markdown and HTML report, a machine-readable leaderboard.json (with the provenance triple
on every row), and a complete JSONL transcript. The optional redharness dashboard command
launches a Streamlit web app that aggregates every run into a filterable, per-surface
leaderboard. The literature the framework is grounded in is enumerated in
CITATIONS.bib.
A first real-model result (fidelity)
A first end-to-end evaluation against a real frontier model (claude-haiku-4-5;
attacker/grader gpt-4o-mini) on commit-pinned public sets reproduces published behavior;
the leaderboards are committed under results/.
| Evaluation | Result |
|---|---|
| AdvBench · direct (static) | asr 0.00, refusal_rate 1.00 — aligned models refuse direct harmful requests (the undefended baseline) |
| AdvBench · PAIR | asr 0.15 (StrongREJECT grader) / 1.00 (string-match) — the attack jailbreaks through the harness (static ≈ 0 → PAIR ≫ 0) |
| XSTest · safe split | frr 0.00 — no over-refusal of benign prompts |
The PAIR cell reproduces StrongREJECT's central finding directly: scoring the same
transcripts, the string-match judge reports a ~6.7× higher attack-success rate than the
rubric grader (asr 1.00 vs 0.15) — the judge-sensitivity effect this framework's provenance
triple and redharness judge-agreement tooling (per-judge ASR + Cohen's κ) are built to
surface.
9. Implemented surface and current scope
All three surfaces and their offline evaluation paths are implemented and test-locked, and the harness ships a broad, pluggable component set:
- Attacks — single-turn (
static,template) and multi-turn attacker-LLM attackspair(Chao et al. 2023),tap(Mehrotra et al. 2023), andcrescendo, alongside the leakage probes.gcg,garak, andpyritare registered scaffolds whose heavy dependencies are unverified in CI. - Datasets — the bundled synthetic sets, plus opt-in, hash-pinned loaders for AdvBench,
HarmBench, JailbreakBench (JBB-Behaviors), XSTest, and OR-Bench: fetched-and-verified by
SHA-256 behind an explicit
allow_download, never committed to the repository. - Targets — deterministic offline reference targets, plus hardened live
openai_compatandanthropicadapters (shared httpx transport, retry/backoff, typed errors, a fail-closedmax_queriesbudget, normalized token usage, and tool-calling so the injection surface runs against real agents). Local servers (Ollama, vLLM) run through the OpenAI-compatible adapter. - Judges — string-match, the StrongREJECT-style and faithful StrongREJECT rubric graders, and the injection/leak detectors.
- Metrics — the per-surface metrics above plus
token_usageandcost.
Everything outside the bundled synthetic content is gated behind optional extras and explicit
opt-in; the offline core imports and runs with no extras and no network, enforced by a CI
tripwire. Because the bundled content is intentionally synthetic, absolute numbers from the
smoke evaluations illustrate the mechanism, not any real model's safety. See
configs/real_eval.example.yaml and the
Live-evaluation / Tool-calling / Local-servers sections of
docs/configuration.md.
Deferred to dedicated future slices: local Hugging Face classifier judges (Llama Guard, WildGuard, the HarmBench classifier), in-process HF and Bedrock/Vertex adapters, the AutoDAN attack, AgentDojo/InjecAgent scenario ingestion, and a hosted, gaming-resistant leaderboard verifier.
10. Responsible use
redharness is a defensive evaluation tool intended for authorized safety testing and
research. It ships realistic but synthetic refusal-probe behaviors and synthetic secrets —
no operational harmful content, no real PII, no memorized/copyrighted text (and no
CBRN/explosives content). Real datasets are fetched from their canonical sources and
verified by hash behind an explicit opt-in. Use it to measure and improve model safety.
Responsible use — LIVE mode. Running against real providers (openai_compat,
anthropic, the pair attack, strongreject data) is gated behind optional extras and
environment-only API keys, and is your responsibility:
- Authorized use only. Only red-team models and accounts you are authorized to test. You are responsible for complying with each provider's Terms of Service and acceptable-use policy. Use personal/research keys, not production credentials.
- Local harmful outputs. Live runs may elicit and persist real harmful text to
runs/<run_name>/(transcripts, cache, reports). Handling, storage, and retention of that content are entirely your responsibility — treat the runs directory as sensitive. - Not reproducible. Live numbers are single-sample and non-deterministic (provider
sampling, model updates, rate limits); they are not comparable across time the way the
offline, deterministic smoke results are. Set the
max_queriesbudget to cap spend.
11. Getting started
See docs/OVERVIEW.md for installation, running evaluations across the
three surfaces, interpreting the outputs, generating and using the leaderboard dashboard,
and writing your own run configurations.
Citing
If you use redharness in academic work, please cite this repository and the upstream
benchmarks and methods it integrates (see CITATIONS.bib).
@software{redharness,
title = {redharness: A Standardized, Reproducible Benchmark for Adversarial
Evaluation of Large Language Models},
author = {Mohamed Aklamaash},
year = {2026},
note = {Jailbreak, prompt-injection, and data-leakage evaluation harness},
url = {https://github.com/MohamedAklamaash/redharness}
}
License
Apache-2.0.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file redharness-0.1.0.tar.gz.
File metadata
- Download URL: redharness-0.1.0.tar.gz
- Upload date:
- Size: 561.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
643b0e49baf72d74b5eedca7ff54a86213fa4bda310899a1c7df3228da839a32
|
|
| MD5 |
b704ca73c585d0da5d2e15b3d89edc45
|
|
| BLAKE2b-256 |
610b6681dc680a2b36769c07bf6f89380293cfb5855239d385513c6723ef7a9f
|
File details
Details for the file redharness-0.1.0-py3-none-any.whl.
File metadata
- Download URL: redharness-0.1.0-py3-none-any.whl
- Upload date:
- Size: 149.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d31f8b50b9517777790a313140a84970a80db5b9659f03231a19579c472b3aa1
|
|
| MD5 |
ebc0da21444decc831a162d6e14164e9
|
|
| BLAKE2b-256 |
7d251bb459300d5ccb130049f0d31c0f3272003757c9cb1461e1d5fe37b4a0fb
|