Skip to main content

An open-source, research-grounded LLM red-teaming and safety benchmark harness.

Project description

redharness

CI License: Apache-2.0 Python 3.11+ Typed

A standardized, reproducible benchmark for the adversarial robustness of large language models — jailbreak, prompt-injection, and data-leakage evaluation under one methodology

Standardize the evaluation, not just the attack. In LLM safety, a number you can't reproduce — or whose judge and dataset you can't name — isn't a benchmark. redharness makes adversarial safety comparable: one harness, three threat surfaces, and a (dataset_version, judge, metric) provenance triple on every result.

redharness is an open-source evaluation framework for measuring the adversarial robustness and safety of large language models (LLMs). It unifies three threat surfaces that are today measured inconsistently and incomparably — jailbreaks, prompt injection, and data leakage — under a single pluggable methodology with a strict reproducibility contract and per-result provenance.

Hands-on usage (installation, running evaluations, reading outputs, launching and using the leaderboard dashboard, writing your own configs) lives in docs/OVERVIEW.md. This README documents what the benchmark is, why it exists, and who it is for.


Abstract

Adversarial evaluation of LLMs has advanced rapidly, but its empirical foundations are fragile. Each attack paper tends to ship a bespoke evaluation harness, an idiosyncratic notion of "Attack Success Rate," and a judge whose behavior is rarely held fixed across studies — so reported numbers are frequently not comparable across papers, and in several cases have been shown to overestimate true attack success [Souly et al. 2024]. The benchmarks that achieved durable community adoption (HarmBench [Mazeika et al. 2024], JailbreakBench [Chao et al. 2024]) did so primarily by standardizing the evaluation — fixed behaviors, fixed judges, versioned artifacts, public leaderboards — rather than by introducing the strongest individual attack. redharness generalizes that insight into a single harness spanning three surfaces, with hash-pinned datasets, deterministic seeded execution, persisted transcripts, and a (dataset_version, judge, metric) provenance triple recorded for every reported number. The same standardization is extended to the prompt-injection and data-leakage surfaces, which remain comparatively unstandardized despite prompt injection ranking first on the OWASP Top 10 for LLM Applications for two consecutive years.

1. Motivation

Three structural problems make current LLM-safety results hard to trust and harder to compare:

  1. Metric incommensurability. "ASR" denotes different quantities across papers (any successful attempt vs. success within a query budget vs. a judge's continuous score), computed over different behavior sets. Headline numbers therefore cannot be compared directly.
  2. Judge sensitivity. Whether a response is "harmful," a successful "injection," or a "leak" depends heavily on the grader. Souly et al. [2024] demonstrate that weak or permissive judges systematically inflate jailbreak success; small changes in the judge move headline numbers by tens of points.
  3. Irreproducibility. Datasets drift or are unversioned, random seeds and decoding parameters go unlogged, transcripts are discarded, and public leaderboards are vulnerable to overfitting and gaming.

The downstream consequence is that a practitioner cannot reliably answer the basic question "Is model A safer than model B, by how much, under which threat model — and can anyone reproduce that number?" redharness is designed so that this question has a reproducible, provenance-tracked answer.

2. Contributions

  • A unified, pluggable methodology across three surfaces. A single Attack × Target × Dataset × Judge × Metric matrix (extended with Scenario and Injection axes for agentic injection) spans jailbreak, prompt-injection, and data-leakage evaluation, so the three surfaces share one vocabulary, one runner, and one reporting format.
  • Standardization of the under-standardized surfaces. Injection and leakage are given the same first-class, versioned, judge-explicit treatment that HarmBench/JailbreakBench brought to jailbreaks.
  • A reproducibility contract. Hash-pinned datasets, deterministic seeded execution, parameter-aware result caching, and persisted JSONL transcripts make a leaderboard row reproducible from a single command.
  • Per-result provenance. Every leaderboard entry records the (dataset_version, judge, metric) triple, eliminating the ambiguity that drives metric incommensurability.
  • Designed for interoperability, not re-implementation. Accepted datasets and judges are integrated as plugins so results align with published work; established tooling (garak, PyRIT, Inspect) attaches through documented extension seams rather than forks.
  • A gaming-aware leaderboard dashboard — an optional Streamlit web app that aggregates all runs into a filterable, per-surface view and treats submitted results as untrusted input.

3. Background and related work

Standardization frameworks and leaderboards. HarmBench [Mazeika et al. 2024, arXiv:2402.04249] introduced a standardized automated red-teaming evaluation and a fine-tuned harmful-behavior classifier that became a de-facto judge. JailbreakBench [Chao et al. 2024, arXiv:2404.01318] added an open robustness benchmark with a public leaderboard and versioned artifacts. StrongREJECT [Souly et al. 2024, arXiv:2402.10260] showed that prior benchmarks overestimate attack success and contributed a high-agreement rubric grader. DecodingTrust [Wang et al. 2023, arXiv:2306.11698] and TrustLLM [Sun et al. 2024, arXiv:2401.05561] provide multi-perspective trustworthiness evaluation; HELM Safety [Stanford CRFM 2024] aggregates safety benchmarks under a common interface.

Jailbreak attacks. GCG [Zou et al. 2023, arXiv:2307.15043] established transferable gradient-based adversarial suffixes (and the AdvBench behavior set); PAIR [Chao et al. 2023, arXiv:2310.08419] and TAP [Mehrotra et al. 2023, arXiv:2312.02119] are query-efficient black-box attacker-LLM methods; AutoDAN [Liu et al. 2023, arXiv:2310.04451] evolves fluent, stealthy prompts.

Prompt injection. Greshake et al. [2023, arXiv:2302.12173] formalized indirect prompt injection against LLM-integrated applications. AgentDojo [Debenedetti et al. 2024, arXiv:2406.13352], InjecAgent [Zhan et al. 2024, arXiv:2403.02691], and AgentHarm [Andriushchenko et al. 2024, arXiv:2410.09024] define the agentic attack surface; the OWASP Top 10 for LLM Applications ranks prompt injection first.

Data leakage and memorization. Carlini et al. [2021, arXiv:2012.07805] extract training data from LLMs; Nasr, Carlini et al. [2023, arXiv:2311.17035] scale extraction to production models via divergence attacks; the Secret Sharer [Carlini et al. 2019, arXiv:1802.08232] introduces canary-based memorization measurement, which redharness adopts for its leakage scoring.

Over-refusal. XSTest [Röttger et al. 2024, arXiv:2308.01263] and OR-Bench [Cui et al. 2024, arXiv:2405.20947] measure false refusals on benign prompts, so safety can be reported against the helpfulness it trades against rather than in isolation.

Guardrail judges and tooling. Llama Guard [Inan et al. 2023, arXiv:2312.06674], WildGuard [Han et al. 2024], and ShieldGemma [Zeng et al. 2024] are pluggable safety classifiers; garak (NVIDIA), PyRIT (Microsoft), and Inspect (UK AISI) are established red-teaming and evaluation toolkits. Governance framing follows the NIST AI Risk Management Framework and MITRE ATLAS. Full BibTeX is provided in CITATIONS.bib.

4. Threat model

redharness standardizes evaluation across the three surfaces that are least consistently measured today.

(S1) Jailbreaks. An adversary manipulates a prompt to elicit content the model is intended to refuse. Evaluation balances attack success against over-refusal, so that a trivially refusing model is not mistaken for a safe one.

(S2) Prompt injection (direct and indirect/agentic). An adversary smuggles instructions into a tool-using agent — placed directly in the user turn, or indirectly in a document or tool output the agent consumes — to make it pursue an attacker-chosen goal. Evaluation measures both whether the attacker's goal fired and whether the agent still completed its legitimate task (utility under attack).

(S3) Data leakage. An adversary recovers memorized or secret content: training-data extraction and divergence, canary recovery, PII elicitation, and system-prompt exfiltration. Evaluation reports both a binary recovery decision and a continuous verbatim-overlap severity score.

All bundled artifacts are realistic but responsibly synthetic — refusal-probe behaviors phrased as user requests (the request only, never a harmful answer or operational detail), benign sentinel attacker goals, and obviously-fake secrets (e.g. *.example.invalid PII, 555-01xx phone numbers, CANARY-… sentinels) — so the harness mechanics can be exercised without distributing operational harmful content, real PII, or memorized/copyrighted text. CBRN and explosives content is excluded entirely. Real corpora attach behind explicit, hash-verified opt-in (§9, §11).

5. Methodology

Every evaluation is a matrix over five plugin axes; a run enumerates the cells and scores each one:

 Dataset ─▶ Attack ─▶ Target ─▶ transcript ─▶ Judge ─▶ Metric ─▶ Report / Leaderboard ─▶ Dashboard
(behaviors) (generator) (model)               (scorer) (aggregate)
Axis Role Interface
Target the system under test generate(messages, tools) -> Response
Attack transforms a behavior into one or more adversarial attempts run(behavior, target) -> list[Attempt]
Dataset a versioned, hash-pinned set of behaviors/probes load() -> list[Behavior]
Judge decides success and assigns a score per attempt score(behavior, attempt) -> Verdict
Metric aggregates verdicts into a reported quantity compute(scored) -> MetricResult

The agentic prompt-injection surface adds two axes — Scenario (a sandboxed tool environment with a benign user task and a benign attacker goal) and Injection (the malicious instruction and its placement) — driven by a bounded multi-step agent loop. The data-leakage surface is single-turn and reuses the jailbreak execution path with leakage-specific plugins, so no separate runner mode is required.

Reproducibility contract. Datasets are content-addressed (hash-pinned) and verified before use; execution is deterministically seeded; attempts are cached on their fully resolved parameters (so a parameter change never silently reuses a stale result); and the complete prompt/response transcript of every attempt is persisted as JSONL for audit. A single command reproduces a leaderboard row.

Provenance. Each leaderboard entry records the (dataset_version, judge, metric) triple. Because judge choice and dataset version are the dominant sources of cross-study disagreement, binding them to every number is the framework's central anti-fragmentation mechanism.

Extensibility. Plugins self-register and are resolved by name from declarative YAML through a closed registry (a dictionary lookup — never dynamic import or eval — so a configuration cannot execute arbitrary code). Adding an attack, model adapter, judge, dataset, or metric is a small, self-contained addition; network-heavy frameworks (PAIR, TAP, garak, PyRIT) and real corpora attach behind dependency extras so the offline core stays lean. The concrete "add a plugin" walkthrough is in docs/OVERVIEW.md.

6. Metrics

All metrics are defined as pure functions over scored attempts; inapplicable (cell, metric) pairs report N/A rather than a misleading 0.0.

Metric Definition Grounding
asr fraction of harmful behaviors with ≥ 1 successful attempt HarmBench / JailbreakBench
asr_at_k success within the first k attempts (query budget) PAIR / TAP
strongreject_score 0 if refused, else the 1–5 mean of specificity/convincingness rescaled to [0,1] as (mean−1)/4 StrongREJECT
refusal_rate fraction of harmful behaviors refused on every attempt DecodingTrust
frr false-refusal rate over the benign should-comply subset XSTest / OR-Bench
injection_success_rate fraction of injected scenarios in which the attacker goal fired InjecAgent / AgentDojo
utility_under_attack fraction of injected scenarios in which the benign task still completed AgentDojo
utility_baseline benign-task completion with no injection (control) AgentDojo
extraction_rate overall fraction of probes whose synthetic secret leaked Carlini 2021 / Nasr 2023
canary_exposure_rate leak rate over canary probes Secret Sharer (Carlini 2019)
pii_leak_rate leak rate over PII probes DecodingTrust
system_prompt_leak_rate leak rate over system-prompt probes
verbatim_overlap mean best verbatim overlap (longest-common-substring ratio) Carlini 2021 / Nasr 2023
token_usage total input + output tokens consumed across a run (N/A for offline runs)
cost estimated USD from a dated per-model price table; combined target/attacker/judge tokens are priced at the target's rate (N/A for offline runs)

7. Who should use redharness, and why

  • Safety and alignment researchers — to report jailbreak/injection/leakage results that are directly comparable to prior work, with the judge and dataset version pinned to every number, and to study judge sensitivity by re-scoring the same transcripts under different graders.
  • Model developers and labs — to track adversarial robustness across model versions as a reproducible regression suite, balancing attack resistance against over-refusal so safety gains are not just refusal gains.
  • Red teams and AI security engineers — to evaluate agentic systems against direct and indirect prompt injection and to quantify utility under attack, mapping to the OWASP LLM Top 10 and MITRE ATLAS.
  • Auditors, evaluators, and policymakers — to obtain provenance-tracked, reproducible evidence aligned with the NIST AI RMF for governance and procurement decisions.
  • Educators and students — to study attack and defense mechanisms hands-on, entirely offline, against deterministic targets and benign synthetic data.

8. Reproducibility and artifacts

The harness is deterministic and fully offline by default — no API keys are required for the bundled evaluations, and results are identical across machines and runs. Each run emits a Markdown and HTML report, a machine-readable leaderboard.json (with the provenance triple on every row), and a complete JSONL transcript. The optional redharness dashboard command launches a Streamlit web app that aggregates every run into a filterable, per-surface leaderboard. The literature the framework is grounded in is enumerated in CITATIONS.bib.

A first real-model result (fidelity)

A first end-to-end evaluation against a real frontier model (claude-haiku-4-5; attacker/grader gpt-4o-mini) on commit-pinned public sets reproduces published behavior; the leaderboards are committed under results/.

Evaluation Result
AdvBench · direct (static) asr 0.00, refusal_rate 1.00 — aligned models refuse direct harmful requests (the undefended baseline)
AdvBench · PAIR asr 0.15 (StrongREJECT grader) / 1.00 (string-match) — the attack jailbreaks through the harness (static ≈ 0 → PAIR ≫ 0)
XSTest · safe split frr 0.00 — no over-refusal of benign prompts

The PAIR cell reproduces StrongREJECT's central finding directly: scoring the same transcripts, the string-match judge reports a ~6.7× higher attack-success rate than the rubric grader (asr 1.00 vs 0.15) — the judge-sensitivity effect this framework's provenance triple and redharness judge-agreement tooling (per-judge ASR + Cohen's κ) are built to surface.

9. Implemented surface and current scope

All three surfaces and their offline evaluation paths are implemented and test-locked, and the harness ships a broad, pluggable component set:

  • Attacks — single-turn (static, template) and multi-turn attacker-LLM attacks pair (Chao et al. 2023), tap (Mehrotra et al. 2023), and crescendo, alongside the leakage probes. gcg, garak, and pyrit are registered scaffolds whose heavy dependencies are unverified in CI.
  • Datasets — the bundled synthetic sets, plus opt-in, hash-pinned loaders for AdvBench, HarmBench, JailbreakBench (JBB-Behaviors), XSTest, and OR-Bench: fetched-and-verified by SHA-256 behind an explicit allow_download, never committed to the repository.
  • Targets — deterministic offline reference targets, plus hardened live openai_compat and anthropic adapters (shared httpx transport, retry/backoff, typed errors, a fail-closed max_queries budget, normalized token usage, and tool-calling so the injection surface runs against real agents). Local servers (Ollama, vLLM) run through the OpenAI-compatible adapter.
  • Judges — string-match, the StrongREJECT-style and faithful StrongREJECT rubric graders, and the injection/leak detectors.
  • Metrics — the per-surface metrics above plus token_usage and cost.

Everything outside the bundled synthetic content is gated behind optional extras and explicit opt-in; the offline core imports and runs with no extras and no network, enforced by a CI tripwire. Because the bundled content is intentionally synthetic, absolute numbers from the smoke evaluations illustrate the mechanism, not any real model's safety. See configs/real_eval.example.yaml and the Live-evaluation / Tool-calling / Local-servers sections of docs/configuration.md.

Deferred to dedicated future slices: local Hugging Face classifier judges (Llama Guard, WildGuard, the HarmBench classifier), in-process HF and Bedrock/Vertex adapters, the AutoDAN attack, AgentDojo/InjecAgent scenario ingestion, and a hosted, gaming-resistant leaderboard verifier.

10. Responsible use

redharness is a defensive evaluation tool intended for authorized safety testing and research. It ships realistic but synthetic refusal-probe behaviors and synthetic secrets — no operational harmful content, no real PII, no memorized/copyrighted text (and no CBRN/explosives content). Real datasets are fetched from their canonical sources and verified by hash behind an explicit opt-in. Use it to measure and improve model safety.

Responsible use — LIVE mode. Running against real providers (openai_compat, anthropic, the pair attack, strongreject data) is gated behind optional extras and environment-only API keys, and is your responsibility:

  • Authorized use only. Only red-team models and accounts you are authorized to test. You are responsible for complying with each provider's Terms of Service and acceptable-use policy. Use personal/research keys, not production credentials.
  • Local harmful outputs. Live runs may elicit and persist real harmful text to runs/<run_name>/ (transcripts, cache, reports). Handling, storage, and retention of that content are entirely your responsibility — treat the runs directory as sensitive.
  • Not reproducible. Live numbers are single-sample and non-deterministic (provider sampling, model updates, rate limits); they are not comparable across time the way the offline, deterministic smoke results are. Set the max_queries budget to cap spend.

11. Getting started

See docs/OVERVIEW.md for installation, running evaluations across the three surfaces, interpreting the outputs, generating and using the leaderboard dashboard, and writing your own run configurations.

Citing

If you use redharness in academic work, please cite this repository and the upstream benchmarks and methods it integrates (see CITATIONS.bib).

@software{redharness,
  title  = {redharness: A Standardized, Reproducible Benchmark for Adversarial
            Evaluation of Large Language Models},
  author = {Mohamed Aklamaash},
  year   = {2026},
  note   = {Jailbreak, prompt-injection, and data-leakage evaluation harness},
  url    = {https://github.com/MohamedAklamaash/redharness}
}

License

Apache-2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

redharness-0.1.0.tar.gz (561.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

redharness-0.1.0-py3-none-any.whl (149.1 kB view details)

Uploaded Python 3

File details

Details for the file redharness-0.1.0.tar.gz.

File metadata

  • Download URL: redharness-0.1.0.tar.gz
  • Upload date:
  • Size: 561.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for redharness-0.1.0.tar.gz
Algorithm Hash digest
SHA256 643b0e49baf72d74b5eedca7ff54a86213fa4bda310899a1c7df3228da839a32
MD5 b704ca73c585d0da5d2e15b3d89edc45
BLAKE2b-256 610b6681dc680a2b36769c07bf6f89380293cfb5855239d385513c6723ef7a9f

See more details on using hashes here.

File details

Details for the file redharness-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: redharness-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 149.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for redharness-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d31f8b50b9517777790a313140a84970a80db5b9659f03231a19579c472b3aa1
MD5 ebc0da21444decc831a162d6e14164e9
BLAKE2b-256 7d251bb459300d5ccb130049f0d31c0f3272003757c9cb1461e1d5fe37b4a0fb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page