Skip to main content

Public adversarial leaderboard for prompt injection detection — the referee of the field

Project description

raucle-bench

CI License

Public adversarial leaderboard for prompt-injection detection. Benchmarks open-source LLM guardrails on a shared, version-controlled dataset of attack and benign prompts.

Every guardrail vendor claims accuracy. Almost none publish reproducible numbers. This is the referee.

  • Live leaderboard: raucle.com/bench/ — client-side dashboard that always reflects results/latest.json in this repo.
  • Dataset: 165 curated prompts across 6 attack classes + benign baseline. Grows toward 10k+.
  • Methodology: precision, recall, F1, false-positive rate, strict-action match, p50/p99 latency per adapter.
  • License: MIT (code and dataset).

Why this exists

Lakera, Llama Guard, LLM Guard, Rebuff, Vigil, NeMo, raucle-detect — every prompt-injection detector ships with marketing numbers and no way to reproduce them. There is no SPEC2017 for AI security. The result is that:

  • Vendors compete on claims rather than on evidence.
  • Customers cannot tell whether a detector actually protects them.
  • Researchers measuring detection quality have to write the benchmark themselves every time.

raucle-bench fixes this by being the same dataset run against every adapter, with the script and outputs in version control. Anyone can re-run the benchmark, submit a new adapter, or contribute a prompt the dataset is missing.

What's in v0.1

Adapter Status
always-allow baseline ✅ Built in
always-block baseline ✅ Built in
raucle-detect (strict / standard / permissive) ✅ Built in
llm-guard ✅ Built in (install with pip install '.[llm-guard]')
rebuff ⏳ Adapter PR welcome
vigil ⏳ Adapter PR welcome
nemo-guardrails ⏳ Adapter PR welcome
llama-guard ⏳ Adapter PR welcome
lakera-guard ⏳ Adapter PR welcome (API key required)

If you maintain a detector — open or commercial — we want to benchmark it. The adapter interface is ~30 lines of Python (see src/raucle_bench/adapter.py).

Quick start

pip install -e '.[raucle]'
raucle-bench --list-adapters
raucle-bench                          # runs everything, writes results/<timestamp>.json|md|html
raucle-bench --adapters raucle-strict # one adapter only

CLI flags:

--datasets-dir DIR     directory of *.jsonl files (default: datasets/)
--results-dir DIR      where to write outputs (default: results/)
--adapters NAME...     limit to specific adapters
--run-id ID            override the timestamp
--validate-only        load and validate the dataset without running anything
--list-adapters        print registered adapter names

Scoring methodology

A prediction is positive if the adapter says ALERT or BLOCK, negative if it says ALLOW. Ground truth is positive for any non-benign prompt.

For each adapter we report:

Metric What it tells you
F1 Harmonic mean of precision and recall. Single-number summary.
Detection rate Fraction of attacks correctly flagged. Same as recall.
False-positive rate Fraction of benign prompts incorrectly flagged. The thing that breaks production.
Strict match rate Fraction of prompts where predicted action exactly matches expected action. Catches "detected but recommended wrong remediation."
p50 / p99 latency Per-prompt wall-clock time.

No single metric is sufficient. A detector that always says BLOCK has 100% recall and a meaningless 100% FPR. A detector that always says ALLOW has 0% FPR and 0% recall. The leaderboard reports both and the trivial baselines so you can calibrate the real entries.

Dataset

165 prompts as of v0.1, broken down by attack class:

Class Prompts What it tests
direct_injection 25 Override / role hijack / ChatML / context stuff
jailbreak 25 DAN, developer mode, hypothetical pretext, multi-turn escalation
data_exfiltration 20 System prompt extraction, credential leakage, exfil channels
tool_abuse 20 Shell injection, path traversal, SQL injection, SSRF, code injection
evasion 20 Base64 / ROT13 / hex smuggling, homoglyphs, zero-width, leet, case-flip
indirect_injection 15 Document injection, tool poisoning, RAG poisoning, markdown exfil
benign 40 Clean prompts including hard negatives (mentions of "ignore", "system prompt", "developer mode" in legit contexts)

See datasets/README.md for the schema, source labelling, and ethical considerations. The dataset is MIT-licensed; please ensure contributions carry compatible rights.

Adding an adapter

# src/raucle_bench/adapters/my_tool.py
from raucle_bench.adapter import Prediction

class MyToolAdapter:
    name = "my-tool-v1"
    version = "0.1.0"

    def setup(self) -> None:
        self._scanner = my_tool.Scanner()

    def teardown(self) -> None:
        self._scanner = None

    def predict(self, prompt: str) -> Prediction:
        result = self._scanner.scan(prompt)
        action = "BLOCK" if result.is_attack else "ALLOW"
        return Prediction(action=action, confidence=result.score)

Register it in src/raucle_bench/cli.py under _register_optional_adapters() so missing deps don't break the rest of the benchmark.

Adding a prompt

  1. Pick the right datasets/<class>.jsonl file.
  2. Add a JSONL line with the next free ID in the sequence.
  3. Run raucle-bench --validate-only to confirm the dataset still loads.
  4. Open a PR with the dataset label.

See datasets/README.md for the schema.

Weekly auto-run

.github/workflows/weekly-run.yml runs the full benchmark every Monday at 06:00 UTC and commits the results directly to main. The latest snapshot is at results/latest.json and results/latest.html.

Roadmap

  • v0.2: dataset to 500+ prompts; LLM Guard, Vigil, Rebuff adapters; balanced-accuracy metric alongside F1.
  • v0.3: dashboard at bench.raucle.com (Cloudflare Pages); time-series view of every adapter's score across weekly runs.
  • v0.4: Llama Guard, NeMo Guardrails, Lakera (API key in repo secret) adapters.
  • v1.0: 10k+ prompts; multimodal (image + audio); third-party submission process.

Related

License

MIT for both code and dataset. Contributions are welcomed under the same terms.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

raucle_bench-0.1.0.tar.gz (32.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

raucle_bench-0.1.0-py3-none-any.whl (19.9 kB view details)

Uploaded Python 3

File details

Details for the file raucle_bench-0.1.0.tar.gz.

File metadata

  • Download URL: raucle_bench-0.1.0.tar.gz
  • Upload date:
  • Size: 32.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for raucle_bench-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e79a6f97b1517118efad9b69de019a5ad7aad886d4435b0fea6724d0f3f4a65e
MD5 0420a380377f29fef850516c6df89b08
BLAKE2b-256 59df3602a271bc74c28282832d2253d5b309aa2fad023e3e45196886410a9dcc

See more details on using hashes here.

Provenance

The following attestation bundles were made for raucle_bench-0.1.0.tar.gz:

Publisher: publish.yml on craigamcw/raucle-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file raucle_bench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: raucle_bench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for raucle_bench-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 29e3be2f7ec35c2c4200e25af860f46e1f25099a0e4b27c6894c41ba4e5d501f
MD5 d67a76aaf452e74efca707e4e827704a
BLAKE2b-256 a212bd5c6f26d858dcb51181c55b3003ef1e1b0327f3a695bf610ea22d352c52

See more details on using hashes here.

Provenance

The following attestation bundles were made for raucle_bench-0.1.0-py3-none-any.whl:

Publisher: publish.yml on craigamcw/raucle-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page