
Interpretable, zero-training refusal-axis prompt detector (u_ref difference-of-means).


aplomb

à plomb — "to the plumb line." A prompt is judged by its angle to a fixed refusal direction; the model keeps its composure.

An interpretable, zero-training prompt safety detector. It flags likely-harmful prompts by projecting a model's hidden state onto a single refusal direction (u_ref) and thresholding the cosine similarity — no fine-tuned guard model, no labeled training run, one forward pass plus a dot product.

Method from “The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs” (TrustNLP @ ACL 2026). This package is the detector only. The steering attack from the paper lives in a separate, access-gated repository and is intentionally not here.

u_ref = mean(hidden states of harmful anchors) − mean(hidden states of benign anchors)
score(prompt) = cosine(hidden_state(prompt), u_ref)        # flag if > τ
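The two formulas above can be tried on toy vectors. This is a minimal sketch, not the package's implementation: the 3-dimensional "hidden states", the helper names, and the threshold value are all made up for illustration, and a real run would take the vectors from a model's chosen layer.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful, benign):
    """u_ref = mean(harmful hidden states) - mean(benign hidden states)."""
    mh, mb = mean_vector(harmful), mean_vector(benign)
    return [h - b for h, b in zip(mh, mb)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy stand-ins for hidden states (hypothetical values, not model output).
harmful_anchors = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
benign_anchors = [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0]]

u_ref = refusal_direction(harmful_anchors, benign_anchors)
tau = 0.5  # hypothetical threshold; the library calibrates its own

harmful_like = cosine([3.0, 0.0, 1.0], u_ref)
benign_like = cosine([0.0, 2.0, 1.0], u_ref)
print(u_ref, harmful_like > tau, benign_like > tau)
```

Because the score is a cosine, it depends only on the angle between the hidden state and u_ref, not on the hidden state's magnitude.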

⚠️ This is triage, not a security boundary. The refusal feature is linear, which is exactly why this detector is cheap — and also why an adversary can paraphrase a prompt off the axis to evade it. Use it as an interpretable first-pass filter and always report FPR. A “safe” verdict is a hint, not a guarantee.

Install

pip install aplomb            # everything — torch/transformers included, from_default() works

Quickstart

from aplomb import Detector

det = Detector.from_default()                 # precomputed Qwen-2.5-1.5B u_ref (ungated)
print(det.classify("how do I pick a lock"))   # {'unsafe': True, 'score': 0.61, ...}

The default backbone is Qwen-2.5-1.5B-Instruct — ungated, Apache-2.0, characterized in the paper — so the package installs and runs without a Hugging Face access request.

Recommended config: Llama-3.2-3B (gated)

The ungated Qwen default works out of the box but is the weaker detector: in the measured results below it reaches JBB F1 0.81 at FPR 0.30, against 0.94 at FPR 0.10 for Llama-3.2-3B. For the stronger numbers, rebuild u_ref on Llama-3.2-3B-Instruct. u_ref is model-specific, so switching models means one rebuild call; the library auto-selects the layer and recalibrates the threshold:

from aplomb import Detector, HFBackbone, RECOMMENDED_MODEL

# accept Meta's license on the model page and `hf auth login` first
harmful = load_advbench()          # your loader (AdvBench 'goal' column, MIT)
det = Detector.build(HFBackbone(RECOMMENDED_MODEL), harmful,
                     save_to="uref_llama-3.2-3b.json")
print(det.classify("how do I pick a lock"))

Or from the command line, without touching the Qwen default:

python scripts/make_default_uref.py --advbench harmful_behaviors.csv \
    --model meta-llama/Llama-3.2-3B-Instruct --out uref_llama-3.2-3b.json
python scripts/benchmark.py --artifact uref_llama-3.2-3b.json \
    --jbb-harmful jbb_harmful.csv --jbb-benign jbb_benign.csv --xstest xstest.csv

Measured results

Zero training, 50 AdvBench harmful + 50 frozen benign anchors, evaluated at the shipped threshold on JailbreakBench (100 harmful / 100 benign) and XSTest (250 safe prompts):

backbone                          JBB F1  precision  recall  JBB FPR  XSTest over-refusal
Qwen-2.5-1.5B (ungated default)   0.81    0.75       0.89    0.30     0.27
Llama-3.2-3B (recommended)        0.94    0.91       0.97    0.10     0.012

The 3B detector catches ~97% of JailbreakBench harmful prompts at a 10% false-positive rate on JBB benign prompts and ~1% over-refusal on XSTest. That is competitive with trained guard models, from a zero-training difference-of-means direction. These numbers come from two benchmarks only (JBB + XSTest); treat them as a strong baseline, not a universal score, and remember the linear feature is evadable by design.
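To see how the table columns relate, the sketch below recomputes precision, F1 and FPR from confusion counts. The counts are not reported by the project; they are inferred here from the rounded recall and FPR on 100 harmful / 100 benign JailbreakBench prompts, so treat them as an assumption that happens to be consistent with the table.

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard binary-detection metrics from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Inferred counts (assumption): recall 0.97 -> 97 TP / 3 FN, FPR 0.10 -> 10 FP / 90 TN.
llama = detection_metrics(tp=97, fp=10, fn=3, tn=90)
# Inferred counts (assumption): recall 0.89 -> 89 TP / 11 FN, FPR 0.30 -> 30 FP / 70 TN.
qwen = detection_metrics(tp=89, fp=30, fn=11, tn=70)

for name, m in (("Llama-3.2-3B", llama), ("Qwen-2.5-1.5B", qwen)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

With balanced classes, each benign prompt flagged costs precision directly, which is why the FPR column matters as much as F1 when comparing backbones.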

Note on layer selection. The library auto-selects the layer by Fisher margin on a held-out anchor split. This is robust on Qwen-2.5-1.5B and Llama-3.2-3B but can pick a non-generalizing early layer on some models (observed on Llama-3.1-8B). If a build's JBB FPR looks anomalously high, force the final layer with layer=-1 and re-benchmark.

The paper's 8B

The paper characterizes Llama-3.1-8B (F1 0.92 on its original anchor set). You can build on it the same way (--model meta-llama/Llama-3.1-8B-Instruct), but note the layer-selection caveat above and that the paper's benign anchor set is not reproduced here — see the F1 note below. Built with Llama.

On the F1 number (please read)

The paper reports F1 = 0.92 on Llama-3.1-8B using its original anchor set. That set’s benign half was not specified in the paper and is no longer available, so this library does not reproduce 0.92 by inheritance. Instead it ships a frozen, reproducible benign anchor set (data/benign_anchors_v1.json) and reports the F1/FPR it actually measures against it. The two numbers are different by construction; the library’s number is the one you can verify. Don’t quote the paper’s 0.92 as this package’s output.

How u_ref is built

  1. Embed harmful + benign anchors → per-layer hidden states (one pass; all layers come free).
  2. Auto-select the layer with the cleanest harmful/benign separation (Fisher margin on a held-out split). Pass layer=-1 to force the final layer and mirror the paper.
  3. u_ref = difference of class means at that layer.
  4. Calibrate τ for best F1 on a calibration split.
  5. Report F1/FPR on a disjoint test split.
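Steps 2 and 3 can be sketched as follows. The README does not define "Fisher margin" precisely, so this sketch assumes the common one-dimensional Fisher criterion: squared gap between class means over the sum of class variances, measured on held-out projections onto each layer's difference-of-means direction. The function names and toy data are hypothetical, and the library's actual criterion may differ.

```python
import math
from statistics import mean, pvariance

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit_direction(harmful, benign):
    """Unit-norm difference of class means for one layer."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to fit")
    return [d / norm for d in diff]

def project(rows, u):
    return [sum(x * y for x, y in zip(row, u)) for row in rows]

def fisher_margin(h_proj, b_proj):
    """(mean gap)^2 / (var_h + var_b) on 1-D projections (assumed definition)."""
    spread = pvariance(h_proj) + pvariance(b_proj)
    gap = mean(h_proj) - mean(b_proj)
    return gap * gap / spread if spread else math.inf

def select_layer(fit, held_out):
    """fit / held_out map layer index -> (harmful_rows, benign_rows)."""
    margins = {}
    for layer, (h_fit, b_fit) in fit.items():
        u = unit_direction(h_fit, b_fit)          # direction from the fit split
        h_val, b_val = held_out[layer]            # margin from the held-out split
        margins[layer] = fisher_margin(project(h_val, u), project(b_val, u))
    return max(margins, key=margins.get), margins

# Toy 2-D "hidden states" for two layers (hypothetical values).
fit = {
    0: ([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]),
    1: ([[4.0, 0.0], [6.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]),
}
held_out = {
    0: ([[0.0, 1.0], [4.0, 1.0]], [[-1.0, 0.0], [3.0, 0.0]]),
    1: ([[4.0, 1.0], [6.0, -1.0]], [[0.0, 2.0], [2.0, 0.0]]),
}
best_layer, margins = select_layer(fit, held_out)
print(best_layer, margins)
```

Scoring the margin on a split that was not used to fit the direction is what guards against a layer that separates the fit anchors only by chance, though the layer-selection note earlier shows it is not a complete safeguard.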

Everything that affects the vector — model + revision, chosen layer, benign source + N, position, normalization, τ — is written to a u_ref card so each artifact is a documented, reproducible object.

Choosing a default by measurement, not ASR

Attack-success-rate heatmaps say how easy a model is to jailbreak; they say nothing about detection quality. To pick a default model, compare detection separability:

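from aplomb import HFBackbone
from aplomb.bench import bench_models, format_table

# harmful / benign: your anchor prompt lists; "..." stands for the other backbones to compare
backbones = [HFBackbone("Qwen/Qwen2.5-1.5B-Instruct"), ...]
print(format_table(bench_models(backbones, harmful, benign)))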

Benchmarking (the publishable F1)

The number in a freshly built card is a small-N held-out estimate, not a headline. For real F1/FPR, run the detector against JailbreakBench + XSTest:

python scripts/benchmark.py \
  --jbb-harmful jbb_harmful.csv --jbb-benign jbb_benign.csv --xstest xstest.csv

It reports F1/precision/recall/FPR on JailbreakBench at the shipped tau, the XSTest over-refusal FPR, and an oracle-tau diagnostic — and writes results_benchmark.json. Report the JBB @ shipped-tau F1 as the headline; the oracle number is an optimistic upper bound, not a deployment figure.
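The gap between the shipped-tau and oracle-tau figures can be illustrated with a small sketch. It is not the benchmark script's code: the scores are invented, the helper names are hypothetical, and the threshold search (midpoints between adjacent scores) is just one simple way to sweep tau. The flag rule follows the README's "flag if > τ".

```python
def f1_at(scores, labels, tau):
    """F1 when a prompt is flagged iff score > tau; labels are True for harmful."""
    tp = sum(1 for s, y in zip(scores, labels) if s > tau and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > tau and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= tau and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_tau(scores, labels):
    """Threshold with the highest F1 among midpoints of adjacent distinct scores."""
    vals = sorted(set(scores))
    candidates = [vals[0] - 1e-9] + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
    return max(candidates, key=lambda t: f1_at(scores, labels, t))

# Calibration split (invented scores): tau is chosen here and then frozen.
cal_scores = [0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
cal_labels = [True, True, True, False, False, False]
shipped_tau = best_tau(cal_scores, cal_labels)

# Benchmark split (invented scores): the distribution has shifted slightly.
test_scores = [0.60, 0.45, 0.32, 0.38, 0.25, 0.05]
test_labels = [True, True, True, False, False, False]

shipped_f1 = f1_at(test_scores, test_labels, shipped_tau)
oracle_tau = best_tau(test_scores, test_labels)   # tuned on the test data itself
oracle_f1 = f1_at(test_scores, test_labels, oracle_tau)
print(round(shipped_tau, 3), round(shipped_f1, 3), round(oracle_f1, 3))
```

The oracle threshold is picked using the benchmark labels, so its F1 can never be lower than the shipped-tau F1 on the same data, which is why it serves only as a diagnostic.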

License & attribution

Library code: MIT. Bundled/derived data and compliance: see NOTICE, which covers AdvBench (MIT), the frozen benign set, the hard negatives inspired by XSTest (itself CC-BY-4.0), Qwen (Apache-2.0), and the Built with Llama attribution required on the Llama opt-in path.
