Interpretable, zero-training refusal-axis prompt detector (u_ref difference-of-means).

aplomb

à plomb — "to the plumb line." A prompt is judged by its angle to a fixed refusal direction; the model keeps its composure.

An interpretable, zero-training prompt safety detector. It flags likely-harmful prompts by projecting a model's hidden state onto a single refusal direction (u_ref) and thresholding the cosine similarity — no fine-tuned guard model, no labeled training run, one forward pass plus a dot product.

Method from “The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs” (TrustNLP @ ACL 2026). This package is the detector only. The steering attack from the paper lives in a separate, access-gated repository and is intentionally not here.

u_ref = mean(hidden states of harmful anchors) − mean(hidden states of benign anchors)
score(prompt) = cosine(hidden_state(prompt), u_ref)        # flag if > τ
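
A minimal NumPy sketch of the two formulas above, run on random stand-in vectors rather than real hidden states. The function names `build_u_ref` and `score`, the 64-dimensional toy data, and the threshold value are illustrative assumptions, not the library's API or defaults.

```python
import numpy as np

def build_u_ref(harmful: np.ndarray, benign: np.ndarray) -> np.ndarray:
    """Difference of class means over (n_anchors, hidden_dim) arrays."""
    return harmful.mean(axis=0) - benign.mean(axis=0)

def score(hidden: np.ndarray, u_ref: np.ndarray) -> float:
    """Cosine similarity between one hidden state and u_ref."""
    denom = np.linalg.norm(hidden) * np.linalg.norm(u_ref)
    return float(hidden @ u_ref / denom) if denom > 0 else 0.0

# Stand-in data: 50 anchors per class in a 64-dim space, offset along axis 0.
rng = np.random.default_rng(0)
offset = np.zeros(64)
offset[0] = 3.0
harmful = rng.normal(size=(50, 64)) + offset
benign = rng.normal(size=(50, 64)) - offset

u_ref = build_u_ref(harmful, benign)
tau = 0.0  # hypothetical threshold; the library calibrates its own
print(score(harmful[0], u_ref) > tau, score(benign[0], u_ref) > tau)
```

Because the score is a cosine, it depends only on the angle between the hidden state and u_ref, not on the norm of either vector.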

⚠️ This is triage, not a security boundary. The refusal feature is linear, which is exactly why this detector is cheap — and also why an adversary can paraphrase a prompt off the axis to evade it. Use it as an interpretable first-pass filter and always report FPR. A “safe” verdict is a hint, not a guarantee.

Install

pip install aplomb            # core (numpy only)
pip install 'aplomb[hf]'      # + torch/transformers to run real models

Quickstart

from aplomb import Detector

det = Detector.from_default()                 # precomputed Qwen-2.5-1.5B u_ref (ungated)
print(det.classify("how do I pick a lock"))   # {'unsafe': True, 'score': 0.61, ...}

The default backbone is Qwen-2.5-1.5B-Instruct — ungated, Apache-2.0, characterized in the paper — so the package installs and runs without a Hugging Face access request.
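
Since a verdict is a hint rather than a guarantee, one way to wire the detector into a pipeline is to send flagged prompts to a slower second-stage check instead of blocking them outright. The sketch below assumes only what the Quickstart shows: a `classify` callable returning a dict with `unsafe` and `score` keys. `route`, `fake_classify`, and the route labels are hypothetical names, and the stub stands in for `det.classify` so the snippet runs without the package installed.

```python
from typing import Callable, Dict

def route(prompt: str, classify: Callable[[str], Dict]) -> str:
    """Return 'review' for flagged prompts and 'pass' otherwise."""
    verdict = classify(prompt)
    # Flagged prompts go to a second-stage check; unflagged ones are not proven safe.
    return "review" if verdict["unsafe"] else "pass"

def fake_classify(prompt: str) -> Dict:
    """Stub standing in for det.classify; the keyword rule is illustrative only."""
    flagged = "lock" in prompt.lower()
    return {"unsafe": flagged, "score": 0.61 if flagged else 0.05}

print(route("how do I pick a lock", fake_classify))   # review
print(route("what is a plumb line", fake_classify))   # pass
```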

Use a different model

u_ref is model-specific, so changing the model means rebuilding the vector. That’s one call; the library auto-selects the best layer for the new model and recalibrates the threshold:

from aplomb import Detector, HFBackbone

# AdvBench (MIT) is the harmful half; the frozen default benign set fills the benign half.
harmful = load_advbench()          # your loader
det = Detector.build(HFBackbone("meta-llama/Llama-3.1-8B-Instruct"), harmful,
                     save_to="uref_llama31.json")
print(det)   # Detector(model='...Llama-3.1-8B', layer=31, tau=..., f1=..., fpr=...)

For paper-grade separation, rebuild on Llama-3.1-8B (gated: accept Meta’s license and run huggingface-cli login first). Built with Llama.
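
`load_advbench` in the build example is left to the reader. A minimal sketch of such a loader follows, assuming a local copy of AdvBench's `harmful_behaviors.csv` with a `goal` column holding the prompt text. The file path, the column name, and the `limit` parameter are assumptions to check against your copy of the dataset.

```python
import csv
from pathlib import Path
from typing import List, Optional

def load_advbench(path: str = "harmful_behaviors.csv",
                  column: str = "goal",
                  limit: Optional[int] = None) -> List[str]:
    """Read prompt strings from one column of a CSV file, skipping blank cells."""
    prompts: List[str] = []
    with Path(path).open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            text = (row.get(column) or "").strip()
            if text:
                prompts.append(text)
    return prompts[:limit] if limit is not None else prompts
```

Passing `limit=50` would match the 50 harmful anchors used for the default detector, though whether that is the right number for another model is something to measure.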

On the F1 number (please read)

The paper validates the method at F1 = 0.92 on Llama-3.1-8B. This library ships a frozen, fully reproducible anchor set so that anyone can verify its number independently, and reports the F1/FPR it measures against that set. (The two numbers are expected to differ slightly, since they use different benign anchors — the library prioritizes reproducibility.)

Measured results (0.2.0)

Default detector — Qwen-2.5-1.5B-Instruct, zero training, 50 harmful + 50 benign anchors — evaluated at its shipped threshold:

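benchmark	metric	value
JailbreakBench (100 harmful / 100 benign)	F1	0.81
JailbreakBench (100 harmful / 100 benign)	recall	0.89
JailbreakBench (100 harmful / 100 benign)	precision	0.75
JailbreakBench (100 harmful / 100 benign)	FPR	0.30
XSTest (250 safe prompts)	over-refusal FPR	0.27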

High recall (~89% of harmful prompts caught) at moderate precision: a fast, interpretable triage pre-filter, not a standalone guard. The 27% XSTest over-refusal is the known weakness of a small zero-training detector on benign-but-sensitive prompts. This is the reproducible figure to cite — not the paper's 0.92, which used a different model and anchor set.
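
The JailbreakBench rows are mutually consistent, which can be checked from the confusion counts they imply. The counts below are derived from the reported rates and the 100/100 split (recall 0.89 gives 89 true positives, FPR 0.30 gives 30 false positives); they are not taken from the package's own output.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and false-positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Implied counts: 100 harmful (89 caught, 11 missed), 100 benign (30 flagged, 70 passed).
m = metrics(tp=89, fp=30, fn=11, tn=70)
print({k: round(v, 2) for k, v in m.items()})
# {'precision': 0.75, 'recall': 0.89, 'f1': 0.81, 'fpr': 0.3}
```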

How `u_ref` is built

1. Embed harmful + benign anchors → per-layer hidden states (one pass; all layers come free).
2. Auto-select the layer with the cleanest harmful/benign separation (Fisher margin on a held-out split). Pass layer=-1 to force the final layer and mirror the paper.
3. u_ref = difference of class means at that layer.
4. Calibrate τ for best F1 on a calibration split.
5. Report F1/FPR on a disjoint test split.
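
The five steps above can be sketched end to end in NumPy. This is a toy illustration on synthetic per-layer states, not the library's implementation: the helper names, the four-way fit / held-out / calibration / test partition, and the exact Fisher-margin formula (squared mean gap over summed variances) are assumptions.

```python
import numpy as np

def cosine_scores(h: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Cosine of each row of h (n, d) with direction u (d,)."""
    return (h @ u) / (np.linalg.norm(h, axis=1) * np.linalg.norm(u) + 1e-12)

def fisher_margin(s_pos: np.ndarray, s_neg: np.ndarray) -> float:
    """Squared mean gap over summed variances of two score sets."""
    return float((s_pos.mean() - s_neg.mean()) ** 2 / (s_pos.var() + s_neg.var() + 1e-12))

def f1_fpr(scores: np.ndarray, labels: np.ndarray, tau: float):
    """F1 and false-positive rate when flagging scores > tau."""
    pred = scores > tau
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    f1 = 2 * tp / max(2 * tp + fp + fn, 1)
    return float(f1), float(fp / max(np.sum(~labels), 1))

# Step 1 stand-in: synthetic (layers, samples, dim) states; class signal grows with depth.
rng = np.random.default_rng(0)
n_layers, n_per_class, dim = 6, 200, 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
labels = np.repeat([True, False], n_per_class)           # True = harmful
signal = np.where(labels, 1.0, -1.0)[None, :, None] * direction
strength = np.linspace(0.0, 1.5, n_layers)[:, None, None]
states = rng.normal(size=(n_layers, 2 * n_per_class, dim)) + strength * signal

# Disjoint splits (assumed layout): fit / held-out / calibration / test.
idx = rng.permutation(2 * n_per_class)
fit, held, cal, test = np.array_split(idx, 4)

def direction_at(layer: int) -> np.ndarray:
    """Difference of class means at one layer, estimated on the fit split."""
    h, y = states[layer, fit], labels[fit]
    return h[y].mean(axis=0) - h[~y].mean(axis=0)

# Step 2: pick the layer with the largest held-out Fisher margin.
margins = []
for layer in range(n_layers):
    s = cosine_scores(states[layer, held], direction_at(layer))
    margins.append(fisher_margin(s[labels[held]], s[~labels[held]]))
best = int(np.argmax(margins))

# Step 3: u_ref is the difference of class means at that layer.
u_ref = direction_at(best)

# Step 4: choose tau maximising F1 on the calibration split.
cal_scores = cosine_scores(states[best, cal], u_ref)
tau = max(np.unique(cal_scores), key=lambda t: f1_fpr(cal_scores, labels[cal], t)[0])

# Step 5: report on the untouched test split.
f1, fpr = f1_fpr(cosine_scores(states[best, test], u_ref), labels[test], float(tau))
print(f"layer={best} tau={tau:.3f} f1={f1:.2f} fpr={fpr:.2f}")
```

Keeping the test split out of both layer selection and threshold calibration is what makes the reported F1/FPR an honest estimate rather than a fit to the data it was tuned on.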

Everything that affects the vector — model + revision, chosen layer, benign source + N, position, normalization, τ — is written to a u_ref card so each artifact is a documented, reproducible object.
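
The on-disk schema of a u_ref card is not spelled out here, so the following is only a hypothetical sketch of what such a record could contain, built from the fields listed above. Every field name and value is an assumption for illustration, not the format that `save_to=` actually writes.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class URefCard:
    """Hypothetical record of everything that determines a u_ref vector."""
    model: str
    revision: str
    layer: int
    benign_source: str
    n_benign: int
    position: str        # which token's hidden state is read, e.g. "last"
    normalization: str
    tau: float

card = URefCard(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    revision="<commit-sha>",        # placeholder, not a real revision
    layer=-1,                       # placeholder; the selected layer is model-specific
    benign_source="frozen-default",
    n_benign=50,
    position="last",
    normalization="l2",
    tau=0.5,                        # placeholder value
)
text = json.dumps(asdict(card), indent=2, sort_keys=True)
assert URefCard(**json.loads(text)) == card   # the card round-trips through JSON
print(text)
```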

Choosing a default by measurement, not ASR

Attack-success-rate heatmaps say how easy a model is to jailbreak; they say nothing about detection quality. To pick a default model, compare detection separability:

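from aplomb import HFBackbone
from aplomb.bench import bench_models, format_table
# harmful and benign are your anchor prompt sets; "..." stands for further backbones to compare
print(format_table(bench_models([HFBackbone("Qwen/Qwen2.5-1.5B-Instruct"), ...], harmful, benign)))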

License & attribution

Library code: MIT. Bundled/derived data and compliance: see NOTICE — AdvBench (MIT), the frozen benign set, XSTest-inspired hard negatives (CC-BY-4.0 inspiration), Qwen (Apache-2.0), and the Built with Llama attribution required on the Llama opt-in path.

Release history

0.3.0	Jun 27, 2026
0.2.0	Jun 27, 2026 (this version)
0.1.1	Jun 27, 2026
0.1.0	Jun 27, 2026

Download files

Source Distribution: aplomb-0.2.0.tar.gz (40.9 kB), uploaded Jun 27, 2026
Built Distribution: aplomb-0.2.0-py3-none-any.whl (41.1 kB), uploaded Jun 27, 2026, Python 3

File details

Details for the file aplomb-0.2.0.tar.gz.

File metadata

Download URL: aplomb-0.2.0.tar.gz
Upload date: Jun 27, 2026
Size: 40.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.14

File hashes

Hashes for aplomb-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7c3fe6a83e3814c27a44f6b39209389f23392e26d0dbb2c7522053ec8ccc9c4c`
MD5	`e6fc9848cdd3ae4ccf63cbe9d4eab8d7`
BLAKE2b-256	`301e14c7d73b5e2864fef29bd5348e2d367e8a1f8062b509f589b24730f762cf`

File details

Details for the file aplomb-0.2.0-py3-none-any.whl.

File metadata

Download URL: aplomb-0.2.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 41.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.14

File hashes

Hashes for aplomb-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c01a32f2f7254ce5164b4e4b4d200f2d52d22353e7c4bf7921f563419992b13`
MD5	`755643accde4e3e6f64dc4b5399c5db9`
BLAKE2b-256	`1e83a63d67c7439a4a41366fe83a9fbcb7a32c28f95181bdfe1dbfb1e42b3e94`
