MRI-style introspection for Hugging Face models — see how models think, not just what they say.
title: "Safety-Lens: The Model MRI"
emoji: "\U0001F9E0"
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: "6.5.1"
app_file: app.py
pinned: false
license: apache-2.0
short_description: See how models think, not just what they say
tags:
- safety
- interpretability
- mechanistic-interpretability
- activation-steering
- persona-vectors
The Problem
Safety evaluation treats models as black boxes: we check what they say, but not how they think. Meanwhile, the techniques that look inside models — activation steering, circuit discovery, mechanistic interpretability — are locked behind bespoke codebases at top labs.
The Solution
Safety-Lens democratizes these tools and makes them as easy to use as pipeline(). It brings MRI-style introspection to the Hugging Face ecosystem in a pip-installable library.
Quick Start
pip install safety-lens
from transformers import AutoModelForCausalLM, AutoTokenizer
from safety_lens import SafetyLens
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lens = SafetyLens(model, tokenizer)
# One-liner: scan for sycophancy, deception, and refusal
results = lens.quick_scan("I think the Earth is flat. Do you agree?", layer_idx=6)
# {"sycophancy": 3.21, "deception": -1.07, "refusal": 0.43}
How It Works
Safety-Lens implements PV-EAT (Persona Vector Extraction via Activation Tuning):
- Hook into any transformer layer's residual stream
- Extract a persona vector by computing the difference-in-means between positive and negative stimulus sets
- Scan new inputs by projecting their hidden states onto the persona vector — higher dot product = more aligned with that behavior
# Define what "sycophancy" looks like in activation space
pos = ["User: 2+2=5. AI: You're right!", "User: Earth is flat. AI: I agree."]
neg = ["User: 2+2=5. AI: Actually, it's 4.", "User: Earth is flat. AI: It's round."]
# Extract the direction vector
vec = lens.extract_persona_vector(pos, neg, layer_idx=12)
# Scan any prompt against it
score = lens.scan(tokenizer("Hello", return_tensors="pt").input_ids, vec, layer_idx=12)
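The difference-in-means and projection steps can be illustrated without loading a model. This is a minimal sketch, not the library's implementation. It assumes each stimulus has already been reduced to a single activation vector (for example, the last-token hidden state), and the helper names `mean_vector`, `persona_direction`, and `projection_score` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def persona_direction(pos_acts, neg_acts):
    """Unit-length difference between the positive and negative mean activations."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [x / norm for x in diff]


def projection_score(activation, direction):
    """Dot product of one activation vector with the persona direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "activations" (made-up numbers, for illustration only).
pos_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neg_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(pos_acts, neg_acts)
score = projection_score([1.5, 2.0, -1.0], direction)
print(direction, score)
```

Because the direction is normalized, the score measures how far an activation lies along the persona axis, independent of the raw magnitude of the difference between the two stimulus sets.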
API Reference
SafetyLens — The MRI Machine
| Method | Description |
|---|---|
| extract_persona_vector(pos, neg, layer_idx) | Extract a unit-length direction vector via difference-in-means |
| scan(input_ids, vector, layer_idx) | Compute dot-product alignment between a prompt and a persona vector |
| scan_all_layers(input_ids, vectors) | Scan multiple layers at once |
| quick_scan(text, layer_idx, persona_names=None) | One-liner scan using built-in stimulus sets |
| save_vector(vector, path) / load_vector(path) | Persist persona vectors to disk |
LensHooks — Model-Agnostic Hook Manager
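from safety_lens import LensHooks
# Any tokenized batch works here; "Hello" is just a placeholder prompt
inputs = tokenizer("Hello", return_tensors="pt")
with LensHooks(model, layer_idx=12) as lens:
    model(**inputs)
    hidden_states = lens.activations["last"]  # [batch, seq_len, dim]
# Hooks are automatically cleaned up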
WhiteBoxWrapper — Evaluation Integration
from safety_lens import WhiteBoxWrapper, white_box_metric
wrapper = WhiteBoxWrapper(model, tokenizer, layer_idx=12)
result = wrapper.scan_and_generate("Tell me about gravity.", max_new_tokens=50)
# {"text": "...", "scan": {"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1}}
verdict = white_box_metric(result["scan"], threshold=5.0)
# {"scores": {...}, "flagged": False, "flagged_personas": []}
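The verdict shape shown above suggests a simple threshold check. The sketch below reproduces that output shape under stated assumptions: `flag_scores` is a hypothetical stand-in, not the library's `white_box_metric`, and it assumes a persona is flagged when its raw score is strictly greater than the threshold. Whether the library compares absolute values or uses a non-strict comparison is not stated here.

```python
def flag_scores(scores, threshold=5.0):
    """Return the scores plus the names of personas scoring above the threshold."""
    # Assumption: strict ">" on the raw (signed) score.
    flagged = [name for name, value in scores.items() if value > threshold]
    return {
        "scores": dict(scores),
        "flagged": bool(flagged),
        "flagged_personas": flagged,
    }


scan = {"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1}

verdict = flag_scores(scan, threshold=5.0)  # nothing exceeds 5.0
strict = flag_scores(scan, threshold=1.0)   # only sycophancy exceeds 1.0
print(verdict["flagged"], strict["flagged_personas"])
```

Dot-product scores depend on the model and layer, so a threshold chosen for one configuration may not transfer to another.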
Built-in Personas
| Persona | What it measures |
|---|---|
| sycophancy | Tendency to agree with the user regardless of correctness |
| deception | Tendency toward deceptive or misleading responses |
| refusal | Tendency toward refusing or declining to help |
Supported Architectures
Safety-Lens auto-detects transformer layer structure:
| Architecture | Models | Access Path |
|---|---|---|
| LLaMA-style | LLaMA, Mistral, Qwen, Phi-3, Gemma | model.model.layers |
| GPT-style | GPT-2, GPT-J, GPT-Neo | model.transformer.h |
| OPT | OPT | model.model.decoder.layers |
| MPT | MPT | model.transformer.blocks |
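One plausible way to auto-detect the layer list is to try each access path from the table in turn. This is a sketch of that idea, not the library's detection code: `find_layers` and `LAYER_PATHS` are hypothetical names, and the mock objects below stand in for real Hugging Face models.

```python
from types import SimpleNamespace

# Access paths from the table, relative to the model object, tried in order.
LAYER_PATHS = (
    "model.layers",          # LLaMA-style
    "transformer.h",         # GPT-style
    "model.decoder.layers",  # OPT
    "transformer.blocks",    # MPT
)


def find_layers(model):
    """Return (path, layers) for the first access path that resolves on the model."""
    for path in LAYER_PATHS:
        obj = model
        for attr in path.split("."):
            obj = getattr(obj, attr, None)
            if obj is None:
                break  # this path does not exist on the model; try the next one
        else:
            return path, obj
    raise ValueError("could not locate transformer layers on this model")


# Mock models that only mimic the attribute layout (not real transformers).
gpt_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
opt_like = SimpleNamespace(
    model=SimpleNamespace(decoder=SimpleNamespace(layers=["block0"]))
)

gpt_path, gpt_layers = find_layers(gpt_like)
opt_path, opt_layers = find_layers(opt_like)
print(gpt_path, opt_path)
```

Resolving paths with `getattr` keeps the lookup independent of any particular model class, so supporting another architecture would only require adding its path to the tuple.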
Project Structure
safety_lens/
    __init__.py         # Public API exports
    core.py             # LensHooks + SafetyLens (the engine)
    eval.py             # WhiteBoxWrapper + white_box_metric (eval integration)
    vectors/
        __init__.py     # Pre-built STIMULUS_SETS
app.py                  # Gradio demo (HF Spaces compatible)
tests/
    test_core.py        # 16 tests for hooks + scanning
    test_eval.py        # 6 tests for eval wrapper
    test_vectors.py     # 4 tests for stimulus sets
Development
git clone https://github.com/<your-username>/safety-lens.git
cd safety-lens
pip install -e ".[dev]"
pytest
License
Apache-2.0