MRI-style introspection for Hugging Face models — see how models think, not just what they say.


---
title: "Safety-Lens: The Model MRI"
emoji: "\U0001F9E0"
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: "6.5.1"
app_file: app.py
pinned: false
license: apache-2.0
short_description: See how models think, not just what they say
tags:
  - safety
  - interpretability
  - mechanistic-interpretability
  - activation-steering
  - persona-vectors
---

Safety-Lens

The open-source MRI for AI models.

See how models think, not just what they say.


The Problem

Safety evaluation treats models as black boxes: we check what they say, but not how they think. Meanwhile, the techniques that look inside models — activation steering, circuit discovery, mechanistic interpretability — are locked behind bespoke codebases at top labs.

The Solution

Safety-Lens democratizes these tools and makes them as easy to use as pipeline(). It brings MRI-style introspection to the Hugging Face ecosystem in a pip-installable library.

Quick Start

pip install safety-lens

from transformers import AutoModelForCausalLM, AutoTokenizer
from safety_lens import SafetyLens

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lens = SafetyLens(model, tokenizer)

# One-liner: scan for sycophancy, deception, and refusal
results = lens.quick_scan("I think the Earth is flat. Do you agree?", layer_idx=6)
# {"sycophancy": 3.21, "deception": -1.07, "refusal": 0.43}

How It Works

Safety-Lens implements PV-EAT (Persona Vector Extraction via Activation Tuning):

1. Hook into any transformer layer's residual stream
2. Extract a persona vector by computing the difference-in-means between positive and negative stimulus sets
3. Scan new inputs by projecting their hidden states onto the persona vector; a higher dot product means the input is more aligned with that behavior
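
The extraction and scanning steps above can be illustrated without loading a model. The following is a minimal sketch in plain Python, assuming each example's activations have already been pooled into a single vector. The helper names (`mean_vector`, `persona_vector`, `project`) and the toy 3-dimensional activations are hypothetical and are not part of the Safety-Lens API; the unit-length normalization follows the API description of `extract_persona_vector`.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(pos_acts, neg_acts):
    """Difference-in-means direction, normalized to unit length."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("positive and negative means are identical")
    return [d / norm for d in diff]

def project(activation, vector):
    """Dot product of one activation with the persona direction."""
    return sum(a * v for a, v in zip(activation, vector))

# Toy activations (hypothetical): the behavior shows up along the first axis.
pos_acts = [[2.0, 0.1, 0.0], [4.0, -0.1, 0.0]]
neg_acts = [[0.0, 0.1, 0.0], [0.0, -0.1, 0.0]]

vec = persona_vector(pos_acts, neg_acts)
print(vec)                             # [1.0, 0.0, 0.0]
print(project([3.0, 5.0, 1.0], vec))   # 3.0
print(project([-1.0, 5.0, 1.0], vec))  # -1.0
```

Because the direction is unit length, the score is the signed length of the activation's component along that direction, so only the first coordinate matters in this toy case.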

# Define what "sycophancy" looks like in activation space
pos = ["User: 2+2=5. AI: You're right!", "User: Earth is flat. AI: I agree."]
neg = ["User: 2+2=5. AI: Actually, it's 4.", "User: Earth is flat. AI: It's round."]

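# Extract the direction vector.
# layer_idx=12 assumes a model with more than 12 blocks; GPT-2 small from the
# Quick Start has 12 blocks, so pick a smaller index (such as 6) for that model.
vec = lens.extract_persona_vector(pos, neg, layer_idx=12)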

# Scan any prompt against it
score = lens.scan(tokenizer("Hello", return_tensors="pt").input_ids, vec, layer_idx=12)

API Reference

`SafetyLens` — The MRI Machine

| Method | Description |
| --- | --- |
| `extract_persona_vector(pos, neg, layer_idx)` | Extract a unit-length direction vector via difference-in-means |
| `scan(input_ids, vector, layer_idx)` | Compute dot-product alignment between a prompt and a persona vector |
| `scan_all_layers(input_ids, vectors)` | Scan multiple layers at once |
| `quick_scan(text, layer_idx, persona_names=None)` | One-liner scan using built-in stimulus sets |
| `save_vector(vector, path)` / `load_vector(path)` | Persist persona vectors to disk |

`LensHooks` — Model-Agnostic Hook Manager

from safety_lens import LensHooks

with LensHooks(model, layer_idx=12) as lens:
    model(**inputs)
    hidden_states = lens.activations["last"]  # [batch, seq_len, dim]
# Hooks are automatically cleaned up
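
For readers unfamiliar with the pattern, a hook manager of this kind can be built on PyTorch's `register_forward_hook`. The sketch below is an illustration under stated assumptions, not the library's implementation: the class name `ActivationCapture` and the toy two-layer model are hypothetical, and only the `"last"` key mirrors the usage shown above.

```python
import torch
from torch import nn

class ActivationCapture:
    """Context manager that records one module's output and removes its hook on exit."""

    def __init__(self, module: nn.Module):
        self.module = module
        self.activations = {}
        self._handle = None

    def _hook(self, module, inputs, output):
        # Transformer blocks often return a tuple; keep the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        self.activations["last"] = hidden.detach()

    def __enter__(self):
        self._handle = self.module.register_forward_hook(self._hook)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always detach the hook, even if the forward pass raised.
        self._handle.remove()
        return False

# Toy stand-in for a stack of transformer blocks (hypothetical).
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
x = torch.randn(2, 5, 8)  # [batch, seq_len, dim]

with ActivationCapture(toy[0]) as cap:
    with torch.no_grad():
        toy(x)

print(cap.activations["last"].shape)  # torch.Size([2, 5, 8])
```

Removing the handle in `__exit__` is what keeps repeated scans from stacking hooks on the same module.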

`WhiteBoxWrapper` — Evaluation Integration

from safety_lens import WhiteBoxWrapper, white_box_metric

wrapper = WhiteBoxWrapper(model, tokenizer, layer_idx=12)
result = wrapper.scan_and_generate("Tell me about gravity.", max_new_tokens=50)
# {"text": "...", "scan": {"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1}}

verdict = white_box_metric(result["scan"], threshold=5.0)
# {"scores": {...}, "flagged": False, "flagged_personas": []}
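
The verdict shape above suggests a simple threshold check over the scan scores. The following sketch shows one way such a check could work. The function name `flag_scores` is hypothetical, and the rule that a persona is flagged when its score is strictly greater than the threshold is an assumption; the library may compare differently (for example, on absolute values).

```python
def flag_scores(scores, threshold=5.0):
    """Return a verdict dict shaped like the example output above.

    Assumption: a persona is flagged when its score is strictly above the threshold.
    """
    flagged_personas = [name for name, value in scores.items() if value > threshold]
    return {
        "scores": dict(scores),
        "flagged": bool(flagged_personas),
        "flagged_personas": flagged_personas,
    }

scan = {"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1}

print(flag_scores(scan, threshold=5.0)["flagged"])           # False
print(flag_scores(scan, threshold=1.0)["flagged_personas"])  # ['sycophancy']
```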

Built-in Personas

| Persona | What it measures |
| --- | --- |
| `sycophancy` | Tendency to agree with the user regardless of correctness |
| `deception` | Tendency toward deceptive or misleading responses |
| `refusal` | Tendency toward refusing or declining to help |

Supported Architectures

Safety-Lens auto-detects transformer layer structure:

| Architecture | Models | Access Path |
| --- | --- | --- |
| LLaMA-style | LLaMA, Mistral, Qwen, Phi-3, Gemma | `model.model.layers` |
| GPT-style | GPT-2, GPT-J, GPT-Neo | `model.transformer.h` |
| OPT | OPT | `model.model.decoder.layers` |
| MPT | MPT | `model.transformer.blocks` |
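
Auto-detection of this kind can be done by trying each access path in turn. The sketch below is one possible approach, not the library's code: the function `find_layers`, the `LAYER_PATHS` tuple, and the toy objects are hypothetical. The paths are the ones listed in the table, written relative to the model object (so `model.transformer.h` becomes `"transformer.h"`).

```python
from functools import reduce
from types import SimpleNamespace

# Access paths taken from the table above, tried in order.
LAYER_PATHS = (
    "model.layers",          # LLaMA-style
    "transformer.h",         # GPT-style
    "model.decoder.layers",  # OPT
    "transformer.blocks",    # MPT
)

def find_layers(model):
    """Return (path, layers) for the first access path that resolves on `model`."""
    for path in LAYER_PATHS:
        try:
            layers = reduce(getattr, path.split("."), model)
        except AttributeError:
            continue
        return path, layers
    raise ValueError("unrecognized architecture: no known layer path found")

# Toy stand-ins (hypothetical) that only mimic the attribute layout.
gpt_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
opt_like = SimpleNamespace(model=SimpleNamespace(decoder=SimpleNamespace(layers=["block0"])))

print(find_layers(gpt_like))  # ('transformer.h', ['block0', 'block1'])
print(find_layers(opt_like))  # ('model.decoder.layers', ['block0'])
```

Raising on an unknown layout, rather than returning `None`, surfaces unsupported architectures at hook time instead of later during a scan.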

Project Structure

safety_lens/
  __init__.py          # Public API exports
  core.py              # LensHooks + SafetyLens (the engine)
  eval.py              # WhiteBoxWrapper + white_box_metric (eval integration)
  vectors/
    __init__.py        # Pre-built STIMULUS_SETS
app.py                 # Gradio demo (HF Spaces compatible)
tests/
  test_core.py         # 16 tests for hooks + scanning
  test_eval.py         # 6 tests for eval wrapper
  test_vectors.py      # 4 tests for stimulus sets

Development

git clone https://github.com/anthony-maio/safety-lens.git
cd safety-lens
pip install -e ".[dev]"
pytest

License

Apache 2.0

Project details

Maintainers

anthony.maio


Release history

0.1.0 (this version), released Feb 11, 2026

Download files


Source Distribution

safety_lens-0.1.0.tar.gz (15.9 kB view details)

Uploaded Feb 11, 2026 Source

Built Distribution


safety_lens-0.1.0-py3-none-any.whl (13.0 kB view details)

Uploaded Feb 11, 2026 Python 3

File details

Details for the file safety_lens-0.1.0.tar.gz.

File metadata

Download URL: safety_lens-0.1.0.tar.gz
Upload date: Feb 11, 2026
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for safety_lens-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8053b874c2754e006146aa776e2e801aad360507374df2813c8145b81dd6b98c` |
| MD5 | `07ad0c57fef28997bab11dbdafcf015b` |
| BLAKE2b-256 | `0903f9ffc9fa9d00362d0fb7030ef0506b3302e201ce89c3eefb487a75d5d012` |


Provenance

The following attestation bundles were made for safety_lens-0.1.0.tar.gz:

Publisher: workflow.yml on anthony-maio/safety-lens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: safety_lens-0.1.0.tar.gz
- Subject digest: 8053b874c2754e006146aa776e2e801aad360507374df2813c8145b81dd6b98c
- Sigstore transparency entry: 941320439
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: anthony-maio/safety-lens@f3b7e60cc917dc2978acc7a6450b2a0602324843
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/anthony-maio
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@f3b7e60cc917dc2978acc7a6450b2a0602324843
- Trigger Event: release

File details

Details for the file safety_lens-0.1.0-py3-none-any.whl.

File metadata

Download URL: safety_lens-0.1.0-py3-none-any.whl
Upload date: Feb 11, 2026
Size: 13.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for safety_lens-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ddfcc444c70d04e8823fb409b9e6a08acc8d77189db8b6aab546fcd19eea2d67` |
| MD5 | `92a539071010a1437763b943eb3d4c1b` |
| BLAKE2b-256 | `defc50c79d78dac4b7589d5f586310ff347693528cd4a87bec8c3339ab176ea3` |


Provenance

The following attestation bundles were made for safety_lens-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on anthony-maio/safety-lens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: safety_lens-0.1.0-py3-none-any.whl
- Subject digest: ddfcc444c70d04e8823fb409b9e6a08acc8d77189db8b6aab546fcd19eea2d67
- Sigstore transparency entry: 941320460
- Sigstore integration time: Feb 11, 2026
Source repository:
- Permalink: anthony-maio/safety-lens@f3b7e60cc917dc2978acc7a6450b2a0602324843
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/anthony-maio
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@f3b7e60cc917dc2978acc7a6450b2a0602324843
- Trigger Event: release
