
MRI-style introspection for Hugging Face models — see how models think, not just what they say.

---
title: "Safety-Lens: The Model MRI"
emoji: "\U0001F9E0"
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: "6.5.1"
app_file: app.py
pinned: false
license: apache-2.0
short_description: See how models think, not just what they say
tags:
  - safety
  - interpretability
  - mechanistic-interpretability
  - activation-steering
  - persona-vectors
---

Safety-Lens

The open-source MRI for AI models.

See how models think, not just what they say.

License Python 3.10+ Tests



The Problem

Safety evaluation treats models as black boxes: we check what they say, but not how they think. Meanwhile, the techniques that look inside models — activation steering, circuit discovery, mechanistic interpretability — are locked behind bespoke codebases at top labs.

The Solution

Safety-Lens democratizes these tools and makes them as easy to use as pipeline(). It brings MRI-style introspection to the Hugging Face ecosystem in a pip-installable library.

Quick Start

pip install safety-lens

from transformers import AutoModelForCausalLM, AutoTokenizer
from safety_lens import SafetyLens

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lens = SafetyLens(model, tokenizer)

# One-liner: scan for sycophancy, deception, and refusal
results = lens.quick_scan("I think the Earth is flat. Do you agree?", layer_idx=6)
# {"sycophancy": 3.21, "deception": -1.07, "refusal": 0.43}

How It Works

Safety-Lens implements PV-EAT (Persona Vector Extraction via Activation Tuning):

  1. Hook into any transformer layer's residual stream
  2. Extract a persona vector by computing the difference-in-means between positive and negative stimulus sets
  3. Scan new inputs by projecting their hidden states onto the persona vector; a higher dot product means the input is more aligned with that behavior

# Define what "sycophancy" looks like in activation space
pos = ["User: 2+2=5. AI: You're right!", "User: Earth is flat. AI: I agree."]
neg = ["User: 2+2=5. AI: Actually, it's 4.", "User: Earth is flat. AI: It's round."]

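# Extract the direction vector. layer_idx must exist in the loaded model;
# the 12-block "gpt2" checkpoint from Quick Start may need a smaller index (e.g. 6).
vec = lens.extract_persona_vector(pos, neg, layer_idx=12)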

# Scan any prompt against it
score = lens.scan(tokenizer("Hello", return_tensors="pt").input_ids, vec, layer_idx=12)
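
The three steps above can be illustrated without the library. The following is a minimal sketch, not Safety-Lens's actual implementation: it assumes each stimulus has already been reduced to a single activation vector (how the library pools over tokens is not stated here), and the helper names `mean_vector`, `persona_direction`, and `projection_score` are hypothetical. The numbers are made up for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(pos_acts, neg_acts):
    """Unit-length difference-in-means direction (positive mean minus negative mean)."""
    diff = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide; no direction to extract")
    return [d / norm for d in diff]


def projection_score(activation, direction):
    """Dot product of one activation vector with the persona direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "activations" (made-up numbers, not real model output)
pos_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neg_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(pos_acts, neg_acts)
score = projection_score([5.0, 7.0, 1.0], direction)
print(direction, score)
```

Because the direction is normalized to unit length, the score is the signed length of the activation's component along that direction, which is why scores can be negative as well as positive.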

API Reference

SafetyLens — The MRI Machine

| Method | Description |
| --- | --- |
| extract_persona_vector(pos, neg, layer_idx) | Extract a unit-length direction vector via difference-in-means |
| scan(input_ids, vector, layer_idx) | Compute dot-product alignment between a prompt and a persona vector |
| scan_all_layers(input_ids, vectors) | Scan multiple layers at once |
| quick_scan(text, layer_idx, persona_names=None) | One-liner scan using built-in stimulus sets |
| save_vector(vector, path) / load_vector(path) | Persist persona vectors to disk |

LensHooks — Model-Agnostic Hook Manager

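from safety_lens import LensHooks

# model and tokenizer as loaded in Quick Start
inputs = tokenizer("Hello", return_tensors="pt")

# Named `hooks` so it does not shadow the SafetyLens instance `lens`
with LensHooks(model, layer_idx=12) as hooks:
    model(**inputs)
    hidden_states = hooks.activations["last"]  # [batch, seq_len, dim]
# Hooks are removed automatically when the block exits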
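
For readers unfamiliar with the mechanism, a context-managed hook can be built on PyTorch's `register_forward_hook`. The sketch below is an assumption about the general pattern, not the LensHooks source: the class name `CaptureHook` is hypothetical, and a small stack of `nn.Linear` layers stands in for transformer blocks.

```python
import torch
from torch import nn


class CaptureHook:
    """Hypothetical stand-in for a hook manager: records one module's output."""

    def __init__(self, module: nn.Module):
        self.module = module
        self.activations = {}
        self._handle = None

    def _record(self, module, inputs, output):
        # Transformer blocks often return a tuple; keep the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        self.activations["last"] = hidden.detach()

    def __enter__(self):
        self._handle = self.module.register_forward_hook(self._record)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._handle.remove()  # detach the hook even if the forward pass raised
        return False


# Toy stand-in for a stack of transformer layers
layers = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 2))
x = torch.randn(1, 3, 4)  # [batch, seq_len, dim]

with CaptureHook(layers[1]) as hook:
    layers(x)

captured_shape = tuple(hook.activations["last"].shape)
hooks_left = len(layers[1]._forward_hooks)  # private attribute, used here only to check cleanup
print(captured_shape, hooks_left)
```

Removing the handle in `__exit__` is the important design choice: a hook left attached keeps firing on every later forward pass and holds references to tensors.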

WhiteBoxWrapper — Evaluation Integration

from safety_lens import WhiteBoxWrapper, white_box_metric

wrapper = WhiteBoxWrapper(model, tokenizer, layer_idx=12)
result = wrapper.scan_and_generate("Tell me about gravity.", max_new_tokens=50)
# {"text": "...", "scan": {"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1}}

verdict = white_box_metric(result["scan"], threshold=5.0)
# {"scores": {...}, "flagged": False, "flagged_personas": []}
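
The thresholding step can be sketched in a few lines. This is a guess at the logic based only on the return shape shown above: the function name `flag_scores` is hypothetical, and comparing the raw score (rather than its absolute value) against the threshold is an assumption, since the comparison `white_box_metric` uses is not documented here.

```python
def flag_scores(scores, threshold=5.0):
    """Flag every persona whose score exceeds the threshold (assumed rule)."""
    flagged = sorted(name for name, value in scores.items() if value > threshold)
    return {
        "scores": dict(scores),
        "flagged": bool(flagged),
        "flagged_personas": flagged,
    }


# Scores copied from the example scan above
calm = flag_scores({"sycophancy": 1.2, "deception": -0.5, "refusal": 0.1})

# Made-up scores where one persona crosses the threshold
alert = flag_scores({"sycophancy": 6.3, "deception": -0.5, "refusal": 0.1})
print(calm["flagged"], alert["flagged_personas"])
```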

Built-in Personas

| Persona | What it measures |
| --- | --- |
| sycophancy | Tendency to agree with the user regardless of correctness |
| deception | Tendency toward deceptive or misleading responses |
| refusal | Tendency toward refusing or declining to help |

Supported Architectures

Safety-Lens auto-detects transformer layer structure:

| Architecture | Models | Access Path |
| --- | --- | --- |
| LLaMA-style | LLaMA, Mistral, Qwen, Phi-3, Gemma | model.model.layers |
| GPT-style | GPT-2, GPT-J, GPT-Neo | model.transformer.h |
| OPT | OPT | model.model.decoder.layers |
| MPT | MPT | model.transformer.blocks |
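
One plausible way to implement this kind of detection is to try each access path in turn and return the first that resolves. The sketch below is an assumption about the approach, not the library's code: `find_layers` and `CANDIDATE_PATHS` are hypothetical names, and a `SimpleNamespace` object stands in for a real model so the example runs without downloading weights.

```python
from functools import reduce
from types import SimpleNamespace

# Attribute paths from the table above, relative to the model object
CANDIDATE_PATHS = (
    "model.layers",          # LLaMA-style
    "transformer.h",         # GPT-style
    "model.decoder.layers",  # OPT
    "transformer.blocks",    # MPT
)


def find_layers(model):
    """Return (path, layers) for the first candidate path that resolves."""
    for path in CANDIDATE_PATHS:
        try:
            return path, reduce(getattr, path.split("."), model)
        except AttributeError:
            continue
    raise ValueError("unsupported architecture: no known layer path found")


# Toy object shaped like a GPT-style model with two blocks
fake_gpt = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
path, layers = find_layers(fake_gpt)
print(path, len(layers))
```

Knowing the layer count this way also makes it easy to validate a `layer_idx` before hooking, for example by checking it against `len(layers)`.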

Project Structure

safety_lens/
  __init__.py          # Public API exports
  core.py              # LensHooks + SafetyLens (the engine)
  eval.py              # WhiteBoxWrapper + white_box_metric (eval integration)
  vectors/
    __init__.py        # Pre-built STIMULUS_SETS
app.py                 # Gradio demo (HF Spaces compatible)
tests/
  test_core.py         # 16 tests for hooks + scanning
  test_eval.py         # 6 tests for eval wrapper
  test_vectors.py      # 4 tests for stimulus sets

Development

git clone https://github.com/<your-username>/safety-lens.git
cd safety-lens
pip install -e ".[dev]"
pytest

License

Apache 2.0


