Linear probes and activation steering for transformer models
reprobe
Linear probes and activation steering for transformer safety research
Built on ideas from Representation Engineering (2023)
reprobe is a tool for monitoring and steering LLMs. It helps you find where "concepts" (like toxicity or bias) live in the model's activations and lets you modify them in real time, during generation.
Why? I built reprobe to provide a practical, efficient implementation of the RepE paper. My goal was to create a tool that works with large models on normal hardware, without losing the mathematical clarity and control needed for safety research.
Features
The library is designed to be highly ergonomic yet mathematically rigorous. It abstracts away the complex engineering so you can focus on the research.
- Complete End-to-End Pipeline: Not just a steering script. reprobe provides a unified workflow to capture activations, train probes, and apply them (Monitoring & Steering).
- Phase-Aware Processing (Prefill vs. Token): Most naive implementations treat prompt processing and token generation the same way. reprobe allows you to train and apply distinct probes for the prefill phase and the token phase, heavily improving steering quality.
- OOM-Proof Activation Storage: Capturing LLM activations usually blows up your RAM in seconds. reprobe streams activations directly to disk using an optimized h5py backend (ActivationStore), allowing you to build massive datasets on consumer hardware.
- Granular Steering Control: Control the steering strength (alpha) globally, per-layer, per-phase, or even dynamically using a custom callback function. You can also choose between projected (recommended) and uniform steering.
- Plug-and-Play with HuggingFace: Automatically detects the architecture of modern models (Llama, Qwen, Mistral, Phi, Gemma, etc.). It uses clean PyTorch forward hooks, meaning you don't have to rewrite the model's forward pass; just call model.generate() as usual.
- Cloud-Ready Probes: Load and share your trained .pt or registry.json probes directly from local folders or HuggingFace Hub repositories.
Installation
pip install reprobe
Tested on Python ≥ 3.11 and PyTorch ≥ 2.6.
Quick Start: Monitoring and/or Steering an LLM
If you already have trained probes (locally or on the HuggingFace Hub), steering a model takes only a few lines of code. During inference, the library stays out of your way: it adapts to your workflow, not the other way around.
Note: Probes are specific to the model they were trained on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from reprobe import ProbeLoader
model_id = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# 1. Load your probes and create a Steerer and a Monitor
# You can load from a local path or directly from the HuggingFace Hub
probe_dir = "YourUsername/your-probes-repo"  # Local: "path/to/probes/registry.json" or "path/to/probes.pt"
steerer = ProbeLoader.steerer(
    model,
    probe_dir,
    alpha={"prefill": 1.0, "token": 2.5},  # Steering strength
    # alpha can also be set per layer, or computed dynamically by a callback function
    filter=lambda meta: meta["layer"] in range(12, 20),  # Only steer middle layers. Optional.
    mode="all",  # "prefill", "token", or "all". Must be compatible with your probes.
)
monitor = ProbeLoader.monitor(
    model,
    probe_dir,
    filter=lambda meta: meta["layer"] in range(12, 20),
    mode="prefill",  # A monitor usually only needs "prefill", which is the cheapest option. "all" also works; "token" is less efficient.
)
# 2. Attach hooks to the model (/!\ Steerer can affect your generation output)
monitor.attach()
steerer.attach()
# 3. Generate text (the residual stream is now being steered in real-time)
inputs = tokenizer("How do I make a...", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
# Retrieve monitor scores
score = monitor.score(
    strategy="max_of_means",
    # flush_buffer resets the monitor's internal state. Pass False to keep the
    # state intact if you want to call score() again or keep scoring continuously.
    # You must flush at least once between two generations;
    # monitor.flush_buffer() does the same thing without scoring.
    flush_buffer=False,
)
score_mean_of_means = monitor.score(
    strategy="mean_of_means",
)
# 4. Cleanup
monitor.detach()
steerer.detach()
# After detach(), the model runs without the monitor or steerer. While the hooks stay attached, the probes remain active.
[!WARNING] Always call monitor.flush_buffer() or monitor.score(flush_buffer=True) between two generations. Calling score() without flushing accumulates history and returns incorrect results.
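To make the warning concrete, here is a toy stand-in for the monitor's buffer. ToyMonitor and everything inside it are hypothetical: the class only mimics the documented behaviour of score(flush_buffer=...) and flush_buffer(), and reprobe's real Monitor may store and aggregate its state differently.

```python
from statistics import mean


class ToyMonitor:
    """Hypothetical stand-in for a monitor: it only stores per-token probabilities."""

    def __init__(self):
        self._buffer = []  # one probe probability per observed token

    def observe(self, prob):
        self._buffer.append(prob)

    def flush_buffer(self):
        self._buffer.clear()

    def score(self, *, flush_buffer):
        # Aggregate everything currently in the buffer, then optionally reset it.
        result = mean(self._buffer) if self._buffer else 0.0
        if flush_buffer:
            self.flush_buffer()
        return result


monitor = ToyMonitor()

# Generation 1: a prompt the probe flags strongly.
for p in (0.9, 0.9):
    monitor.observe(p)
first = monitor.score(flush_buffer=False)  # the buffer is kept

# Generation 2: a harmless prompt, but the old history is still in the buffer.
for p in (0.1, 0.1):
    monitor.observe(p)
contaminated = monitor.score(flush_buffer=True)  # mixes both generations, then resets

# Generation 2 again, this time starting from a clean buffer.
for p in (0.1, 0.1):
    monitor.observe(p)
clean = monitor.score(flush_buffer=True)

print(first, contaminated, clean)
```

The second score is pulled up by the first generation's history, which is the failure mode the warning describes.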
End-to-End Workflow: Train Your Own Probes
Want to train your own probes? The workflow is divided into 3 simple steps: Collect, Train, and Apply.
Tip: I recommend using mode="all". It allows you to use the probes for either prefill or token steering later during inference.
See a complete implementation of the RepE pipeline with reprobe in examples/repe_harmless.py.
Step 1: Collect Activations
Use Interceptor to hook into the model and ActivationStore to save the raw activations directly to an HDF5 file (safeguarding your RAM).
from reprobe import Interceptor, ActivationStore
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
prompts = ["I want to help people.", "I want to hurt people."]
# Initialize persistent HDF5 store
store = ActivationStore(
    path="outputs/acts/store.h5",
    N=len(prompts),
    mode="all",
    start_layer=10,
    end_layer=model.config.num_hidden_layers,
)
# Hook layers 10 through the end, capture both prefill and token activations
interceptor = Interceptor(model, start_layer=10, training_mode="all").attach()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    interceptor.allow_one_capture(batch_size=1)  # IMPORTANT
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=20)
    # Get activations for this prompt
    flushed = interceptor.flush_batch()  # If you train only for prefill, "token" will be None, and vice versa
    # Define your labels (0.0 = safe, 1.0 = unsafe). They can be continuous.
    # Usually provided by a classifier or dataset annotations.
    prefill_label = torch.tensor([0.0])  # Example label
    token_labels = [torch.zeros(flushed["token"][0].shape[0])]
    # Stream to disk incrementally
    store.append(
        acts=flushed,
        labels={"prefill": prefill_label, "token": token_labels},  # Set "token" to None if you train only for prefill, and vice versa
    )
interceptor.detach()
Step 2: Train the Probes
The ProbesTrainer reads directly from the ActivationStore to train one logistic regression probe per layer, per mode.
from reprobe import ProbesTrainer
trainer = ProbesTrainer("Qwen/Qwen2.5-1.5B", hidden_dim=store.hidden_dim)
trainer.train_probes(
    store=store,
    concepts=["harmfulness"],  # Metadata for your registry
    training_mode="all",  # Trains prefill and token probes separately
    epochs=10,
    batch_size=256,
    show_tqdm=True,
)
trainer.save("outputs/probes/")  # Human-readable JSON + weights
# OR
trainer.save("outputs/probes/", filename="probes.pt", single_file=True)  # Everything in one compact file, useful for export. Not human-readable.
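For readers who want to see what a single logistic regression probe amounts to, here is a minimal pure-Python sketch on made-up "activations". It is only an illustration of the technique: train_probe, predict, the toy data, and the optimizer settings are all assumptions, not the internals of ProbesTrainer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, h):
    """Probability that activation vector h carries the concept."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def train_probe(acts, labels, epochs=200, lr=0.5):
    """Fit w, b for p(label=1 | h) = sigmoid(w.h + b) with full-batch gradient descent."""
    dim = len(acts[0])
    n = len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(acts, labels):
            err = predict(w, b, h) - y  # gradient of the log-loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * h[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: the concept is encoded along the first coordinate only.
rng = random.Random(0)
acts, labels = [], []
for _ in range(50):
    label = rng.choice([0.0, 1.0])
    center = 2.0 if label == 1.0 else -2.0
    acts.append([center + rng.gauss(0, 0.5), rng.gauss(0, 0.5), rng.gauss(0, 0.5)])
    labels.append(label)

w, b = train_probe(acts, labels)
accuracy = sum((predict(w, b, h) > 0.5) == (y == 1.0) for h, y in zip(acts, labels)) / len(acts)
print(f"train accuracy: {accuracy:.2f}, weights: {[round(x, 2) for x in w]}")
```

The learned weight vector w plays the role of the probe direction: its largest component lines up with the coordinate that encodes the concept.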
Step 3: Monitor & Steer
Load the trained probes using ProbeLoader. You can use a Monitor to get real-time concept probability scores, and a Steerer to actively suppress the concept.
from reprobe import ProbeLoader
steerer = ProbeLoader.steerer(
    model,
    "outputs/probes/registry.json",
    alpha={"prefill": 0.5, "token": 1.5},  # Different strengths per phase
    filter=lambda meta: meta["layer"] in range(12, 18),
)
monitor = ProbeLoader.monitor(
    model,
    "outputs/probes/registry.json",
    filter=lambda meta: meta["layer"] in range(12, 18),
)
steerer.attach()
monitor.attach()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=150)
# Get analytics
score = monitor.score() # Global float in [0, 1]
history = monitor.get_history() # [{layer: prob}, ...] per generated token
steerer.detach()
monitor.detach()
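The comments above describe get_history() as one {layer: prob} dict per generated token. As a rough illustration of how the three documented scoring strategies relate to such a history, the sketch below recomputes them by hand on made-up numbers. The aggregate helper is hypothetical, and reading "mean_of_means" as the mean of the per-token means is an assumption, not a confirmed detail of reprobe.

```python
from statistics import mean


def aggregate(history, strategy="max_of_means"):
    """Collapse a list of {layer: prob} dicts (one per token) into a single score."""
    per_token_means = [mean(step.values()) for step in history]
    if strategy == "max_of_means":
        return max(per_token_means)
    if strategy == "mean_of_means":
        return mean(per_token_means)
    if strategy == "max_absolute":
        return max(p for step in history for p in step.values())
    raise ValueError(f"unknown strategy: {strategy}")


# Made-up history: three generated tokens, two monitored layers.
history = [
    {12: 0.2, 13: 0.4},
    {12: 0.9, 13: 0.7},
    {12: 0.1, 13: 0.3},
]

for name in ("max_of_means", "mean_of_means", "max_absolute"):
    print(name, round(aggregate(history, name), 3))
```

On this toy history, max_of_means reacts to the single token where both layers agree, mean_of_means dilutes that spike over the whole generation, and max_absolute reports the single highest layer reading.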
Key Concepts & Parameters
Training & Capturing Modes
The mode parameter ("prefill", "token", or "all") is everywhere in reprobe.
- prefill: Operates only on the initial prompt processing pass.
- token: Operates only on the autoregressive generation pass (token by token).
- all: Captures/Trains both. Highly recommended, as separating these distributions yields much cleaner steering.
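One common way to tell the two phases apart inside a forward hook, when generation uses a KV cache, is by call order: the first forward pass of a generation processes the whole prompt, and every later pass processes a single new token. The sketch below illustrates that idea. PhaseTracker and should_run are hypothetical names, and whether reprobe detects phases this way is an assumption.

```python
class PhaseTracker:
    """Labels each forward pass of one generation as "prefill" or "token"."""

    def __init__(self):
        self._calls = 0

    def reset(self):
        # Call this before every new generation.
        self._calls = 0

    def next_phase(self):
        # Call once per forward pass; the first pass after a reset is the prefill.
        self._calls += 1
        return "prefill" if self._calls == 1 else "token"


def should_run(probe_mode, phase):
    """A probe with mode "all" runs in both phases; otherwise the phase must match."""
    return probe_mode == "all" or probe_mode == phase


tracker = PhaseTracker()
seq_lens = [7, 1, 1, 1]  # prompt of 7 tokens, then 3 generated tokens
phases = [tracker.next_phase() for _ in seq_lens]
print(phases)
print([should_run("prefill", p) for p in phases])
```

Counting calls rather than checking for a sequence length of 1 keeps the labelling correct even when the prompt itself is a single token.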
Steering Parameters (ProbeLoader.steerer)
| Parameter | What it does |
|---|---|
| alpha | The steering strength. Accepts a float (global), a dict[int, float] (per layer), a dict[str, float] (per mode, e.g., {"prefill": 0.7, "token": 1.2}), or a custom Callable[[dict], float] receiving probe metadata. Higher = more aggressive suppression, with a higher risk of degrading neutral outputs. |
| filter | Callable[[dict], bool]. Lets you select a subset of probes at load time without modifying saved files. Excellent for layer-ablation experiments. |
| steering_mode | "projected" (default) subtracts only the component of the residual stream along the probe direction. "uniform" subtracts the full direction vector. Projected is highly recommended as it preserves capabilities better. |
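The parameters above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not reprobe's implementation: resolve_alpha and steer are hypothetical helpers, the "mode" metadata key is assumed (only "layer" appears in the examples), and normalizing the probe direction to unit length is also an assumption.

```python
import math


def resolve_alpha(alpha, meta):
    """Return the steering strength for the probe described by `meta`."""
    if callable(alpha):
        return float(alpha(meta))
    if isinstance(alpha, dict):
        # Per-mode dicts use string keys, per-layer dicts use int keys.
        per_mode = all(isinstance(k, str) for k in alpha)
        return float(alpha[meta["mode"] if per_mode else meta["layer"]])
    return float(alpha)


def steer(h, direction, alpha, steering_mode="projected"):
    """Remove the probe direction from hidden state h (plain lists of floats)."""
    norm = math.sqrt(sum(x * x for x in direction))
    d = [x / norm for x in direction]  # assumed: unit-length probe direction
    if steering_mode == "projected":
        # Scale by how much of h actually lies along d.
        scale = alpha * sum(hi * di for hi, di in zip(h, d))
    elif steering_mode == "uniform":
        # Subtract the same vector regardless of h.
        scale = alpha
    else:
        raise ValueError(f"unknown steering_mode: {steering_mode}")
    return [hi - scale * di for hi, di in zip(h, d)]


meta = {"layer": 14, "mode": "token"}  # "mode" is an assumed metadata key
alpha = resolve_alpha({"prefill": 0.7, "token": 1.0}, meta)

direction = [1.0, 0.0, 0.0]
flagged = [3.0, 1.0, -2.0]  # has a large component along the concept direction
neutral = [0.0, 1.0, -2.0]  # has none

print(steer(flagged, direction, alpha, "projected"))
print(steer(neutral, direction, alpha, "projected"))
print(steer(flagged, direction, alpha, "uniform"))
print(steer(neutral, direction, alpha, "uniform"))
```

In this toy case projected steering leaves the neutral vector untouched, while uniform steering shifts it along the probe direction anyway, which matches the table's note that projected mode preserves capabilities better.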
Monitor Strategies (Monitor.score)
How to aggregate per-layer, per-token probabilities into a single score:
- "max_of_means" (default): Max over tokens of the mean across layers.
- "mean_of_means": Global average.
- "max_absolute": Single highest probability seen across any layer at any token step.
Architecture Support
Layer auto-detection works out-of-the-box for:
Llama, Qwen, Mistral, Phi-3, Gemma, GPT-2, BLOOM, GPT-NeoX, Pythia, and OPT.
For non-standard architectures, simply pass the path to the Transformer layers manually:
Interceptor(model, _layers_path="custom.transformer.blocks")
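A dotted path like the one above is just a chain of attribute lookups starting from the model. The sketch below shows that idea on a stand-in object; resolve_layers is a hypothetical helper, and how reprobe resolves the path internally is an assumption.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_layers(model, layers_path):
    """Follow a dotted attribute path, e.g. "custom.transformer.blocks", from `model`."""
    return reduce(getattr, layers_path.split("."), model)


# Stand-in for a model with a non-standard layout (no torch needed for the demo).
toy_model = SimpleNamespace(
    custom=SimpleNamespace(
        transformer=SimpleNamespace(blocks=["block0", "block1"])
    )
)

layers = resolve_layers(toy_model, "custom.transformer.blocks")
print(len(layers), layers)
```

To find the right path for a real model, printing the model or iterating over model.named_modules() shows the attribute names; for a torch.nn.Module, model.get_submodule(path) performs the same kind of lookup.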
Contributing & Source
If you want to contribute, run tests, or build from source:
git clone https://github.com/levashi/reprobe
cd reprobe
pip install -e ".[dev]"
pytest
Roadmap
reprobe is actively developed. Here’s what’s coming next:
- Extend model support: Extend support to all encoder-only models for classification probing.
- Unsupervised Reading (PCA/LAT): Implement Linear Artificial Tomography to extract concepts without explicit labels using contrastive pairs (as seen in the original RepE paper).
- Visualization Suite: Built-in tools to generate layer-wise heatmaps and activation density plots to "see" the concepts.
- Precision Control: Support for KL-divergence monitoring to ensure steering doesn't degrade the model's base capabilities (perplexity tracking).
Author
reprobe is my first open-source library. I built it because I’m passionate about AI safety and I wanted to make activation steering more accessible for everyone. I spent months on it, so I hope it helps you :)
Since I’m still learning, please feel free to open an issue or a PR if you find a bug or have an idea to improve the code. All feedback is welcome!
mechanistic-interpretability activation-steering representation-engineering linear-probes llm-safety transformers pytorch ai-safety
Download files
File details
Details for the file reprobe-0.1.0.tar.gz.
File metadata
- Download URL: reprobe-0.1.0.tar.gz
- Upload date:
- Size: 26.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 291c6e61b82add261a6107ab350fe25f822e6904d34c58d3eeff8201879d09af |
| MD5 | 082aada72aed8ea2859a15ff160e73bf |
| BLAKE2b-256 | 505446604fcedb51e3a2e16e446fdc2b37e44aec1d271c6833c1ff8fbad6dcea |
File details
Details for the file reprobe-0.1.0-py3-none-any.whl.
File metadata
- Download URL: reprobe-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6867bfd8748426e6ca4f4c56acb9be20145020ec204716ae3b1d1a61e3bc102 |
| MD5 | 71a5f8298dda48bea006c5c6558a13db |
| BLAKE2b-256 | 9101121114e7015a44db8bdf8b3dfb4a76bf2a561f7bcfb0728538d826d520e5 |