reward-lens
Mechanistic interpretability toolkit for reward models.
The first comprehensive open-source library for understanding what happens inside the models that define the RLHF training signal. Reward-lens is to reward model interpretability what TransformerLens is to generative model interpretability — the foundation that makes the work possible.
Known Limitations (Read Before Using)
Attribution ≠ Causal Importance
The most important finding from our own validation experiments: component attribution does NOT reliably predict causal importance. Spearman ρ between attribution and patch effects was -0.256 (Skywork) and -0.027 (ArmoRM) — negative to zero, never positive.
In practice: use the Reward Lens and attribution for exploration, then validate any important claim with activation patching. We state this limitation up front rather than overselling what the observational tools can show.
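The check described above can be reproduced for any pair of per-component score lists. The sketch below computes Spearman ρ in plain Python; the two score lists are made-up illustrative numbers, not results from the validation experiments, and the helper names `ranks` and `spearman` are hypothetical rather than part of the reward-lens API.

```python
def ranks(values):
    """Return 1-based average ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-component scores for one preference pair (made-up numbers).
attribution = [0.90, 0.40, 0.10, -0.20, 0.05]
patch_effect = [0.02, 0.30, 0.50, 0.10, 0.25]

rho = spearman(attribution, patch_effect)
print(f"Spearman rho = {rho:.3f}")
```

A ρ near zero or below means the attribution ranking tells you little about which components matter causally, which is the situation reported for both models above.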
Why This Exists
Every RLHF-trained language model was shaped by a reward model. The reward model is the mathematical object that encodes "what we want." It is the most safety-critical component in the alignment pipeline — and as of early 2026, it has received approximately 0.5% of the interpretability community's attention.
This is not because reward models are hard to study. They may actually be easier than generative models:
- Scalar output — attribution targets a single number, not a 50K-token distribution
- Built-in contrastive structure — preference pairs give natural controlled comparisons
- Known "answer direction" — the reward head weight vector defines exactly what the model is optimizing
Reward-lens provides the tools to exploit these structural advantages.
Architectural Decisions
Why not TransformerLens?
TransformerLens was built for generative models. Its core abstractions — the logit lens, direct logit attribution, the unembedding matrix — all assume the model outputs a distribution over vocabulary tokens. Reward models replace the unembedding with a scalar head, which breaks every one of these tools.
Rather than fighting TransformerLens's abstractions, reward-lens builds purpose-built primitives directly on HuggingFace transformers models using lightweight PyTorch hooks. This means:
- Any HuggingFace reward model works out of the box: AutoModelForSequenceClassification, custom reward heads, multi-objective models
- No model zoo dependency: if HuggingFace can load it, reward-lens can analyze it
- The hook system is minimal and auditable — ~200 lines, not thousands
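To illustrate what "lightweight PyTorch hooks" means in general, here is a minimal sketch of activation caching with `register_forward_hook` on a toy module. The `ActivationCache` class is hypothetical and is not the reward-lens hook system. It also assumes each hooked module returns a single tensor, whereas HuggingFace decoder layers typically return tuples and would need unpacking.

```python
import torch
from torch import nn


class ActivationCache:
    """Caches the output of selected submodules during a forward pass."""

    def __init__(self, model, module_names):
        self.activations = {}
        self._handles = []
        for name, module in model.named_modules():
            if name in module_names:
                handle = module.register_forward_hook(self._make_hook(name))
                self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach so the cache does not keep the autograd graph alive.
            self.activations[name] = output.detach()
        return hook

    def remove(self):
        """Remove every hook so the model behaves as before."""
        for handle in self._handles:
            handle.remove()
        self._handles.clear()


# Toy stand-in for a reward model: one hidden layer, a ReLU, and a scalar head.
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
cache = ActivationCache(toy, module_names={"0", "1"})

with torch.no_grad():
    reward = toy(torch.randn(2, 8))

cache.remove()
print({name: tuple(act.shape) for name, act in cache.activations.items()})
```

Because hooks attach to ordinary `nn.Module` objects, the same pattern applies to any model that PyTorch can run, with no model-specific reimplementation.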
Why not nnsight?
nnsight is a powerful general-purpose intervention library. But reward model interpretability needs domain-specific primitives — reward lens plots, differential reward attribution, preference circuit identification — that would be clumsy to build on top of a generic framework. We build these as first-class citizens.
The Core Insight
The reward head is a linear projection: r(x,y) = w_r^T @ h_final + b. The weight vector w_r defines the reward direction in activation space. Every tool in this library is, at its core, a projection onto or decomposition along this direction:
- Reward Lens: project each layer's residual stream onto w_r to see when preference forms
- Component Attribution: decompose h_final into per-head, per-MLP contributions and project each onto w_r
- Feature Attribution: decompose through SAE features and measure each feature's alignment with w_r
- Activation Patching: swap components between preferred/dispreferred and measure reward change
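The projection idea can be worked through on a toy residual stream. The sketch below uses made-up three-dimensional vectors and hypothetical component names. It also assumes the residual stream feeds the reward head directly, ignoring the final normalization layer that real transformer models apply before the head.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


# Reward head: r = w_r . h_final + b (toy 3-dimensional residual stream).
w_r = [0.5, -1.0, 2.0]
b = 0.1

# Hypothetical component outputs written into the residual stream, in order.
components = {
    "embed": [0.2, 0.1, 0.0],
    "L0.attn": [0.0, -0.3, 0.1],
    "L0.mlp": [0.4, 0.0, 0.2],
    "L1.attn": [-0.1, 0.2, 0.3],
    "L1.mlp": [0.0, -0.1, 0.5],
}

# Reward lens: project the running residual stream onto w_r after each write.
residual = [0.0, 0.0, 0.0]
lens_trace = []
for name, out in components.items():
    residual = add(residual, out)
    lens_trace.append((name, dot(w_r, residual) + b))

# Component attribution: project each component's own output onto w_r.
attribution = {name: dot(w_r, out) for name, out in components.items()}

h_final = residual
reward = dot(w_r, h_final) + b
# Linearity: the per-component contributions plus the bias reconstruct the reward.
assert abs(reward - (sum(attribution.values()) + b)) < 1e-9

for name, value in lens_trace:
    print(f"{name:8s} lens reward = {value:+.2f}  attribution = {attribution[name]:+.2f}")
```

The lens trace is a running sum of the attributions, which is why the two tools agree on the final reward while answering different questions: when the reward forms versus which component wrote it.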
Installation
pip install -e .
For SAE training support:
pip install -e ".[sae]"
For development:
pip install -e ".[all]"
Quick Start
5-Line Reward Lens
from reward_lens import RewardModel, reward_lens_plot
model = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompt = "Explain quantum computing."
good = "Quantum computing uses qubits that can exist in superposition..."
bad = "Quantum computing is when computers are really fast..."
reward_lens_plot(model, prompt, good, bad, save_path="reward_lens.png")
Full Analysis Pipeline
from reward_lens import RewardModel
from reward_lens.lens import RewardLens
from reward_lens.attribution import ComponentAttribution
from reward_lens.patching import ActivationPatcher
# Load model
rm = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
# Define preference pair
prompt = "What is 2+2?"
preferred = "2+2 equals 4."
dispreferred = "2+2 equals 5."
# 1. Reward Lens — when does preference form?
lens = RewardLens(rm)
result = lens.trace(prompt, preferred, dispreferred)
result.plot() # Layer-by-layer preference formation
print(f"Preference crystallizes at layer {result.crystallization_layer}")
# 2. Component Attribution — which heads/MLPs drive the preference?
attrib = ComponentAttribution(rm)
components = attrib.attribute(prompt, preferred, dispreferred)
components.plot_top_k(k=15) # Top 15 components by reward contribution
# 3. Activation Patching — which components are causally necessary?
patcher = ActivationPatcher(rm)
effects = patcher.patch_all_components(prompt, preferred, dispreferred)
effects.plot() # Heatmap of patch effects
Reward Hacking Detection
from reward_lens import RewardModel
from reward_lens.hacking import HackingDetector
rm = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
detector = HackingDetector(rm)
# Test for known failure modes
report = detector.scan(
    prompt="Explain relativity.",
    response="Einstein's theory of relativity...",
    tests=["length", "confidence", "formatting", "sycophancy"],
)
report.print_summary()
# Length bias: +0.34 reward per 100 tokens (SIGNIFICANT)
# Confidence bias: +0.12 for authoritative vs hedged (moderate)
# Formatting bias: +0.08 for markdown vs plain (low)
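A figure such as "reward per 100 tokens" can be estimated with a simple regression of reward on response length. The sketch below is one way to do that and is not necessarily how `HackingDetector` computes it: `toy_score` is a hypothetical stand-in for a reward model, and whitespace-separated words are used as a rough proxy for tokens.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def toy_score(prompt, response):
    """Hypothetical stand-in for a reward model with a built-in length bias."""
    return 1.0 + 0.003 * len(response.split())


prompt = "Explain relativity."
base = "Einstein's theory of relativity links space and time."
filler = " To elaborate further on this point without adding anything new."

# Pad the same answer with content-free filler so only length changes.
responses = [base + filler * n for n in range(5)]
lengths = [len(r.split()) for r in responses]
rewards = [toy_score(prompt, r) for r in responses]

slope_per_100 = 100 * least_squares_slope(lengths, rewards)
print(f"Length bias: {slope_per_100:+.2f} reward per 100 words")
```

Holding content fixed while varying only length is what makes the slope interpretable as a bias rather than a legitimate preference for more informative answers.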
Predictive Hacking Analysis (v0.2.0)
from reward_lens import RewardModel, DistortionAnalyzer
from reward_lens.diagnostic_data import get_diagnostic_pairs
rm = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
# Predict which quality dimensions are under-covered (will be hacked)
analyzer = DistortionAnalyzer(rm)
report = analyzer.compute_distortion_index(
    quality_dimensions=["helpfulness", "safety", "honesty"],
    evaluation_probes={
        "helpfulness": get_diagnostic_pairs(["helpfulness"]),
        "safety": get_diagnostic_pairs(["safety"]),
        "honesty": [],  # No probes - will be flagged as under-covered!
    },
)
report.print_summary()
# Shows "honesty" has high distortion index (likely to be hacked)
Misalignment Cascade Detection (v0.2.0)
from reward_lens import RewardModel, MisalignmentCascadeDetector
rm = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
detector = MisalignmentCascadeDetector(rm)
report = detector.detect_cascade() # Tests multiple misalignment dimensions
report.print_summary()
# Shows if failures are correlated (systemic vulnerability)
Concept Vector Analysis (v0.2.0)
from reward_lens import RewardModel, quick_concept_analysis
rm = RewardModel.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
report = quick_concept_analysis(rm)
report.print_summary()
# Shows which concepts (confidence, verbosity, sycophancy)
# align with reward and may be hackable
Core Modules
reward_lens.model — Reward Model Wrapper
Wraps any HuggingFace reward model with hooks for activation caching and intervention. Handles the architectural differences between single-scalar models (Skywork, Starling) and multi-objective models (ArmoRM, Nemotron).
reward_lens.lens — Reward Lens
The core primitive. Projects intermediate residual stream states onto the reward direction to trace preference formation across layers. The reward model analogue of the logit lens.
reward_lens.attribution — Component Attribution
Decomposes the reward score into signed per-component contributions (each attention head and MLP layer). Answers: "why did the model assign this score?"
reward_lens.patching — Activation Patching
Causal intervention tool. Swaps component activations between preferred and dispreferred completions to identify causally necessary components for each preference dimension.
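The mechanics of patching, and why it can disagree with attribution, can be seen in a toy model with two components where the second reads the first one's output. Everything in this sketch is hypothetical (the model, its weights, and the inputs) and it does not use the `ActivationPatcher` API.

```python
def relu(x):
    return max(0.0, x)


def toy_reward(x, patch=None):
    """Two-component toy model; `patch` overrides named component outputs."""
    patch = patch or {}
    acts = {}
    # Component A reads the input directly.
    acts["A"] = patch.get("A", 2.0 * x)
    # Component B reads A's output, so patching A also changes B.
    acts["B"] = patch.get("B", relu(acts["A"] - 1.0))
    reward = 0.1 * acts["A"] + 1.0 * acts["B"]
    return reward, acts


# Clean run (preferred input) and corrupted run (dispreferred input).
clean_reward, clean_acts = toy_reward(1.5)
corrupt_reward, corrupt_acts = toy_reward(0.2)

weights = {"A": 0.1, "B": 1.0}
patch_effects = {}
direct_attribution = {}
for name in ("A", "B"):
    # Patch one clean activation into the corrupted run and re-run the model.
    patched_reward, _ = toy_reward(0.2, patch={name: clean_acts[name]})
    patch_effects[name] = patched_reward - corrupt_reward
    # Direct attribution only sees the component's own write to the reward.
    direct_attribution[name] = weights[name] * (clean_acts[name] - corrupt_acts[name])

print("patch effects:     ", patch_effects)
print("direct attribution:", direct_attribution)
```

Here direct attribution ranks B above A, while patching ranks A above B, because A's effect flows mostly through B. Indirect paths like this are one reason an attribution ranking need not match a causal one.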
reward_lens.hacking — Reward Hacking Detection
Automated detection of hackable features in reward models. Tests for length bias, confidence bias, formatting bias, sycophancy, and more. Produces vulnerability reports.
reward_lens.sae — Sparse Autoencoder Integration
Train and apply SAEs to reward model activations. Decompose reward into interpretable feature-level contributions. Identify features aligned with the reward direction.
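Feature-level decomposition follows from the same linearity as component attribution: if an activation is approximately a weighted sum of SAE decoder directions, its reward projection splits into one term per feature plus a reconstruction-error term. The sketch below uses made-up vectors and hypothetical feature names, and omits the decoder bias for brevity.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


w_r = [0.5, -1.0, 2.0]  # reward direction (toy values)
h = [0.5, -0.1, 1.1]    # activation being explained (toy values)

# Hypothetical SAE decoder directions and the feature activations for `h`.
decoder = {
    "feat_correctness": [0.0, 0.0, 1.0],
    "feat_verbosity": [1.0, 0.0, 0.0],
    "feat_hedging": [0.0, 1.0, 0.0],
}
feature_acts = {"feat_correctness": 1.0, "feat_verbosity": 0.4, "feat_hedging": 0.0}

# Reconstruction: h_hat = sum_i f_i * d_i (decoder bias omitted for brevity).
h_hat = [0.0, 0.0, 0.0]
for name, direction in decoder.items():
    h_hat = [a + feature_acts[name] * d for a, d in zip(h_hat, direction)]

# Per-feature reward contribution: f_i * (w_r . d_i).
contributions = {
    name: feature_acts[name] * dot(w_r, direction)
    for name, direction in decoder.items()
}
# Whatever the SAE fails to reconstruct shows up as an unexplained residual.
residual_term = dot(w_r, [a - b for a, b in zip(h, h_hat)])

total = dot(w_r, h)
assert abs(total - (sum(contributions.values()) + residual_term)) < 1e-9
print(contributions, f"unexplained = {residual_term:+.2f}")
```

Reporting the residual term alongside the feature contributions keeps the decomposition honest about how much of the reward the SAE does not explain.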
reward_lens.diagnostic_data — Diagnostic Datasets
Curated preference pairs for controlled experiments across preference dimensions: helpfulness, safety, verbosity, sycophancy, formatting, confidence.
New Modules (v0.2.0)
Each v0.2.0 module builds on recent interpretability research (2025-2026):
reward_lens.distortion — Distortion Index
Predicts which quality dimensions are under-covered by evaluation and thus likely to be hacked. Based on "Reward Hacking as Equilibrium under Finite Evaluation" — moves from detecting hacking to predicting it.
reward_lens.divergence_patching — Divergence-Aware Patching
Extends activation patching with out-of-distribution detection. Flags when interventions create divergent representations that may make causal claims unreliable. Based on "Addressing Divergent Representations from Causal Interventions."
reward_lens.cascade — Misalignment Cascade Detection
Tests for correlations between different misalignment dimensions. Based on "Natural Emergent Misalignment from Reward Hacking" — reward hacking onset correlates with broad emergent misalignment.
reward_lens.conflict — Reward Conflict Analysis
Classifies relationships between reward terms as aligned/orthogonal/in-conflict. In-conflict terms may cause models to hide reasoning. Based on "When Can We Safely Optimize CoT?"
reward_lens.concepts — Concept Vector Extraction
Extracts linear concept vectors from activations and analyzes their reward alignment. Identifies concepts that may enable hacking (e.g., confidence, verbosity, sycophancy). Based on "Emotion Concepts and their Function in an LLM."
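One common way to obtain a linear concept vector is the difference of mean activations between examples with and without the concept; its cosine similarity with the reward direction then indicates how strongly the concept is rewarded. The sketch below assumes that method, which may differ from what `reward_lens.concepts` implements, and all activations are made-up toy values.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical activations for confident vs. hedged responses (toy values).
confident = [[1.0, 0.2, 0.9], [0.8, 0.0, 1.1]]
hedged = [[0.2, 0.1, 0.1], [0.0, 0.3, -0.1]]

# Difference-of-means concept vector for "confidence".
concept = [a - b for a, b in zip(mean_vector(confident), mean_vector(hedged))]

w_r = [0.5, -1.0, 2.0]  # reward direction (toy values)
alignment = cosine(concept, w_r)
print(f"cosine(confidence, reward direction) = {alignment:.3f}")
```

A high cosine for a concept that is unrelated to answer quality is a hint that a policy could raise its reward by expressing that concept, so it is worth following up with a causal test.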
Supported Models
| Model | Architecture | Type | Status |
|---|---|---|---|
| Skywork-Reward-Llama-3.1-8B-v0.2 | Llama 3.1 + classification head | Single scalar | ✅ Full support |
| ArmoRM-Llama3-8B-v0.1 | Llama 3 + multi-objective head + MoE gating | Multi-objective | ✅ Full support |
| Nemotron-4-340B-Reward | Nemotron + 5-dim linear head | Multi-dimensional | ⚠️ Requires multi-GPU |
| FsfairX-LLaMA3-RM-v0.1 | Llama 3 + classification head | Single scalar | ✅ Full support |
| Any AutoModelForSequenceClassification | Varies | Single scalar | ✅ Auto-detected |
Adding new models: Any model loadable via AutoModelForSequenceClassification with a linear reward head works automatically. Models with custom architectures (like ArmoRM's MoE gating) need a thin adapter — see reward_lens/model_adapters/.
What This Toolkit Can and Cannot Do
Can Do
- Trace preference formation across layers for any HuggingFace reward model
- Decompose reward scores into per-component (head/MLP) signed contributions
- Identify causally necessary components via activation patching
- Detect reward hacking vulnerabilities (length, confidence, formatting, sycophancy)
- Train SAEs on reward model activations and decompose reward through features
- Compare preference circuits across different reward models
Cannot Do (Honestly)
- Process reward models (PRMs) are partially supported — per-step analysis works, but step-boundary detection and accumulated quality tracking are not yet implemented
- Proprietary models — this toolkit requires access to model weights. API-only models cannot be analyzed
- Causal claims from correlational tools — the reward lens and component attribution are observational. Only activation patching provides causal evidence. We are explicit about this distinction in the API
- Guaranteed completeness — mechanistic interpretability never guarantees you've found everything. The toolkit helps you find what's there, but absence of evidence is not evidence of absence
Compute Requirements
All analyses run in inference mode. No training of the reward model is required.
| Analysis | Cost (8B model) | Hardware | Approx. time |
|---|---|---|---|
| Reward Lens (single pair) | ~2 forward passes | 1× GPU (16GB+) | ~5 seconds |
| Component Attribution (single pair) | ~2 forward passes | 1× GPU (16GB+) | ~10 seconds |
| Activation Patching (all components) | ~n_components × 2 forward passes | 1× GPU (24GB+) | ~30 minutes |
| SAE Training (single layer) | Activation collection + training | 1× GPU (24GB+) | ~8-24 hours |
| Full Hacking Scan | ~50 paired forward passes | 1× GPU (16GB+) | ~5 minutes |
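As a rough sanity check on the activation patching row, the arithmetic below counts components for a Llama-3.1-8B backbone (32 decoder layers with 32 attention heads each). It assumes one component per attention head plus one per MLP block, matching the granularity described for component attribution; the toolkit's actual component count may differ.

```python
# Llama-3.1-8B shape: 32 decoder layers, 32 attention heads per layer.
n_layers = 32
n_heads = 32

# One component per attention head plus one per MLP block (assumed granularity).
n_components = n_layers * n_heads + n_layers
forward_passes = 2 * n_components

# Implied per-pass budget if the full sweep takes about 30 minutes.
seconds_per_pass = 30 * 60 / forward_passes

print(n_components, forward_passes, round(seconds_per_pass, 2))
```

Under these assumptions the sweep needs a little over two thousand forward passes, so the quoted ~30 minutes corresponds to somewhat under one second per pass.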
Citation
@software{nadaf2026rewardlens,
title = {reward-lens: Mechanistic Interpretability Toolkit for Reward Models},
author = {Nadaf, Mohammed Suhail B},
year = {2026},
url = {https://github.com/suhailnadaf509/reward-lens},
}
License
MIT