Glassbox 2.0
Open-source mechanistic interpretability for transformer models.
Glassbox 2.0 identifies the attention heads responsible for a model's prediction, quantifies their causal contribution, and tells you exactly why a transformer made the choice it did — in one function call.
Built on attribution patching, which scores every head with a constant number of model passes rather than per-edge ablations. Benchmarked against ACDC. Grounded in peer-reviewed mechanistic interpretability research.
Highlights
- Attribution patching — identifies circuits with a constant number of forward/backward passes, not exhaustive edge enumeration
- 37x faster than ACDC on GPT-2 small (1.2s vs 43.2s)
- Bootstrap 95% CI — every faithfulness score from `bootstrap_metrics()` ships with confidence intervals, not point estimates
- FCAS cross-model alignment — quantifies how similar circuits are across model sizes (GPT-2 family: 0.783-0.835)
- Interactive dashboard — Streamlit UI on HuggingFace Spaces, no setup required
Quickstart
```shell
pip install glassbox-mech-interp
```

```python
from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb = GlassboxV2(model)

result = gb.analyze(
    prompt="When Mary and John went to the store, John gave a drink to",
    correct="Mary",
    incorrect="John",
)

print(result["faithfulness"])
# {
#   "sufficiency": 0.80,        <- Taylor approximation (see note below)
#   "comprehensiveness": 0.37,  <- exact causal value
#   "f1": 0.49,
#   "category": "moderate",
#   "suff_is_approx": True
# }

# Circuit is a list of (layer, head) tuples, sorted by attribution score
print(result["circuit"])
# [(9, 9), (8, 10), (7, 3), ...]

# To see attribution scores for each head:
attrs = result["attributions"]
for (layer, head) in result["circuit"]:
    score = attrs.get(str((layer, head)), 0.0)
    print(f"L{layer:02d}H{head:02d} -> {score:.4f}")

# For confidence intervals, use bootstrap_metrics() instead:
boot = gb.bootstrap_metrics(prompts=[
    ("When Mary and John went to the store, John gave a drink to", "Mary", "John"),
    ("When Alice and Bob entered the room, Bob handed the key to", "Alice", "Bob"),
    # ... add more prompts for reliable CIs (recommended n >= 20)
], n_boot=500)
print(boot["sufficiency"])
# {"mean": 0.82, "std": 0.06, "ci_lo": 0.71, "ci_hi": 0.91, "n": 2}
```
Note on Sufficiency: The `sufficiency` value in `analyze()` is a first-order Taylor approximation (Nanda et al. 2023), not the exact causal value. This is why the benchmark table below shows ~80% while the MSc thesis paper reports ~100% — the paper used exact Wang et al. (2022) sufficiency, computed by ablating non-circuit heads. Both are valid measures; they differ by methodology. The `suff_is_approx: True` flag in the output makes this explicit.
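To make the distinction concrete, here is a minimal sketch (illustrative only, not Glassbox's internal code) of what a first-order sufficiency estimate looks like: under a linear approximation, each head's contribution to the logit difference is its attribution score, so the circuit's approximate sufficiency is the fraction of the clean logit difference its heads account for. The attribution values below are made up.

```python
def approx_sufficiency(attributions, circuit, clean_logit_diff):
    """First-order estimate: summed circuit attributions / clean logit diff."""
    recovered = sum(attributions[str(head)] for head in circuit)
    return recovered / clean_logit_diff

# Toy, hypothetical numbers:
attrs = {"(9, 9)": 1.2, "(8, 10)": 0.9, "(7, 3)": 0.3}
circuit = [(9, 9), (8, 10), (7, 3)]
print(round(approx_sufficiency(attrs, circuit, clean_logit_diff=3.0), 3))  # 0.8
```

The exact measure instead reruns the model with all non-circuit heads ablated and reads off the recovered logit difference, which is why it costs extra forward passes.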
Try it instantly — no install needed: huggingface.co/spaces/designer-coderajay/Glassbox-ai
Benchmarks
Evaluated on the IOI (Indirect Object Identification) task across the GPT-2 family.
| Model | Layers | Heads | Sufficiency* | Comprehensiveness | F1 | Glassbox time | ACDC time | Speedup |
|---|---|---|---|---|---|---|---|---|
| GPT-2 small | 12 | 12 | 80.0% | 37.2% | 48.8% | 1.2s | 43.2s | 37x |
| GPT-2 medium | 24 | 16 | 35.1% | 23.7% | 27.9% | 4.9s | 115.2s | 24x |
| GPT-2 large | 36 | 20 | 18.2% | 14.2% | 15.9% | 14.3s | 216.0s | 15x |
*Sufficiency values are first-order Taylor approximations. Exact causal sufficiency (requiring full ablation runs) is higher — see the arXiv paper for exact values.
Cross-model circuit alignment (FCAS):
| Pair | FCAS |
|---|---|
| GPT-2 small <-> GPT-2 medium | 0.835 |
| GPT-2 small <-> GPT-2 large | 0.783 |
| GPT-2 medium <-> GPT-2 large | 0.833 |
High FCAS scores confirm the IOI circuit is structurally conserved across model scale — consistent with Wang et al. (2022).
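The exact FCAS computation lives in `functional_circuit_alignment()`; as a rough intuition for what an alignment-against-null score measures, here is a hypothetical toy version (the function and scoring below are this sketch's own, not Glassbox's) that compares top-head overlap between two circuits against a random-circuit null distribution:

```python
import random

def overlap(a, b):
    """Jaccard index: fraction of heads shared between two circuits."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def alignment_z(heads_a, heads_b, all_heads, n_null=1000, seed=0):
    """z-score of observed overlap against overlaps of random head sets."""
    rng = random.Random(seed)
    obs = overlap(heads_a, heads_b)
    null = [overlap(rng.sample(all_heads, len(heads_a)),
                    rng.sample(all_heads, len(heads_b)))
            for _ in range(n_null)]
    mu = sum(null) / n_null
    sd = (sum((x - mu) ** 2 for x in null) / n_null) ** 0.5
    return (obs - mu) / sd

# Two hypothetical circuits sharing their top two heads, in a 12x12 grid:
all_heads = [(l, h) for l in range(12) for h in range(12)]
a = [(9, 9), (8, 10), (7, 3), (5, 5)]
b = [(9, 9), (8, 10), (7, 9), (10, 0)]
print(alignment_z(a, b, all_heads))  # large positive z: overlap beats chance
```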
How It Works
Glassbox runs attribution patching with name-swap corruption, matching the methodology of Wang et al. (2022).
```
Clean prompt     -> model -> logit(Mary)
Corrupted prompt -> model -> logit(John)

For each attention head:
    Patch clean activation -> corrupted run
    Measure delta_logit(Mary - John)
    Normalize -> attribution score
```
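The patching loop above can be sketched on a toy model (not Glassbox's implementation; for a linear model the first-order estimate is exact, which keeps the example verifiable). The key identity is that patching head `h`'s clean activation into the corrupted run shifts the logit difference by roughly `(a_clean - a_corr) . grad`, so one gradient computation scores every head at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d = 6, 4
readout = rng.normal(size=(n_heads, d))  # maps head outputs to the logit diff

def logit_diff(acts):
    """Toy 'model': logit(correct) - logit(incorrect) from head activations."""
    return float((acts * readout).sum())

clean = rng.normal(size=(n_heads, d))  # activations on the clean prompt
corr = rng.normal(size=(n_heads, d))   # activations on the corrupted prompt

# d(logit_diff)/d(activation) is just `readout` for this toy model, so the
# attribution-patching estimate per head is a single elementwise product:
attributions = ((clean - corr) * readout).sum(axis=1)

# Sanity check: actually patching head 0's clean activation into the
# corrupted run shifts logit_diff by exactly attributions[0] here.
patched = corr.copy()
patched[0] = clean[0]
assert np.isclose(logit_diff(patched) - logit_diff(corr), attributions[0])
print(np.argsort(-np.abs(attributions)))  # heads ranked by |attribution|
```

In a real transformer the gradient comes from one backward pass and the estimate is only first-order accurate, which is exactly the approximation flagged by `suff_is_approx`.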
Faithfulness metrics follow the ERASER framework:
- Sufficiency — does the circuit alone recover the clean prediction? (Taylor approximation in `analyze()`, exact in the paper)
- Comprehensiveness — how much does ablating the circuit hurt? (exact causal measurement)
- F1 — harmonic mean of both
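The F1 combination is the standard harmonic mean, which rewards circuits that score well on both axes and penalizes trading one off against the other. A minimal sketch:

```python
def faithfulness_f1(sufficiency, comprehensiveness):
    """Harmonic mean of sufficiency and comprehensiveness."""
    if sufficiency + comprehensiveness == 0:
        return 0.0
    return 2 * sufficiency * comprehensiveness / (sufficiency + comprehensiveness)

print(faithfulness_f1(0.5, 0.5))  # 0.5
print(faithfulness_f1(1.0, 0.0))  # 0.0 -- one weak axis dominates
```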
Confidence intervals are available via `bootstrap_metrics()` with `n_boot` resamples.
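For readers unfamiliar with the technique, here is a generic percentile-bootstrap sketch (not Glassbox's internal code; the per-prompt scores are invented): resample the scores with replacement `n_boot` times and read the interval off the empirical quantiles of the resampled means.

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on the mean of a list of scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return {"mean": statistics.mean(scores), "ci_lo": lo, "ci_hi": hi}

# Hypothetical per-prompt sufficiency scores:
per_prompt_sufficiency = [0.81, 0.77, 0.86, 0.79, 0.84, 0.73, 0.88, 0.80]
print(bootstrap_ci(per_prompt_sufficiency))
```

With only a handful of prompts the interval is wide and unstable, which is why the quickstart recommends n >= 20 prompts.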
Installation

```shell
# From PyPI
pip install glassbox-mech-interp

# From source
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
pip install -e .
```
Requirements: Python >= 3.8, PyTorch >= 2.0, TransformerLens >= 1.0
Run the Dashboard Locally

```shell
pip install glassbox-mech-interp streamlit plotly
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
streamlit run dashboard/app.py
```
Or use the hosted version at huggingface.co/spaces/designer-coderajay/Glassbox-ai.
API Reference
`GlassboxV2(model)`

```python
from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb = GlassboxV2(model)
```
| Method | Description |
|---|---|
| `gb.analyze(prompt, correct, incorrect)` | Full circuit analysis. Returns dict with `circuit`, `attributions`, `faithfulness`, `corr_prompt`. |
| `gb.attribution_patching(clean_tokens, corr_tokens, target_id, distractor_id)` | Raw per-head attribution scores. Expects tokenized tensors and integer token IDs. |
| `gb.bootstrap_metrics(prompts, n_boot, alpha)` | Bootstrap 95% CI on Suff/Comp/F1. Pass a list of (prompt, correct, incorrect) tuples. |
| `gb.functional_circuit_alignment(heads_a, heads_b, top_k, n_null)` | Cross-model FCAS score with null distribution and z-score. |
`analyze()` return structure:

```python
{
    "circuit": [(9, 9), (8, 10), ...],       # list of (layer, head) tuples
    "n_heads": int,
    "clean_ld": float,                       # logit(correct) - logit(incorrect)
    "corr_prompt": str,                      # name-swapped corrupted prompt
    "attributions": {"(9, 9)": 0.174, ...},  # string keys, float values
    "faithfulness": {
        "sufficiency": float,                # Taylor approximation
        "comprehensiveness": float,          # exact causal value
        "f1": float,
        "category": str,                     # one of: faithful, backup_mechanisms,
                                             # moderate, incomplete, weak
        "suff_is_approx": True,
    },
}
```
Citation
If you use Glassbox 2.0 in your research, please cite:
```bibtex
@software{mahale2025glassbox,
  author    = {Mahale, Ajay Pravin},
  title     = {Glassbox 2.0: Causally Grounded Mechanistic Interpretability for Transformer Models},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}
```
Related Work
- Wang et al. (2022) — IOI circuit discovery in GPT-2
- TransformerLens — mechanistic interpretability library this builds on
- ACDC — automatic circuit discovery (baseline we benchmark against)
- ERASER — faithfulness evaluation framework
License
MIT. See LICENSE.