Glassbox 2.0 — Open-source mechanistic interpretability for transformer models
Glassbox 2.0 - Mechanistic Interpretability
Open-source transformer circuit analysis: attribution patching in a constant three passes (written O(3) throughout), automatic circuit discovery, cross-model alignment scoring, and bootstrap confidence intervals.
Why Glassbox 2.0?
| Method | Passes Required | Comprehensiveness | Circuit Discovery | Cross-Model Alignment |
|---|---|---|---|---|
| Full Activation Patching | O(2N) (288 on GPT-2 Small) | Causal | Manual | None |
| TransformerLens (mean ablation) | O(2N) | Approximate | Manual | None |
| ACDC (Conmy et al., 2023) | O(2N) | Causal | Auto | None |
| Glassbox 2.0 | O(3) | Corrupted patching | MFC auto-discovery | FCAS = 0.929 |
96x fewer passes than full activation patching on GPT-2 Small (144 heads x 2 = 288 passes vs. 3).
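The 96x figure is a ratio of pass counts, which can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above plus the standard GPT-2 Small shape (12 layers of 12 heads). It says nothing about wall-clock time: the third pass also runs a backward pass, so the measured speedup may differ.

```python
# Pass-count comparison for GPT-2 Small, using the figures quoted in this README.
n_layers = 12                    # GPT-2 Small has 12 transformer layers
heads_per_layer = 12             # with 12 attention heads each
n_heads = n_layers * heads_per_layer

full_patching_passes = 2 * n_heads   # the 2N figure for per-head activation patching
attribution_passes = 3               # clean, corrupted, and gradient pass

ratio = full_patching_passes / attribution_passes
print(n_heads, full_patching_passes, ratio)
```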
Novel Contributions
- O(3) Attribution Patching: attr(h) = grad_z * (z_clean - z_corrupt). Three forward passes regardless of model size. Validated against Wang et al. (2022) IOI ground truth.
- Minimum Faithful Circuit (MFC): Greedy forward selection (add heads until sufficiency >= 85%) followed by greedy backward pruning (remove heads while comprehensiveness >= 15%). Automatically finds the smallest causally faithful head set.
- Functional Circuit Alignment Score (FCAS): 1 - mean(|rel_depth_A - rel_depth_B|) over matched heads. GPT-2 Small vs GPT-2 Medium: FCAS = 0.929. Name-mover circuits concentrate at relative depth ~0.82 across both scales.
- Bootstrap 95% CI: Percentile bootstrap (n=500) on Sufficiency / Comprehensiveness / F1 across the full prompt distribution. No cherry-picking.
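The FCAS formula is short enough to sketch directly. The version below is an illustration, not the package's implementation: it assumes relative depth is `layer / (n_layers - 1)` for 0-indexed layers and that head matching has already been done elsewhere. The function names and the matched pairs are hypothetical.

```python
def relative_depth(layer: int, n_layers: int) -> float:
    """Map a 0-indexed layer to [0, 1]. Assumption: depth = layer / (n_layers - 1)."""
    return layer / (n_layers - 1)


def fcas(matched_layers, n_layers_a: int, n_layers_b: int) -> float:
    """1 - mean(|rel_depth_A - rel_depth_B|) over matched heads.

    matched_layers: list of (layer_in_model_A, layer_in_model_B) pairs,
    one pair per matched head. How heads get matched is out of scope here.
    """
    diffs = [
        abs(relative_depth(a, n_layers_a) - relative_depth(b, n_layers_b))
        for a, b in matched_layers
    ]
    return 1.0 - sum(diffs) / len(diffs)


# Made-up matched pairs for a 12-layer and a 24-layer model (illustrative only).
pairs = [(9, 19), (9, 20), (10, 21)]
score = fcas(pairs, n_layers_a=12, n_layers_b=24)
print(f"FCAS = {score:.3f}")
```

A score of 1.0 means every matched head sits at the same relative depth in both models. The score drops as matched heads drift apart in depth.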
Benchmark Results
| Task | Domain | Sufficiency | Comprehensiveness | F1 | Category |
|---|---|---|---|---|---|
| IOI (Indirect Object ID) | Logic | 100% | 34.5% +/-14.6% | 49.4% | Moderate |
| SVA (Subject-Verb Agreement) | Grammar | 33.7% +/-4.9% | 51.7% +/-8.6% | 40.7% | Distributed |
| GEO (Country to Capital) | Factual | 90.2% +/-13.9% | 90.0% +/-14.1% | 90.1% | Faithful |
Top IOI head: L9H9 (+4.20) -- consistent with Wang et al. (2022). Top GEO head: L9H8 (+2.32).
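A percentile bootstrap like the one the README describes (n=500, 95%) can be sketched with the standard library alone. This is a minimal illustration assuming the statistic is the mean of per-prompt scores. The function name, the seed, and the scores are hypothetical and are not taken from the benchmark.

```python
import random


def percentile_bootstrap_ci(values, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, take the mean each time, then sort the means.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt comprehensiveness scores (not real benchmark data).
comp_scores = [0.21, 0.35, 0.48, 0.30, 0.41, 0.27, 0.39, 0.33]
sample_mean = sum(comp_scores) / len(comp_scores)
ci_low, ci_high = percentile_bootstrap_ci(comp_scores)
print(f"mean={sample_mean:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})")
```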
Install
pip install git+https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool.git
Quick Start
from glassbox import GlassboxV2

gb = GlassboxV2("gpt2")
result = gb.analyze(
    prompt="When Mary and John went to the store, John gave a gift to",
    correct=" Mary",
    incorrect=" John"
)
print(result["faithfulness"])
# {'suff': 1.0, 'comp': 0.345, 'f1': 0.494, 'category': 'moderate'}

# List circuit heads, highest attribution score first
for (layer, head), score in sorted(result["circuit"].items(), key=lambda x: -x[1]):
    print(f"L{layer:02d}H{head:02d} {score:.4f}")
CLI
glassbox analyze \
--prompt "When Mary and John went to the store, John gave a gift to" \
--correct " Mary" \
--incorrect " John"
How It Works
Input prompt -> Pass 1 (clean activations, no grad)
             -> Pass 2 (corrupted activations, no grad)
             -> Pass 3 (gradient pass: patch clean z, backward on logit diff)
attr(layer, head) = grad * (z_clean - z_corrupt)  # per head, last position
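The attribution formula is a first-order approximation of what activation patching measures. The toy below makes that concrete under heavy simplifications, all of them assumptions for illustration: each "head output" is a single scalar, the "model" is a hand-written function, and the gradient comes from finite differences rather than a backward pass.

```python
def metric(z):
    """Toy logit difference as a function of three scalar head outputs (hypothetical)."""
    return 2.0 * z[0] - 0.5 * z[1] + 1.0 * z[2] + 0.3 * z[0] ** 2


def grad(f, z, eps=1e-5):
    """Central finite differences; a real implementation would use autograd."""
    g = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g


z_clean = [1.0, 0.2, 0.5]     # stands in for pass 1 activations
z_corrupt = [0.1, 0.3, 0.4]   # stands in for pass 2 activations

# Pass 3 analogue: gradient of the metric at the clean activations.
g = grad(metric, z_clean)
attr = [gi * (zc - zx) for gi, zc, zx in zip(g, z_clean, z_corrupt)]


def patch_effect(h):
    """Exact drop in the metric when head h alone is set to its corrupted value."""
    patched = list(z_clean)
    patched[h] = z_corrupt[h]
    return metric(z_clean) - metric(patched)


exact = [patch_effect(h) for h in range(len(z_clean))]
for h, (a, e) in enumerate(zip(attr, exact)):
    print(f"head {h}: attribution={a:.3f}  exact patch effect={e:.3f}")
```

Heads 1 and 2 enter the toy metric linearly, so attribution and exact patching agree. Head 0 has a quadratic term, so the linear estimate overshoots. This is the usual trade-off of attribution patching.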
MFC Discovery:
  Phase 1 (forward): add heads by |attr| until suff >= 0.85
  Phase 2 (backward): prune heads while comp >= 0.15
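The two phases can be sketched as a pair of greedy loops. This is an illustration, not the package's code, and it rests on three assumptions the README does not spell out: `suff` and `comp` are supplied by the caller, pruning tries the lowest-|attr| heads first, and a head is only pruned if sufficiency also stays above its target. All names and numbers below are hypothetical.

```python
def discover_mfc(attr, suff, comp, suff_target=0.85, comp_floor=0.15):
    """Greedy forward selection followed by greedy backward pruning.

    attr: dict mapping (layer, head) -> attribution score
    suff, comp: callables taking a list of heads and returning a float
    """
    ranked = sorted(attr, key=lambda h: -abs(attr[h]))

    # Phase 1 (forward): add heads by |attr| until the circuit is sufficient.
    circuit = []
    for head in ranked:
        if suff(circuit) >= suff_target:
            break
        circuit.append(head)

    # Phase 2 (backward): try dropping heads, weakest attribution first.
    # Assumption: a head is dropped only if both thresholds still hold.
    for head in sorted(circuit, key=lambda h: abs(attr[h])):
        trial = [h for h in circuit if h != head]
        if trial and suff(trial) >= suff_target and comp(trial) >= comp_floor:
            circuit = trial
    return circuit


# Toy setup: attribution overrates head (5, 5) relative to its true effect.
attr_scores = {(9, 9): 4.2, (5, 5): 2.5, (9, 6): 2.0, (10, 0): 1.5, (3, 0): 0.1}
true_effect = {(9, 9): 4.2, (5, 5): 0.1, (9, 6): 2.0, (10, 0): 1.5, (3, 0): 0.3}
total = sum(true_effect.values())


def toy_score(heads):
    """Fraction of the total effect carried by `heads` (stand-in for both metrics)."""
    return sum(true_effect[h] for h in heads) / total


circuit = discover_mfc(attr_scores, suff=toy_score, comp=toy_score)
print(circuit)
```

In this toy run the forward phase picks up (5, 5) because of its inflated attribution, and the backward phase removes it again because the remaining heads still clear both thresholds.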
Comprehensiveness (corrupted activation patching, Wang et al. 2022):
  Replace circuit heads' clean activations with corrupted activations.
  comp = 1 - (patched_logit_diff / clean_logit_diff)
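As a worked instance of the comprehensiveness formula, the helper below plugs in made-up logit differences. The function name and the zero-denominator guard are assumptions added for the example.

```python
def comprehensiveness(clean_logit_diff: float, patched_logit_diff: float) -> float:
    """comp = 1 - (patched_logit_diff / clean_logit_diff)."""
    if clean_logit_diff == 0:
        raise ValueError("clean logit difference is zero; comp is undefined")
    return 1.0 - patched_logit_diff / clean_logit_diff


# Hypothetical numbers: corrupting the circuit heads drops the logit
# difference from 4.0 to 1.0, so the circuit accounts for 75% of it.
comp_value = comprehensiveness(clean_logit_diff=4.0, patched_logit_diff=1.0)
print(comp_value)
```

If corrupting the circuit heads leaves the logit difference unchanged, comp is 0. If it wipes the logit difference out entirely, comp is 1.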
Project Structure
glassbox/
|-- core.py # GlassboxV2 engine
|-- cli.py # Command-line interface
|-- __init__.py
benchmarks/ # Standalone evaluation scripts (IOI / SVA / GEO)
tests/ # Unit tests for patching hooks and circuit metrics
dashboard/ # Streamlit web UI for visual circuit inspection
Citation
@software{mahale2026glassbox,
author = {Mahale, Ajay},
title = {Glassbox 2.0: O(3) Attribution Patching and Minimum Faithful Circuit Discovery},
year = {2026},
url = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}
References
- Wang et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. ICLR 2023.
- Conmy et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. NeurIPS 2023.
- Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.