
Glassbox 2.0

Open-source mechanistic interpretability for transformer models.


Live Demo · Docs · PyPI


Glassbox 2.0 identifies the attention heads responsible for a model's prediction, quantifies their causal contribution, and tells you exactly why a transformer made the choice it did — in one function call.

Built on attribution patching, which needs only a constant number of model passes (one clean forward, one corrupted forward, one backward) regardless of model size. Benchmarked against ACDC. Grounded in peer-reviewed mechanistic interpretability research.


Highlights

  • Constant-pass attribution patching — identifies circuits from one clean forward, one corrupted forward, and one backward pass, not exhaustive edge enumeration
  • 37x faster than ACDC on GPT-2 small (1.2s vs 43.2s)
  • Bootstrap 95% CI — every faithfulness score from bootstrap_metrics() ships with confidence intervals, not point estimates
  • FCAS cross-model alignment — quantifies how similar circuits are across model sizes (GPT-2 family: 0.783-0.835)
  • Interactive dashboard — Streamlit UI on HuggingFace Spaces, no setup required

Quickstart

pip install glassbox-mech-interp

from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb    = GlassboxV2(model)

result = gb.analyze(
    prompt    = "When Mary and John went to the store, John gave a drink to",
    correct   = "Mary",
    incorrect = "John",
)

print(result["faithfulness"])
# {
#   "sufficiency":       0.80,   <- Taylor approximation (see note below)
#   "comprehensiveness": 0.37,   <- exact causal value
#   "f1":                0.49,
#   "category":          "moderate",
#   "suff_is_approx":    True
# }

# Circuit is a list of (layer, head) tuples, sorted by attribution score
print(result["circuit"])
# [(9, 9), (8, 10), (7, 3), ...]

# To see attribution scores for each head:
attrs = result["attributions"]
for (layer, head) in result["circuit"]:
    score = attrs.get(str((layer, head)), 0.0)
    print(f"L{layer:02d}H{head:02d} -> {score:.4f}")

# For confidence intervals, use bootstrap_metrics() instead:
boot = gb.bootstrap_metrics(prompts=[
    ("When Mary and John went to the store, John gave a drink to", "Mary", "John"),
    ("When Alice and Bob entered the room, Bob handed the key to", "Alice", "Bob"),
    # ... add more prompts for reliable CIs (recommended n >= 20)
], n_boot=500)
print(boot["sufficiency"])
# {"mean": 0.82, "std": 0.06, "ci_lo": 0.71, "ci_hi": 0.91, "n": 2}

Note on Sufficiency: The sufficiency value in analyze() is a first-order Taylor approximation (Nanda et al. 2023), not the exact causal value. This is why the benchmark table below shows ~80% while the MSc thesis paper reports ~100% — the paper used exact Wang et al. (2022) sufficiency computed by ablating non-circuit heads. Both are valid measures; they differ by methodology. The suff_is_approx: True flag in the output makes this explicit.

Try it instantly — no install needed: huggingface.co/spaces/designer-coderajay/Glassbox-ai


Benchmarks

Evaluated on the IOI (Indirect Object Identification) task across the GPT-2 family.

| Model        | Layers | Heads | Sufficiency* | Comprehensiveness | F1    | Glassbox time | ACDC time | Speedup |
|--------------|--------|-------|--------------|-------------------|-------|---------------|-----------|---------|
| GPT-2 small  | 12     | 12    | 80.0%        | 37.2%             | 48.8% | 1.2 s         | 43.2 s    | 37x     |
| GPT-2 medium | 24     | 16    | 35.1%        | 23.7%             | 27.9% | 4.9 s         | 115.2 s   | 24x     |
| GPT-2 large  | 36     | 20    | 18.2%        | 14.2%             | 15.9% | 14.3 s        | 216.0 s   | 15x     |

*Sufficiency values are first-order Taylor approximations. Exact causal sufficiency (requiring full ablation runs) is higher — see the arXiv paper for exact values.

Cross-model circuit alignment (FCAS):

| Pair                          | FCAS  |
|-------------------------------|-------|
| GPT-2 small <-> GPT-2 medium  | 0.835 |
| GPT-2 small <-> GPT-2 large   | 0.783 |
| GPT-2 medium <-> GPT-2 large  | 0.833 |

High FCAS scores confirm the IOI circuit is structurally conserved across model scale — consistent with Wang et al. (2022).
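FCAS itself is defined in the paper and the exact computation inside functional_circuit_alignment() is not reproduced here. Purely as intuition, a cross-circuit score of this flavor can be sketched as a set overlap between two top-k head lists, compared against a permutation null; Jaccard overlap and the null construction below are assumptions for illustration, not the paper's definition:

```python
import random

def overlap_score(heads_a, heads_b, top_k=10):
    """Hypothetical stand-in for FCAS: Jaccard overlap of the top-k head sets."""
    a, b = set(heads_a[:top_k]), set(heads_b[:top_k])
    return len(a & b) / len(a | b)

def null_zscore(heads_a, heads_b, all_heads, top_k=10, n_null=1000, seed=0):
    """Score the observed overlap against random top-k head sets drawn from
    the full (layer, head) grid, yielding a z-score for significance."""
    rng = random.Random(seed)
    observed = overlap_score(heads_a, heads_b, top_k)
    null = [
        overlap_score(rng.sample(all_heads, top_k),
                      rng.sample(all_heads, top_k), top_k)
        for _ in range(n_null)
    ]
    mu = sum(null) / n_null
    sd = (sum((x - mu) ** 2 for x in null) / (n_null - 1)) ** 0.5
    return observed, (observed - mu) / sd if sd > 0 else float("inf")

# Two invented 4-head circuits sharing 3 heads:
circuit_a = [(9, 9), (8, 10), (7, 3), (10, 0)]
circuit_b = [(9, 9), (8, 10), (6, 1), (10, 0)]
print(overlap_score(circuit_a, circuit_b, top_k=4))  # 3 shared of 5 total -> 0.6
```

A score near the 0.78-0.84 range on such a measure would indicate far more overlap than random head sets produce, which is the qualitative claim the FCAS table makes.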


How It Works

Glassbox runs attribution patching (Nanda et al. 2023) with name-swap corruption, following the IOI methodology of Wang et al. (2022).

Clean prompt     ->  model  ->  logit(Mary)
Corrupted prompt ->  model  ->  logit(John)

For each attention head:
  Patch clean activation -> corrupted run
  Measure delta_logit(Mary - John)
  Normalize -> attribution score
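The per-head recipe above can be sketched with a toy linear model, where the first-order Taylor attribution (gradient times activation difference) is exact. Everything here — the shapes, the unembedding matrix W, and the token ids — is invented for illustration; the real tool computes this over TransformerLens activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a d_model=8 activation vector feeding an
# unembedding matrix W over a 50-token vocabulary.
d_model, vocab = 8, 50
W = rng.normal(size=(vocab, d_model))
h_clean = rng.normal(size=d_model)   # activation on the clean prompt
h_corr = rng.normal(size=d_model)    # activation on the corrupted prompt
target, distractor = 3, 7            # stand-ins for the "Mary"/"John" token ids

def logit_diff(h):
    logits = W @ h
    return logits[target] - logits[distractor]

# Gradient of the logit-difference metric w.r.t. h; exact for this linear toy.
grad = W[target] - W[distractor]

# Attribution patching: gradient times (clean - corrupted) activation,
# one score per component (per head, in the real model).
attribution = grad * (h_clean - h_corr)

# For a linear model the first-order estimate recovers the true causal effect.
approx_delta = attribution.sum()
true_delta = logit_diff(h_clean) - logit_diff(h_corr)
print(np.isclose(approx_delta, true_delta))  # True
```

In a real transformer the model is nonlinear, so the same estimate is only first-order accurate — which is exactly why the sufficiency note above distinguishes the Taylor approximation from the exact ablation value.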

Faithfulness metrics follow the ERASER framework:

  • Sufficiency — does the circuit alone recover the clean prediction? (Taylor approx in analyze(), exact in paper)
  • Comprehensiveness — how much does ablating the circuit hurt? (exact causal measurement)
  • F1 — harmonic mean of both
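The F1 combination is the standard harmonic mean; a minimal sketch, plugging in the rounded GPT-2 large values from the benchmark table:

```python
def f1(sufficiency, comprehensiveness):
    """Harmonic mean of the two ERASER-style faithfulness scores."""
    if sufficiency + comprehensiveness == 0:
        return 0.0
    return 2 * sufficiency * comprehensiveness / (sufficiency + comprehensiveness)

print(round(f1(0.182, 0.142), 3))  # -> 0.16, the 15.9% GPT-2 large F1 up to table rounding
```

Values recomputed from rounded table entries can differ slightly from the reported F1, since the tool works from unrounded scores.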

Confidence intervals are available via bootstrap_metrics() with n_boot resamples.
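A percentile bootstrap over per-prompt scores could look like the following. This is a hypothetical mirror of the fields bootstrap_metrics() reports for each metric, not the library's actual implementation:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI over a list of per-prompt metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample prompts with replacement and record each resample's mean.
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    return {
        "mean":  float(scores.mean()),
        "std":   float(means.std(ddof=1)),
        "ci_lo": float(np.quantile(means, alpha / 2)),
        "ci_hi": float(np.quantile(means, 1 - alpha / 2)),
        "n":     len(scores),
    }

print(bootstrap_ci([0.78, 0.86]))  # tiny n -> wide, unstable interval
```

With only two prompts, as in the quickstart, the interval is dominated by resampling noise — hence the recommendation of n >= 20 prompts for reliable CIs.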


Installation

# From PyPI
pip install glassbox-mech-interp

# From source
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
pip install -e .

Requirements: Python >= 3.8, PyTorch >= 2.0, TransformerLens >= 1.0


Run the Dashboard Locally

pip install glassbox-mech-interp streamlit plotly
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
streamlit run dashboard/app.py

Or use the hosted version at huggingface.co/spaces/designer-coderajay/Glassbox-ai.


API Reference

GlassboxV2(model)

from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb    = GlassboxV2(model)

Methods:

  • gb.analyze(prompt, correct, incorrect) — full circuit analysis; returns a dict with circuit, attributions, faithfulness, and corr_prompt.
  • gb.attribution_patching(clean_tokens, corr_tokens, target_id, distractor_id) — raw per-head attribution scores; expects tokenized tensors and integer token IDs.
  • gb.bootstrap_metrics(prompts, n_boot, alpha) — bootstrap 95% CIs on sufficiency, comprehensiveness, and F1; pass a list of (prompt, correct, incorrect) tuples.
  • gb.functional_circuit_alignment(heads_a, heads_b, top_k, n_null) — cross-model FCAS score with a null distribution and z-score.

analyze() return structure:

{
    "circuit":      [(9, 9), (8, 10), ...],       # List of (layer, head) tuples
    "n_heads":      int,
    "clean_ld":     float,                         # logit(correct) - logit(incorrect)
    "corr_prompt":  str,                           # name-swapped corrupted prompt
    "attributions": {"(9, 9)": 0.174, ...},        # string keys, float values
    "faithfulness": {
        "sufficiency":       float,                # Taylor approximation
        "comprehensiveness": float,                # exact causal value
        "f1":                float,
        "category":          str,                  # one of: faithful, backup_mechanisms,
                                                   #   moderate, incomplete, weak
        "suff_is_approx":    True,
    }
}

Citation

If you use Glassbox 2.0 in your research, please cite:

@software{mahale2025glassbox,
  author    = {Mahale, Ajay Pravin},
  title     = {Glassbox 2.0: Causally Grounded Mechanistic Interpretability for Transformer Models},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}

Related Work

  • Wang et al. (2022) — IOI circuit discovery in GPT-2
  • TransformerLens — mechanistic interpretability library this builds on
  • ACDC — automatic circuit discovery (baseline we benchmark against)
  • ERASER — faithfulness evaluation framework

License

MIT. See LICENSE.


Built by Ajay Pravin Mahale · Made in Germany · Glassbox AI
