
Glassbox 2.0

Open-source mechanistic interpretability for transformer models.


Live Demo · Docs · PyPI


Glassbox 2.0 identifies the attention heads responsible for a model's prediction, quantifies their causal contribution, and tells you exactly why a transformer made the choice it did — in one function call.

Built on attribution patching, which needs only a constant number of model passes (one clean forward, one corrupted forward, one backward) regardless of model size. Benchmarked against ACDC. Grounded in peer-reviewed mechanistic interpretability research.


Highlights

  • Constant-pass attribution patching — identifies circuits from one clean forward, one corrupted forward, and one backward pass, not exhaustive edge enumeration
  • 37x faster than ACDC on GPT-2 small (1.2s vs 43.2s)
  • Bootstrap 95% CI — every faithfulness score from bootstrap_metrics() ships with confidence intervals, not point estimates
  • FCAS cross-model alignment — quantifies how similar circuits are across model sizes (GPT-2 family: 0.783-0.835)
  • Interactive dashboard — Streamlit UI on HuggingFace Spaces, no setup required

Quickstart

pip install glassbox-mech-interp

from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb    = GlassboxV2(model)

result = gb.analyze(
    prompt    = "When Mary and John went to the store, John gave a drink to",
    correct   = "Mary",
    incorrect = "John",
)

print(result["faithfulness"])
# {
#   "sufficiency":       0.80,   <- Taylor approximation (see note below)
#   "comprehensiveness": 0.37,   <- exact causal value
#   "f1":                0.49,
#   "category":          "moderate",
#   "suff_is_approx":    True
# }

# Circuit is a list of (layer, head) tuples, sorted by attribution score
print(result["circuit"])
# [(9, 9), (8, 10), (7, 3), ...]

# To see attribution scores for each head:
attrs = result["attributions"]
for (layer, head) in result["circuit"]:
    score = attrs.get(str((layer, head)), 0.0)
    print(f"L{layer:02d}H{head:02d} -> {score:.4f}")

# For confidence intervals, use bootstrap_metrics() instead:
boot = gb.bootstrap_metrics(prompts=[
    ("When Mary and John went to the store, John gave a drink to", "Mary", "John"),
    ("When Alice and Bob entered the room, Bob handed the key to", "Alice", "Bob"),
    # ... add more prompts for reliable CIs (recommended n >= 20)
], n_boot=500)
print(boot["sufficiency"])
# {"mean": 0.82, "std": 0.06, "ci_lo": 0.71, "ci_hi": 0.91, "n": 2}

Note on Sufficiency: The sufficiency value in analyze() is a first-order Taylor approximation (Nanda et al. 2023), not the exact causal value. This is why the benchmark table below shows ~80% while the MSc thesis paper reports ~100% — the paper used exact Wang et al. (2022) sufficiency computed by ablating non-circuit heads. Both are valid measures; they differ by methodology. The suff_is_approx: True flag in the output makes this explicit.

Try it instantly — no install needed: huggingface.co/spaces/designer-coderajay/Glassbox-ai


Benchmarks

Evaluated on the IOI (Indirect Object Identification) task across the GPT-2 family.

| Model        | Layers | Heads | Sufficiency* | Comprehensiveness | F1    | Glassbox time | ACDC time | Speedup |
|--------------|--------|-------|--------------|-------------------|-------|---------------|-----------|---------|
| GPT-2 small  | 12     | 12    | 80.0%        | 37.2%             | 48.8% | 1.2 s         | 43.2 s    | 37x     |
| GPT-2 medium | 24     | 16    | 35.1%        | 23.7%             | 27.9% | 4.9 s         | 115.2 s   | 24x     |
| GPT-2 large  | 36     | 20    | 18.2%        | 14.2%             | 15.9% | 14.3 s        | 216.0 s   | 15x     |

*Sufficiency values are first-order Taylor approximations. Exact causal sufficiency (requiring full ablation runs) is higher — see the arXiv paper for exact values.

Cross-model circuit alignment (FCAS):

| Pair                          | FCAS  |
|-------------------------------|-------|
| GPT-2 small <-> GPT-2 medium  | 0.835 |
| GPT-2 small <-> GPT-2 large   | 0.783 |
| GPT-2 medium <-> GPT-2 large  | 0.833 |

High FCAS scores confirm the IOI circuit is structurally conserved across model scale — consistent with Wang et al. (2022).
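FCAS itself is defined in the paper and the exact computation inside functional_circuit_alignment() is not reproduced here. Purely as intuition, a cross-circuit score of this flavor can be sketched as a set overlap between two top-k head lists, compared against a permutation null; Jaccard overlap and the null construction below are assumptions for illustration, not the paper's definition:

```python
import random

def overlap_score(heads_a, heads_b, top_k=10):
    """Hypothetical stand-in for FCAS: Jaccard overlap of the top-k head sets."""
    a, b = set(heads_a[:top_k]), set(heads_b[:top_k])
    return len(a & b) / len(a | b)

def null_zscore(heads_a, heads_b, all_heads, top_k=10, n_null=1000, seed=0):
    """Score the observed overlap against random top-k head sets drawn from
    the full (layer, head) grid, yielding a z-score for significance."""
    rng = random.Random(seed)
    observed = overlap_score(heads_a, heads_b, top_k)
    null = [
        overlap_score(rng.sample(all_heads, top_k),
                      rng.sample(all_heads, top_k), top_k)
        for _ in range(n_null)
    ]
    mu = sum(null) / n_null
    sd = (sum((x - mu) ** 2 for x in null) / (n_null - 1)) ** 0.5
    return observed, (observed - mu) / sd if sd > 0 else float("inf")

# Two invented 4-head circuits sharing 3 heads:
circuit_a = [(9, 9), (8, 10), (7, 3), (10, 0)]
circuit_b = [(9, 9), (8, 10), (6, 1), (10, 0)]
print(overlap_score(circuit_a, circuit_b, top_k=4))  # 3 shared of 5 total -> 0.6
```

A score near the 0.78-0.84 range on such a measure would indicate far more overlap than random head sets produce, which is the qualitative claim the FCAS table makes.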


How It Works

Glassbox runs attribution patching (Nanda et al. 2023) with name-swap corruption, following the IOI methodology of Wang et al. (2022).

Clean prompt     ->  model  ->  logit(Mary)
Corrupted prompt ->  model  ->  logit(John)

For each attention head:
  Patch clean activation -> corrupted run
  Measure delta_logit(Mary - John)
  Normalize -> attribution score
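The per-head recipe above can be sketched with a toy linear model, where the first-order Taylor attribution (gradient times activation difference) is exact. Everything here — the shapes, the unembedding matrix W, and the token ids — is invented for illustration; the real tool computes this over TransformerLens activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a d_model=8 activation vector feeding an
# unembedding matrix W over a 50-token vocabulary.
d_model, vocab = 8, 50
W = rng.normal(size=(vocab, d_model))
h_clean = rng.normal(size=d_model)   # activation on the clean prompt
h_corr = rng.normal(size=d_model)    # activation on the corrupted prompt
target, distractor = 3, 7            # stand-ins for the "Mary"/"John" token ids

def logit_diff(h):
    logits = W @ h
    return logits[target] - logits[distractor]

# Gradient of the logit-difference metric w.r.t. h; exact for this linear toy.
grad = W[target] - W[distractor]

# Attribution patching: gradient times (clean - corrupted) activation,
# one score per component (per head, in the real model).
attribution = grad * (h_clean - h_corr)

# For a linear model the first-order estimate recovers the true causal effect.
approx_delta = attribution.sum()
true_delta = logit_diff(h_clean) - logit_diff(h_corr)
print(np.isclose(approx_delta, true_delta))  # True
```

In a real transformer the model is nonlinear, so the same estimate is only first-order accurate — which is exactly why the sufficiency note above distinguishes the Taylor approximation from the exact ablation value.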

Faithfulness metrics follow the ERASER framework:

  • Sufficiency — does the circuit alone recover the clean prediction? (Taylor approx in analyze(), exact in paper)
  • Comprehensiveness — how much does ablating the circuit hurt? (exact causal measurement)
  • F1 — harmonic mean of both
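The F1 combination is the standard harmonic mean; a minimal sketch, plugging in the rounded GPT-2 large values from the benchmark table:

```python
def f1(sufficiency, comprehensiveness):
    """Harmonic mean of the two ERASER-style faithfulness scores."""
    if sufficiency + comprehensiveness == 0:
        return 0.0
    return 2 * sufficiency * comprehensiveness / (sufficiency + comprehensiveness)

print(round(f1(0.182, 0.142), 3))  # -> 0.16, the 15.9% GPT-2 large F1 up to table rounding
```

Values recomputed from rounded table entries can differ slightly from the reported F1, since the tool works from unrounded scores.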

Confidence intervals are available via bootstrap_metrics() with n_boot resamples.
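A percentile bootstrap over per-prompt scores could look like the following. This is a hypothetical mirror of the fields bootstrap_metrics() reports for each metric, not the library's actual implementation:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI over a list of per-prompt metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample prompts with replacement and record each resample's mean.
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    return {
        "mean":  float(scores.mean()),
        "std":   float(means.std(ddof=1)),
        "ci_lo": float(np.quantile(means, alpha / 2)),
        "ci_hi": float(np.quantile(means, 1 - alpha / 2)),
        "n":     len(scores),
    }

print(bootstrap_ci([0.78, 0.86]))  # tiny n -> wide, unstable interval
```

With only two prompts, as in the quickstart, the interval is dominated by resampling noise — hence the recommendation of n >= 20 prompts for reliable CIs.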


Installation

# From PyPI
pip install glassbox-mech-interp

# From source
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
pip install -e .

Requirements: Python >= 3.8, PyTorch >= 2.0, TransformerLens >= 1.0


Run the Dashboard Locally

pip install glassbox-mech-interp streamlit plotly
git clone https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool
cd Glassbox-AI-2.0-Mechanistic-Interpretability-tool
streamlit run dashboard/app.py

Or use the hosted version at huggingface.co/spaces/designer-coderajay/Glassbox-ai.


API Reference

GlassboxV2(model)

from transformer_lens import HookedTransformer
from glassbox import GlassboxV2

model = HookedTransformer.from_pretrained("gpt2")
gb    = GlassboxV2(model)

Methods:

  • gb.analyze(prompt, correct, incorrect) — full circuit analysis; returns a dict with circuit, attributions, faithfulness, and corr_prompt.
  • gb.attribution_patching(clean_tokens, corr_tokens, target_id, distractor_id) — raw per-head attribution scores; expects tokenized tensors and integer token IDs.
  • gb.bootstrap_metrics(prompts, n_boot, alpha) — bootstrap 95% CIs on sufficiency, comprehensiveness, and F1; pass a list of (prompt, correct, incorrect) tuples.
  • gb.functional_circuit_alignment(heads_a, heads_b, top_k, n_null) — cross-model FCAS score with a null distribution and z-score.

analyze() return structure:

{
    "circuit":      [(9, 9), (8, 10), ...],       # List of (layer, head) tuples
    "n_heads":      int,
    "clean_ld":     float,                         # logit(correct) - logit(incorrect)
    "corr_prompt":  str,                           # name-swapped corrupted prompt
    "attributions": {"(9, 9)": 0.174, ...},        # string keys, float values
    "faithfulness": {
        "sufficiency":       float,                # Taylor approximation
        "comprehensiveness": float,                # exact causal value
        "f1":                float,
        "category":          str,                  # one of: faithful, backup_mechanisms,
                                                   #   moderate, incomplete, weak
        "suff_is_approx":    True,
    }
}

Citation

If you use Glassbox 2.0 in your research, please cite:

@software{mahale2025glassbox,
  author    = {Mahale, Ajay Pravin},
  title     = {Glassbox 2.0: Causally Grounded Mechanistic Interpretability for Transformer Models},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}

Related Work

  • Wang et al. (2022) — IOI circuit discovery in GPT-2
  • TransformerLens — mechanistic interpretability library this builds on
  • ACDC — automatic circuit discovery (baseline we benchmark against)
  • ERASER — faithfulness evaluation framework

License

MIT. See LICENSE.


Built by Ajay Pravin Mahale · Made in Germany · Glassbox AI
