
Glassbox 2.0 — Open-source mechanistic interpretability for transformer models


Glassbox 2.0 - Mechanistic Interpretability

License: MIT | Python 3.8+

Open-source transformer circuit analysis. Attribution patching in O(3) passes (two forward, one with a backward), automatic circuit discovery, cross-model alignment scoring, and bootstrap confidence intervals.


Why Glassbox 2.0?

Method                           Passes Required   Comprehensiveness    Circuit Discovery    Cross-Model Alignment
Full Activation Patching         O(2N) (~192x)     Causal               Manual               None
TransformerLens (mean ablation)  O(2N)             Approximate          Manual               None
ACDC (Conmy et al., 2023)        O(2N)             Causal               Auto                 None
Glassbox 2.0                     O(3)              Corrupted patching   MFC auto-discovery   FCAS = 0.929

96x faster than full activation patching on GPT-2 Small (144 heads = 288 passes vs. 3).
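
Spelled out, the pass-count arithmetic behind that figure (head counts taken from the GPT-2 Small numbers above):

```python
n_layers, n_heads_per_layer = 12, 12        # GPT-2 Small
n_heads = n_layers * n_heads_per_layer      # 144 attention heads
full_passes = 2 * n_heads                   # clean + corrupted run per head = 288
glassbox_passes = 3                         # clean, corrupted, gradient
speedup = full_passes // glassbox_passes
print(speedup)  # 96
```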


Novel Contributions

  1. O(3) Attribution Patching -- attr(h) = grad_z * (z_clean - z_corrupt). Three passes regardless of model size. Validated against Wang et al. (2022) IOI ground truth.

  2. Minimum Faithful Circuit (MFC) -- Greedy forward selection (add heads until sufficiency >= 85%) + greedy backward pruning (remove heads while comprehensiveness >= 15%). Automatically finds the smallest causally faithful head set.

  3. Functional Circuit Alignment Score (FCAS) -- 1 - mean(|rel_depth_A - rel_depth_B|) over matched heads. GPT-2 Small vs. GPT-2 Medium: FCAS = 0.929. Name-mover circuits concentrate at relative depth ~0.82 across both scales.

  4. Bootstrap 95% CI -- Percentile bootstrap (n=500) on Sufficiency / Comprehensiveness / F1 across the full prompt distribution. No cherry-picking.
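
The attribution formula from item 1 can be illustrated with plain NumPy. The tensor names and shapes here are placeholders for the three cached passes, not the package's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, d_head = 12, 12, 64  # GPT-2 Small dimensions

# Stand-ins for the three passes: clean head outputs, corrupted head outputs,
# and the gradient of the logit difference w.r.t. z, all at the last position.
z_clean   = rng.normal(size=(n_layers, n_heads, d_head))
z_corrupt = rng.normal(size=(n_layers, n_heads, d_head))
grad_z    = rng.normal(size=(n_layers, n_heads, d_head))

# attr(layer, head) = grad_z . (z_clean - z_corrupt), summed over head dims
attr = (grad_z * (z_clean - z_corrupt)).sum(axis=-1)
print(attr.shape)  # one score per attention head
```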


Benchmark Results

Task                          Domain   Suff             Comp             F1      Category
IOI (Indirect Object ID)      Logic    100%             34.5% +/-14.6%   49.4%   Moderate
SVA (Subject-Verb Agreement)  Grammar  33.7% +/-4.9%    51.7% +/-8.6%    40.7%   Distributed
GEO (Country to Capital)      Factual  90.2% +/-13.9%   90.0% +/-14.1%   90.1%   Faithful

Top IOI head: L9H9 (+4.20) -- consistent with Wang et al. (2022). Top GEO head: L9H8 (+2.32).
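
The +/- intervals in the table are percentile-bootstrap 95% CIs over per-prompt scores. A minimal sketch of that procedure, with made-up scores (not the library's own code):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean over per-prompt scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the prompt distribution with replacement n_boot times.
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

mean, lo, hi = bootstrap_ci([0.90, 0.85, 0.95, 0.70, 0.88])
```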


Install

pip install git+https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool.git

Quick Start

from glassbox import GlassboxV2

gb = GlassboxV2("gpt2")
result = gb.analyze(
    prompt="When Mary and John went to the store, John gave a gift to",
    correct=" Mary",
    incorrect=" John"
)
print(result["faithfulness"])
# {'suff': 1.0, 'comp': 0.345, 'f1': 0.494, 'category': 'moderate'}

for (layer, head), score in sorted(result["circuit"].items(), key=lambda x: -x[1]):
    print(f"L{layer:02d}H{head:02d}  {score:.4f}")
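
The FCAS metric from contribution 3 follows directly from its formula. Defining relative depth as layer / (n_layers - 1) is an assumption here, and the matched circuits below are hypothetical, not the published GPT-2 Small/Medium ones:

```python
def fcas(circuit_a, circuit_b, n_layers_a, n_layers_b):
    """FCAS = 1 - mean |relative-depth difference| over matched head pairs.

    circuit_a / circuit_b: lists of (layer, head) tuples, matched by position.
    """
    diffs = [
        abs(la / (n_layers_a - 1) - lb / (n_layers_b - 1))
        for (la, _), (lb, _) in zip(circuit_a, circuit_b)
    ]
    return 1 - sum(diffs) / len(diffs)

# Hypothetical matched heads: a 12-layer model vs. a 24-layer model.
score = fcas([(9, 9), (10, 0)], [(18, 3), (21, 7)], n_layers_a=12, n_layers_b=24)
```

Heads at the same relative depth contribute zero penalty, so identical circuits score exactly 1.0.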

CLI

glassbox analyze \
  --prompt "When Mary and John went to the store, John gave a gift to" \
  --correct " Mary" \
  --incorrect " John"

How It Works

Input prompt -> Pass 1 (clean activations, no grad)
             -> Pass 2 (corrupted activations, no grad)
             -> Pass 3 (gradient pass: patch clean z, backward on logit diff)

attr(layer, head) = grad * (z_clean - z_corrupt)   # per head, last position

MFC Discovery:
  Phase 1 (forward):  add heads by |attr| until suff >= 0.85
  Phase 2 (backward): prune heads while comp >= 0.15
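
The two-phase greedy search above can be sketched as follows. The `suff_of` / `comp_of` callables are toy stand-ins for the patched model evaluations the real tool would run at each step:

```python
def discover_mfc(attr, suff_of, comp_of, suff_target=0.85, comp_floor=0.15):
    """Greedy MFC sketch (not the library's internals).

    attr: {(layer, head): attribution score}
    suff_of / comp_of: callables scoring a candidate head set.
    """
    ranked = sorted(attr, key=lambda h: -abs(attr[h]))
    circuit = []
    for h in ranked:                                       # Phase 1: forward selection
        circuit.append(h)
        if suff_of(circuit) >= suff_target:
            break
    for h in sorted(circuit, key=lambda h: abs(attr[h])):  # Phase 2: prune weakest first
        trial = [x for x in circuit if x != h]
        if trial and comp_of(trial) >= comp_floor:
            circuit = trial
    return circuit

# Toy scoring functions, linear in circuit size for illustration only.
attr = {(9, 9): 4.20, (9, 8): 2.32, (0, 1): 0.05}
mfc = discover_mfc(attr, suff_of=lambda c: 0.45 * len(c), comp_of=lambda c: 0.20 * len(c))
```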

Comprehensiveness (corrupted activation patching, Wang et al. 2022):
  Replace circuit heads' clean activations with corrupted activations.
  comp = 1 - (patched_logit_diff / clean_logit_diff)
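
That metric as a one-liner, with toy logit differences:

```python
def comprehensiveness(clean_logit_diff, patched_logit_diff):
    """comp = 1 - patched/clean: the fraction of the clean logit difference
    that disappears when only the circuit heads' activations are corrupted."""
    return 1 - patched_logit_diff / clean_logit_diff

# Toy numbers: corrupting the circuit heads wipes out 90% of the effect.
comp = comprehensiveness(clean_logit_diff=4.0, patched_logit_diff=0.4)
print(comp)  # 0.9
```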

Project Structure

glassbox/
|-- core.py          # GlassboxV2 engine
|-- cli.py           # Command-line interface
|-- __init__.py

benchmarks/          # Standalone evaluation scripts (IOI / SVA / GEO)
tests/               # Unit tests for patching hooks and circuit metrics
dashboard/           # Streamlit web UI for visual circuit inspection

Citation

@software{mahale2026glassbox,
  author  = {Mahale, Ajay},
  title   = {Glassbox 2.0: O(3) Attribution Patching and Minimum Faithful Circuit Discovery},
  year    = {2026},
  url     = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}

References

  • Wang et al. (2022). Interpretability in the Wild: a Circuit for IOI in GPT-2 Small. ICLR 2023.
  • Conmy et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. NeurIPS 2023.
  • Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
