
Glassbox 2.0 — Open-source mechanistic interpretability for transformer models


Glassbox 2.0 - Mechanistic Interpretability

License: MIT | Python 3.8+

Open-source transformer circuit analysis. Attribution patching in O(3) passes (two forward, one with a backward), automatic circuit discovery, cross-model alignment scoring, and bootstrap confidence intervals.


Why Glassbox 2.0?

Method                           Passes Required   Comprehensiveness    Circuit Discovery    Cross-Model Alignment
Full Activation Patching         O(2N) (~192x)     Causal               Manual               None
TransformerLens (mean ablation)  O(2N)             Approximate          Manual               None
ACDC (Conmy et al., 2023)        O(2N)             Causal               Auto                 None
Glassbox 2.0                     O(3)              Corrupted patching   MFC auto-discovery   FCAS = 0.929

96x faster than full activation patching on GPT-2 Small (144 heads = 288 passes vs. 3).
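
Spelled out, the pass-count arithmetic behind that figure (head counts taken from the GPT-2 Small numbers above):

```python
n_layers, n_heads_per_layer = 12, 12        # GPT-2 Small
n_heads = n_layers * n_heads_per_layer      # 144 attention heads
full_passes = 2 * n_heads                   # clean + corrupted run per head = 288
glassbox_passes = 3                         # clean, corrupted, gradient
speedup = full_passes // glassbox_passes
print(speedup)  # 96
```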


Novel Contributions

  1. O(3) Attribution Patching -- attr(h) = grad_z * (z_clean - z_corrupt). Three passes regardless of model size. Validated against Wang et al. (2022) IOI ground truth.

  2. Minimum Faithful Circuit (MFC) -- Greedy forward selection (add heads until sufficiency >= 85%) + greedy backward pruning (remove heads while comprehensiveness >= 15%). Automatically finds the smallest causally faithful head set.

  3. Functional Circuit Alignment Score (FCAS) -- 1 - mean(|rel_depth_A - rel_depth_B|) over matched heads. GPT-2 Small vs. GPT-2 Medium: FCAS = 0.929. Name-mover circuits concentrate at relative depth ~0.82 across both scales.

  4. Bootstrap 95% CI -- Percentile bootstrap (n=500) on Sufficiency / Comprehensiveness / F1 across the full prompt distribution. No cherry-picking.
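
The attribution formula from item 1 can be illustrated with plain NumPy. The tensor names and shapes here are placeholders for the three cached passes, not the package's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, d_head = 12, 12, 64  # GPT-2 Small dimensions

# Stand-ins for the three passes: clean head outputs, corrupted head outputs,
# and the gradient of the logit difference w.r.t. z, all at the last position.
z_clean   = rng.normal(size=(n_layers, n_heads, d_head))
z_corrupt = rng.normal(size=(n_layers, n_heads, d_head))
grad_z    = rng.normal(size=(n_layers, n_heads, d_head))

# attr(layer, head) = grad_z . (z_clean - z_corrupt), summed over head dims
attr = (grad_z * (z_clean - z_corrupt)).sum(axis=-1)
print(attr.shape)  # one score per attention head
```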


Benchmark Results

Task                          Domain   Suff             Comp             F1      Category
IOI (Indirect Object ID)      Logic    100%             34.5% +/-14.6%   49.4%   Moderate
SVA (Subject-Verb Agreement)  Grammar  33.7% +/-4.9%    51.7% +/-8.6%    40.7%   Distributed
GEO (Country to Capital)      Factual  90.2% +/-13.9%   90.0% +/-14.1%   90.1%   Faithful

Top IOI head: L9H9 (+4.20) -- consistent with Wang et al. (2022). Top GEO head: L9H8 (+2.32).
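
The +/- intervals in the table are percentile-bootstrap 95% CIs over per-prompt scores. A minimal sketch of that procedure, with made-up scores (not the library's own code):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean over per-prompt scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the prompt distribution with replacement n_boot times.
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

mean, lo, hi = bootstrap_ci([0.90, 0.85, 0.95, 0.70, 0.88])
```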


Install

pip install git+https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool.git

Quick Start

from glassbox import GlassboxV2

gb = GlassboxV2("gpt2")
result = gb.analyze(
    prompt="When Mary and John went to the store, John gave a gift to",
    correct=" Mary",
    incorrect=" John"
)
print(result["faithfulness"])
# {'suff': 1.0, 'comp': 0.345, 'f1': 0.494, 'category': 'moderate'}

for (layer, head), score in sorted(result["circuit"].items(), key=lambda x: -x[1]):
    print(f"L{layer:02d}H{head:02d}  {score:.4f}")
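
The FCAS metric from contribution 3 follows directly from its formula. Defining relative depth as layer / (n_layers - 1) is an assumption here, and the matched circuits below are hypothetical, not the published GPT-2 Small/Medium ones:

```python
def fcas(circuit_a, circuit_b, n_layers_a, n_layers_b):
    """FCAS = 1 - mean |relative-depth difference| over matched head pairs.

    circuit_a / circuit_b: lists of (layer, head) tuples, matched by position.
    """
    diffs = [
        abs(la / (n_layers_a - 1) - lb / (n_layers_b - 1))
        for (la, _), (lb, _) in zip(circuit_a, circuit_b)
    ]
    return 1 - sum(diffs) / len(diffs)

# Hypothetical matched heads: a 12-layer model vs. a 24-layer model.
score = fcas([(9, 9), (10, 0)], [(18, 3), (21, 7)], n_layers_a=12, n_layers_b=24)
```

Heads at the same relative depth contribute zero penalty, so identical circuits score exactly 1.0.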

CLI

glassbox analyze \
  --prompt "When Mary and John went to the store, John gave a gift to" \
  --correct " Mary" \
  --incorrect " John"

How It Works

Input prompt -> Pass 1 (clean activations, no grad)
             -> Pass 2 (corrupted activations, no grad)
             -> Pass 3 (gradient pass: patch clean z, backward on logit diff)

attr(layer, head) = grad * (z_clean - z_corrupt)   # per head, last position

MFC Discovery:
  Phase 1 (forward):  add heads by |attr| until suff >= 0.85
  Phase 2 (backward): prune heads while comp >= 0.15
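
The two-phase greedy search above can be sketched as follows. The `suff_of` / `comp_of` callables are toy stand-ins for the patched model evaluations the real tool would run at each step:

```python
def discover_mfc(attr, suff_of, comp_of, suff_target=0.85, comp_floor=0.15):
    """Greedy MFC sketch (not the library's internals).

    attr: {(layer, head): attribution score}
    suff_of / comp_of: callables scoring a candidate head set.
    """
    ranked = sorted(attr, key=lambda h: -abs(attr[h]))
    circuit = []
    for h in ranked:                                       # Phase 1: forward selection
        circuit.append(h)
        if suff_of(circuit) >= suff_target:
            break
    for h in sorted(circuit, key=lambda h: abs(attr[h])):  # Phase 2: prune weakest first
        trial = [x for x in circuit if x != h]
        if trial and comp_of(trial) >= comp_floor:
            circuit = trial
    return circuit

# Toy scoring functions, linear in circuit size for illustration only.
attr = {(9, 9): 4.20, (9, 8): 2.32, (0, 1): 0.05}
mfc = discover_mfc(attr, suff_of=lambda c: 0.45 * len(c), comp_of=lambda c: 0.20 * len(c))
```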

Comprehensiveness (corrupted activation patching, Wang et al. 2022):
  Replace circuit heads' clean activations with corrupted activations.
  comp = 1 - (patched_logit_diff / clean_logit_diff)
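
That metric as a one-liner, with toy logit differences:

```python
def comprehensiveness(clean_logit_diff, patched_logit_diff):
    """comp = 1 - patched/clean: the fraction of the clean logit difference
    that disappears when only the circuit heads' activations are corrupted."""
    return 1 - patched_logit_diff / clean_logit_diff

# Toy numbers: corrupting the circuit heads wipes out 90% of the effect.
comp = comprehensiveness(clean_logit_diff=4.0, patched_logit_diff=0.4)
print(comp)  # 0.9
```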

Project Structure

glassbox/
|-- core.py          # GlassboxV2 engine
|-- cli.py           # Command-line interface
|-- __init__.py

benchmarks/          # Standalone evaluation scripts (IOI / SVA / GEO)
tests/               # Unit tests for patching hooks and circuit metrics
dashboard/           # Streamlit web UI for visual circuit inspection

Citation

@software{mahale2026glassbox,
  author  = {Mahale, Ajay},
  title   = {Glassbox 2.0: O(3) Attribution Patching and Minimum Faithful Circuit Discovery},
  year    = {2026},
  url     = {https://github.com/designer-coderajay/Glassbox-AI-2.0-Mechanistic-Interpretability-tool}
}

References

  • Wang et al. (2022). Interpretability in the Wild: a Circuit for IOI in GPT-2 Small. ICLR 2023.
  • Conmy et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. NeurIPS 2023.
  • Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
