Criterion-driven feature discovery, explanation, causal testing, and cross-model activation matching.
interp-lab
interp-lab is an open-source toolkit for criterion-driven mechanistic interpretability.
Give it a model, a plain-language criterion, and feature evidence. It ranks the internal features that track the criterion, explains them, tests their causal impact with interventions, and searches for equivalent features in other models, then grades how much each claim is supported by evidence.
python -m pip install interp-lab
interp-lab doctor
interp-lab quickstart # a short guided walkthrough of the workflow and metrics
# A complete tour on toy models in one command — no GPU, no downloads:
interp-lab demo --out reports/demo # then open reports/demo/index.html
interp-lab inspect \
--model google/gemma-2-2b \
--criterion "the model is aware it is being evaluated" \
--backend toy \
--out reports/eval-awareness
Features
- Correlational and causal evidence are kept separate. Association comes from activation/criterion statistics; causal effect comes from ablation, amplification, clamp, patch, and steering runs. A feature that merely co-activates with the criterion is not treated like one that moves the behavior.
- Claims are graded, not asserted. validate-matches and validate-attribution-graph mark each result as validated, needs_causal_evidence, plausible, contradicted, or weak, with reason codes.
- Controls and uncertainty are first-class. Intervention runs support random_feature, matched_frequency, and placebo controls, side-effect checks, sign-consistency, and confidence intervals.
- Everything is reproducible and agent-friendly. Runs emit manifests with the tool version, platform, and input hashes; reports include agent_next_actions with exact follow-up commands; interp_lab.public_api_contract() exposes the stable surface as data.
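To make the separation between association and causal evidence concrete, here is a minimal, self-contained sketch. It is not interp-lab's implementation: the grade names are the ones listed above, but the function names, the 0.3 threshold, the decision rules, and the sample numbers are all assumptions made for illustration. The real rules and reason codes live in validate-matches and validate-attribution-graph.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def grade(association, effect_ci=None, control_ci=None, min_association=0.3):
    """Hypothetical grading rule; thresholds and logic are illustrative only."""
    strong = abs(association) >= min_association
    if effect_ci is None:
        # Correlation alone never earns a causal grade.
        return "needs_causal_evidence" if strong else "weak"
    low, high = effect_ci
    if low <= 0 <= high:
        # The intervention effect is not distinguishable from zero.
        return "plausible" if strong else "weak"
    if (low > 0) != (association > 0):
        # The intervention moves behavior against the sign of the association.
        return "contradicted"
    if control_ci is not None and (low > control_ci[1] or high < control_ci[0]):
        # The effect interval is clear of the control interval.
        return "validated"
    return "plausible"


# Made-up per-prompt effects: the target feature vs. a random_feature control.
feature_effects = [0.50, 0.60, 0.55, 0.45, 0.50, 0.65]
control_effects = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02]

effect_ci = bootstrap_ci(feature_effects)
control_ci = bootstrap_ci(control_effects)

print(grade(0.7))                          # association only
print(grade(0.7, effect_ci, control_ci))   # association plus causal evidence
print(grade(-0.7, effect_ci, control_ci))  # signs disagree
```

The point of the sketch is the shape of the decision: a strong association without an intervention run stops at needs_causal_evidence, and a causal grade requires an effect that is separated from a control.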
The workflow
- Compile a natural-language criterion into examples and scores.
- Collect candidate features from SAEs, NLA (Natural Language Autoencoder) explanations, or feature dumps, or feed any latents (crosscoders included) through the model-agnostic activation-records path.
- Rank features by criterion association, specificity, causal evidence, and stability.
- Build a feature fingerprint that can be compared across models.
- Validate cross-model equivalents with interventions.
from interp_lab import compare, inspect, validate_matches
left = inspect("toy/model-a", "the model is aware it is being evaluated", backend="toy", out="reports/model-a")
right = inspect("toy/model-b", "the model is aware it is being evaluated", backend="toy", out="reports/model-b")
matches = compare(left.report, right.report, out="reports/matches.json")
validation = validate_matches(matches.report, out="reports/match-validation.json")
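The ranking step in the workflow above can be pictured with a small standalone sketch. It covers only the criterion-association part, using a plain Pearson correlation between each feature's activations and the criterion scores. The feature names, the data, and the choice of Pearson correlation are assumptions for illustration; interp-lab's own ranking also weighs specificity, causal evidence, and stability.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_association(activations, criterion_scores):
    """Sort features by absolute correlation with the criterion, strongest first."""
    scored = [
        (name, pearson(values, criterion_scores))
        for name, values in activations.items()
    ]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical criterion scores (1.0 = example satisfies the criterion).
criterion_scores = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

# Hypothetical per-example activations for three features.
activations = {
    "feature_a": [0.9, 0.8, 0.1, 0.0, 0.7, 0.2],  # tracks the criterion
    "feature_b": [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],  # roughly unrelated
    "feature_c": [0.1, 0.2, 0.7, 0.9, 0.3, 0.6],  # tracks its absence
}

ranking = rank_by_association(activations, criterion_scores)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Sorting by absolute value keeps negatively associated features visible, since a feature that fires when the criterion is absent can be as informative as one that fires when it is present.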
Evidence sources
interp-lab keeps portable JSONL evidence formats stable in the base package; heavier model tooling lives behind optional extras. Supported paths include toy, JSONL feature dumps, activation records, Neuronpedia, SAE Lens, Goodfire, Gemma Scope / Qwen-Scope, Hugging Face, TransformerLens, NNsight, contrast-direction, and on-demand SAE training. Each integration is an optional bridge (pip install "interp-lab[saelens]", [hf], [transformerlens], [nnsight], [goodfire], [modal], [publish], …).
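For readers new to JSONL, the sketch below shows the general idea behind a portable evidence file: one JSON object per line, grouped by feature after loading. The field names (feature_id, example_id, activation) are hypothetical and are not taken from interp-lab's schema; the command reference documents the actual feature-dump, activation-record, and intervention-record formats.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# In-memory stand-in for a file on disk; field names are illustrative only.
sample = io.StringIO(
    '{"feature_id": "toy/0", "example_id": "ex-1", "activation": 0.82}\n'
    "\n"
    '{"feature_id": "toy/0", "example_id": "ex-2", "activation": 0.05}\n'
)

records = list(read_jsonl(sample))

# Group activations by feature so they can be scored against a criterion.
by_feature = {}
for record in records:
    by_feature.setdefault(record["feature_id"], []).append(record["activation"])

print(by_feature)
```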
Architecture
The core object is a FeatureFingerprint:
activation signature
+ text explanation embedding
+ decoder signature
+ causal effect vector
+ examples
Cross-model equivalence is scored by fingerprint similarity; validate-matches turns candidates into explicit evidence grades. The pipeline is built around four small adapter interfaces, so new backends are easy to add:
- FeatureProvider: returns candidate features.
- Verbalizer: adds NLA-style text explanations.
- InterventionRunner: ablates, amplifies, patches, or estimates causal effects.
- CriterionCompiler: turns natural-language criteria into examples and scoring hints.
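One way to picture the four adapter roles is as structural interfaces. The sketch below uses typing.Protocol and deliberately does not import from interp_lab: the interface names mirror the list above, but every method name, signature, and the Feature record are invented for illustration. See docs/ARCHITECTURE.md for the real contracts.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Feature:
    """Hypothetical feature record, not interp-lab's own type."""
    feature_id: str
    activations: list[float]
    explanation: str = ""


class FeatureProvider(Protocol):
    def candidates(self, model: str) -> list[Feature]: ...


class Verbalizer(Protocol):
    def explain(self, feature: Feature) -> str: ...


class InterventionRunner(Protocol):
    def ablate(self, feature: Feature) -> float: ...


class CriterionCompiler(Protocol):
    def compile(self, criterion: str) -> list[tuple[str, float]]: ...


class ToyProvider:
    """Satisfies FeatureProvider structurally; no inheritance needed."""

    def candidates(self, model: str) -> list[Feature]:
        return [Feature("toy/0", [0.1, 0.9]), Feature("toy/1", [0.7, 0.2])]


class PeakVerbalizer:
    """Satisfies Verbalizer by naming the example with the highest activation."""

    def explain(self, feature: Feature) -> str:
        peak = max(range(len(feature.activations)), key=feature.activations.__getitem__)
        return f"peaks on example {peak}"


def collect(model: str, provider: FeatureProvider, verbalizer: Verbalizer) -> list[Feature]:
    """Fetch candidates, then attach a text explanation to each."""
    features = provider.candidates(model)
    for feature in features:
        feature.explanation = verbalizer.explain(feature)
    return features


features = collect("toy/model-a", ToyProvider(), PeakVerbalizer())
for feature in features:
    print(feature.feature_id, "->", feature.explanation)
```

Because the interfaces are structural, a new backend only has to provide the right methods; it does not need to subclass anything from the core package.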
Text matching: lexical by default, semantic when you want it
The text component of a fingerprint defaults to a dependency-free lexical vector (token hashing) — deterministic, offline, and comparable across versions, but it matches shared words, not meaning. For real cross-model and cross-vocabulary matching, opt into a semantic embedder:
pip install "interp-lab[embeddings]"
# Local MiniLM (sentence-transformers): free, offline, no API key.
interp-lab inspect ... --text-embedder minilm
# or set once for a whole pipeline:
export INTERP_LAB_TEXT_EMBEDDER=minilm
Each fingerprint records the embedder that produced it, and matching refuses to compare vectors from different embedders (it drops the text component and renormalizes rather than silently cosine-ing across incompatible axes). interp-lab doctor shows the active embedder and whether the extra is installed.
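The two ideas in this section, a token-hashing lexical vector and dropping the text component when embedders differ, can be sketched in a few lines. This is an illustration under stated assumptions, not interp-lab's code: the Fingerprint fields, the 64-dimension hash space, and the component weights are all hypothetical. It uses hashlib rather than Python's built-in hash(), because the built-in is salted per process and would not be deterministic across runs.

```python
import hashlib
from dataclasses import dataclass
from math import sqrt


def hashed_text_vector(text, dim=64):
    """Bag-of-tokens vector: each token increments one hashed bucket."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vector


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Fingerprint:
    """Hypothetical, reduced fingerprint: three components plus the embedder name."""
    activation: list[float]
    causal: list[float]
    text: list[float]
    embedder: str


# Hypothetical weights; the real ranking weights are heuristic as well.
WEIGHTS = {"activation": 0.4, "causal": 0.4, "text": 0.2}


def similarity(left, right):
    """Weighted cosine over components, renormalized over the ones actually used."""
    parts = {
        "activation": cosine(left.activation, right.activation),
        "causal": cosine(left.causal, right.causal),
    }
    if left.embedder == right.embedder:
        parts["text"] = cosine(left.text, right.text)
    # Otherwise the text axes are incompatible, so the component is dropped.
    total_weight = sum(WEIGHTS[name] for name in parts)
    return sum(WEIGHTS[name] * value for name, value in parts.items()) / total_weight


lexical = Fingerprint([1.0, 0.0], [0.5, 0.5], hashed_text_vector("evaluation awareness"), "lexical")
semantic = Fingerprint([1.0, 0.0], [0.5, 0.5], [0.3, -0.1, 0.8], "minilm")

print(similarity(lexical, lexical))   # all three components compared
print(similarity(lexical, semantic))  # text dropped, remaining weights renormalized
```

Renormalizing over the remaining weights keeps scores on the same scale whether or not the text component took part, so a mismatch in embedders lowers the amount of evidence without artificially lowering the score.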
Note: ranking importance weights are heuristic — treat scores as evidence-weighted rankings, not probabilities.
See docs/ARCHITECTURE.md for the full design.
For AI agents
AGENTS.md is the operating manual for coding agents driving interp-lab: the evidence rules, the canonical agent_next_actions shape, and the core loop as runnable commands. interp-lab capabilities --json returns the whole surface — command specs, the Python API contract, environment, and conventions — in one machine-readable payload, and interp-lab mcp serves the core workflow as Model Context Protocol tools over stdio.
Documentation
- Full command reference: every CLI command and the JSONL data formats (feature dumps, activation records, intervention records).
- docs/PYTHON_API.md: the Python API.
- docs/GOLDEN_REAL_MODEL_DEMO.md: a compact real-model walkthrough (trains a small DistilGPT-2 SAE, suppresses latents, re-inspects with causal evidence, exports an attribution graph).
- Archived DistilGPT-2 run: real committed artifacts from that walkthrough, including a measured criterion-promoting SAE latent, an authentic suppression dose-response, and semantic (MiniLM) fingerprints. Open inspect-causal/report.html to see the numbers.
- docs/REAL_MODEL_DEMOS.md and examples/real_model_demos/: the broader real-model suite.
- docs/GEMMA4_WALKTHROUGH.md and docs/SCALING.md: large-model and 1T+ paths.
Common entry points:
interp-lab demo --out reports/demo # full toy tour (open reports/demo/index.html)
interp-lab quickstart # guided getting-started walkthrough
interp-lab inspect ... --csv-out features.csv # ranked features as a spreadsheet
interp-lab compare-runs --left a/report.json --right b/report.json --out diff.json # rank/score drift
interp-lab studio --serve --reports-dir reports # local browser command-builder + runner (persistent job history)
interp-lab release-check --strict # stable-release readiness
Roadmap
- Richer Natural Language Autoencoder explanation audits.
- Crosscoder training and import.
- Distributed SAE training manifests.
- Remote causal validation workers.
- Feature transfer tests across model families.
- Public example gallery with archived real-model reports (started; see examples/real_model_demos/).
Development
python -m pip install -e ".[dev]"
python -m pytest
MIT licensed. Contributions welcome — see the issue tracker.