Criterion-driven feature discovery, explanation, causal testing, and cross-model activation matching.

Maintainers

asystemoffields

Project description

interp-lab

interp-lab is an open-source toolkit for criterion-driven mechanistic interpretability.

Give it a model, a plain-language criterion, and feature evidence. It ranks the internal features that track the criterion, explains them, tests their causal impact with interventions, and searches for equivalent features in other models — then grades how much each claim is actually supported by evidence.

python -m pip install interp-lab
interp-lab doctor

# A complete tour on toy models in one command — no GPU, no downloads:
interp-lab demo --out reports/demo

interp-lab inspect \
  --model google/gemma-2-2b \
  --criterion "the model is aware it is being evaluated" \
  --backend toy \
  --out reports/eval-awareness

What makes this different

Most interpretability tooling reports correlations and lets you infer the rest. interp-lab is built to resist overclaiming:

- Correlational and causal evidence are kept separate. Association comes from activation/criterion statistics; causal effect comes from real ablation, amplification, clamp, patch, and steering runs. A feature that merely co-activates is not treated like one that moves the behavior.
- Claims are graded, not asserted. validate-matches and validate-attribution-graph mark each result as validated, needs_causal_evidence, plausible, contradicted, or weak, with reason codes.
- Controls and uncertainty are first-class. Intervention runs support random_feature, matched_frequency, and placebo controls, side-effect checks, sign-consistency, and confidence intervals. Cross-model matches with opposite-sign effects are explicitly not called equivalent.
- Everything is reproducible and agent-friendly. Runs emit manifests with the tool version, platform, and input hashes; reports include agent_next_actions with exact follow-up commands; interp_lab.public_api_contract() exposes the stable surface as data.
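
To make the controls-and-uncertainty point concrete, here is a minimal sketch of comparing a target feature's ablation effect against a random-feature control, with a percentile bootstrap confidence interval and a sign-consistency check. It is not interp-lab's implementation: the effect values, the function name bootstrap_ci, and the thresholds are all assumptions made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt change in a behavior score after ablating the target feature.
target_effects = [-0.42, -0.38, -0.51, -0.47, -0.35, -0.44, -0.40, -0.49]
# The same measurement after ablating a randomly chosen control feature.
control_effects = [0.03, -0.05, 0.01, -0.02, 0.04, -0.01, 0.02, -0.03]

# Paired difference: how much more the target moves behavior than the control does.
differences = [t - c for t, c in zip(target_effects, control_effects)]
low, high = bootstrap_ci(differences)

# Sign consistency: share of prompts whose effect has the majority sign.
negatives = sum(1 for d in differences if d < 0)
sign_consistency = max(negatives, len(differences) - negatives) / len(differences)

print(f"mean difference: {statistics.fmean(differences):.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
print(f"sign consistency: {sign_consistency:.2f}")
```

An interval that excludes zero, together with high sign consistency, is the kind of evidence that separates a feature that moves behavior from one that merely co-activates.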

The workflow

1. Compile a natural-language criterion into examples and scores.
2. Collect candidate features from SAEs, crosscoders, NLA explanations, or feature dumps.
3. Rank features by criterion association, specificity, causal evidence, and stability.
4. Build a feature fingerprint that can be compared across models.
5. Validate cross-model equivalents with interventions.

from interp_lab import compare, inspect, validate_matches

left = inspect("toy/model-a", "the model is aware it is being evaluated", backend="toy", out="reports/model-a")
right = inspect("toy/model-b", "the model is aware it is being evaluated", backend="toy", out="reports/model-b")
matches = compare(left.report, right.report, out="reports/matches.json")
validation = validate_matches(matches.report, out="reports/match-validation.json")
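
The grade names that validate-matches reports can be read as the outcome of a decision rule over similarity and causal evidence. The sketch below shows one plausible rule, assuming a hypothetical MatchEvidence record, made-up thresholds, and invented reason codes. It is an illustration of the idea, not the logic interp-lab actually applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchEvidence:
    """Hypothetical summary of the evidence behind one cross-model match."""
    similarity: float                     # fingerprint similarity in [0, 1]
    left_effect: Optional[float] = None   # measured causal effect in model A
    right_effect: Optional[float] = None  # measured causal effect in model B


def grade_match(ev, min_similarity=0.5, min_effect=0.1):
    """Return (grade, reason_codes) under an illustrative decision rule."""
    if ev.similarity < min_similarity:
        return "weak", ["low_similarity"]
    if ev.left_effect is None or ev.right_effect is None:
        # Similar fingerprints alone are correlational evidence only.
        return "needs_causal_evidence", ["missing_intervention"]
    if ev.left_effect * ev.right_effect < 0:
        # Opposite-sign effects are never called equivalent.
        return "contradicted", ["opposite_sign_effects"]
    if min(abs(ev.left_effect), abs(ev.right_effect)) < min_effect:
        return "plausible", ["small_effect"]
    return "validated", ["same_sign_effects"]


for evidence in (
    MatchEvidence(0.3),
    MatchEvidence(0.8),
    MatchEvidence(0.8, -0.4, 0.3),
    MatchEvidence(0.8, 0.05, 0.3),
    MatchEvidence(0.8, -0.4, -0.3),
):
    print(evidence, "->", grade_match(evidence))
```

The ordering of the checks matters: missing interventions are reported before any sign comparison, so a match cannot be validated on similarity alone.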

Evidence sources

interp-lab keeps portable JSONL evidence formats stable in the base package; heavier model tooling lives behind optional extras. Supported paths include toy, JSONL feature dumps, activation records, Neuronpedia, SAE Lens, Goodfire, Gemma Scope / Qwen-Scope, Hugging Face, TransformerLens, NNsight, contrast-direction, and on-demand SAE training. Each integration is an optional bridge (pip install "interp-lab[saelens]", [hf], [transformerlens], [nnsight], [goodfire], [modal], [publish], …).
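
As a rough illustration of why JSONL is a convenient portable evidence format, the sketch below reads a feature dump one record per line using only the standard library. The field names (feature_id, layer, activation) are placeholders chosen for this example; the real schema is the one given in the command reference.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line; report the line number on bad JSON."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# In-memory stand-in for a feature dump file; field names are placeholders.
sample = io.StringIO(
    '{"feature_id": "f12", "layer": 6, "activation": 0.91}\n'
    "\n"
    '{"feature_id": "f40", "layer": 6, "activation": 0.17}\n'
)

records = list(read_jsonl(sample))
strongest = max(records, key=lambda record: record["activation"])
print(len(records), "records; strongest:", strongest["feature_id"])
```

Because each line is an independent JSON document, dumps can be streamed, concatenated, and diffed without loading a whole file.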

Architecture

The core object is a FeatureFingerprint:

activation signature
+ text explanation embedding
+ decoder signature
+ causal effect vector
+ examples

Cross-model equivalence is scored by fingerprint similarity; validate-matches turns candidates into explicit evidence grades. The pipeline is built around four small adapter interfaces, so new backends are easy to add:

- FeatureProvider: returns candidate features.
- Verbalizer: adds NLA-style text explanations.
- InterventionRunner: ablates, amplifies, patches, or estimates causal effects.
- CriterionCompiler: turns natural-language criteria into examples and scoring hints.
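
The four adapters above could be modeled as structural interfaces. The sketch below uses typing.Protocol for that; the interface names come from the list, but every method name and signature is an assumption made for illustration, so consult docs/ARCHITECTURE.md for the actual contracts.

```python
from typing import Protocol, Sequence


class FeatureProvider(Protocol):
    def features(self, model: str) -> Sequence[dict]:
        """Return candidate features for a model."""
        ...


class Verbalizer(Protocol):
    def explain(self, feature: dict) -> str:
        """Return a text explanation for one feature."""
        ...


class InterventionRunner(Protocol):
    def ablate(self, feature: dict, prompts: Sequence[str]) -> float:
        """Return the measured change in behavior after ablating a feature."""
        ...


class CriterionCompiler(Protocol):
    def compile(self, criterion: str) -> Sequence[dict]:
        """Turn a plain-language criterion into scored examples."""
        ...


class StaticFeatureProvider:
    """Toy backend: satisfies FeatureProvider without inheriting from it."""

    def __init__(self, features):
        self._features = list(features)

    def features(self, model):
        return self._features


def top_feature(provider: FeatureProvider, model: str) -> dict:
    """Pick the highest-scoring candidate from any provider."""
    return max(provider.features(model), key=lambda f: f["score"])


provider = StaticFeatureProvider([{"id": "f1", "score": 0.2}, {"id": "f2", "score": 0.7}])
best = top_feature(provider, "toy/model-a")
print(best["id"])
```

Structural typing is one way to keep new backends cheap to add: a backend only needs the right methods, not a shared base class.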

Text matching: lexical by default, semantic when you want it

The text component of a fingerprint defaults to a dependency-free lexical vector (token hashing) — deterministic, offline, and comparable across versions, but it matches shared words, not meaning. For real cross-model and cross-vocabulary matching, opt into a semantic embedder:

pip install "interp-lab[embeddings]"

# Local MiniLM (sentence-transformers): free, offline, no API key.
interp-lab inspect ... --text-embedder minilm
# or set once for a whole pipeline:
export INTERP_LAB_TEXT_EMBEDDER=minilm
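
To see why a token-hashing vector matches shared words rather than meaning, consider the minimal sketch below. It is an assumption-laden illustration, not interp-lab's lexical embedder: the function name lexical_vector, the 1024-bucket dimension, and the use of SHA-256 for bucketing are all choices made for this example.

```python
import hashlib
import math
import re


def lexical_vector(text, dim=1024):
    """Hash each lowercase word token into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib is stable across runs; the built-in hash() is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    # Inputs are unit-length (or all zeros), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


# Same words in a different order: the bag-of-tokens vectors coincide.
same_words = cosine(
    lexical_vector("model is being evaluated"),
    lexical_vector("evaluated model is being"),
)
# Related meaning, no shared words: similarity comes only from chance collisions.
paraphrase = cosine(
    lexical_vector("the model is being evaluated"),
    lexical_vector("this looks like a test"),
)
print(f"same words: {same_words:.3f}, paraphrase: {paraphrase:.3f}")
```

The vector is deterministic and needs no downloads, but two explanations phrased with different vocabulary score close to zero, which is the gap a semantic embedder is meant to close.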

Each fingerprint records the embedder that produced it, and matching refuses to compare text vectors from different embedders: it drops the text component and renormalizes the remaining weights rather than silently computing cosine similarity across incompatible axes. interp-lab doctor shows the active embedder and whether the extra is installed.
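
The drop-and-renormalize behavior can be sketched as a weighted mean of per-component cosines in which the text weight is removed when the embedders differ. The component names, weights, and dictionary layout below are assumptions for illustration only and do not reflect the FeatureFingerprint class itself.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fingerprint_similarity(left, right, weights):
    """Weighted mean of per-component cosines over the comparable components."""
    usable = dict(weights)
    if left["embedder"] != right["embedder"]:
        # Text vectors from different embedders live in unrelated spaces.
        usable.pop("text", None)
    total = sum(usable.values())
    if total == 0:
        return 0.0
    # Dividing by `total` renormalizes the remaining weights to sum to 1.
    return sum(w / total * cosine(left[name], right[name]) for name, w in usable.items())


weights = {"activation": 0.4, "text": 0.4, "causal": 0.2}  # hypothetical weights
left = {"embedder": "lexical", "activation": [1.0, 0.0], "text": [1.0, 0.0], "causal": [0.5, 0.5]}
right = {"embedder": "minilm", "activation": [1.0, 0.0], "text": [0.0, 1.0], "causal": [0.5, 0.5]}

mismatched = fingerprint_similarity(left, right, weights)
matched = fingerprint_similarity(left, dict(right, embedder="lexical"), weights)
print(f"different embedders: {mismatched:.2f}, same embedder: {matched:.2f}")
```

With different embedders the orthogonal text vectors are ignored, so the score reflects only the activation and causal components; with the same embedder the text disagreement counts against the match.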

Note: ranking importance weights are heuristic, not calibrated against ground truth — treat scores as evidence-weighted rankings, not probabilities.
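
A small sketch shows why heuristic weights yield rankings rather than probabilities: the same two features swap places when the weights change. The weight values, feature scores, and the function name rank_features are invented for this example and are not interp-lab's defaults.

```python
# Hypothetical weights over the four ranking signals named in the workflow.
DEFAULT_WEIGHTS = {"association": 0.4, "specificity": 0.2, "causal": 0.3, "stability": 0.1}
# An alternative weighting that ignores causal evidence entirely.
ASSOCIATION_ONLY = {"association": 0.7, "specificity": 0.2, "causal": 0.0, "stability": 0.1}


def rank_features(features, weights):
    """Sort features by a weighted sum of their evidence signals, best first."""
    def score(feature):
        return sum(w * feature.get(name, 0.0) for name, w in weights.items())
    return sorted(features, key=score, reverse=True)


features = [
    # Strongly associated with the criterion, but no measured causal effect.
    {"id": "A", "association": 0.9, "specificity": 0.8, "causal": 0.0, "stability": 0.7},
    # Weaker association, but interventions on it move the behavior.
    {"id": "B", "association": 0.6, "specificity": 0.6, "causal": 0.9, "stability": 0.8},
]

default_order = [f["id"] for f in rank_features(features, DEFAULT_WEIGHTS)]
association_order = [f["id"] for f in rank_features(features, ASSOCIATION_ONLY)]
print("with causal weight:", default_order)
print("without causal weight:", association_order)
```

Since the order depends on the chosen weights, a score of 0.7 does not mean a 70% chance that the feature tracks the criterion; it only places the feature relative to others under that weighting.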

See docs/ARCHITECTURE.md for the full design.

Documentation

- Full command reference: every CLI command and the JSONL data formats (feature dumps, activation records, intervention records).
- docs/PYTHON_API.md: the Python API.
- docs/GOLDEN_REAL_MODEL_DEMO.md: a compact real-model walkthrough (trains a small DistilGPT-2 SAE, suppresses latents, re-inspects with causal evidence, exports an attribution graph).
- Archived DistilGPT-2 run: real committed artifacts from that walkthrough (a measured criterion-promoting SAE latent, an authentic suppression dose-response, and semantic (MiniLM) fingerprints). Open inspect-causal/report.html to see the numbers.
- docs/REAL_MODEL_DEMOS.md and examples/real_model_demos/: the broader real-model suite.
- docs/GEMMA4_WALKTHROUGH.md and docs/SCALING.md: large-model and 1T+ paths.

Common entry points:

interp-lab demo --out reports/demo            # full toy tour
interp-lab studio --serve --reports-dir reports   # local browser command-builder + runner
interp-lab release-check --strict             # stable-release readiness

Roadmap

- Richer Natural Language Autoencoder explanation audits.
- Crosscoder training and import.
- Distributed SAE training manifests.
- Remote causal validation workers.
- Feature transfer tests across model families.
- Public example gallery with archived real-model reports (started; see examples/real_model_demos/).

Development

python -m pip install -e ".[dev]"
python -m pytest

MIT licensed. Contributions welcome — see the issue tracker.

Release history

Version	Release date
3.1.0	Jun 10, 2026
3.0.0	Jun 10, 2026
2.3.0	Jun 10, 2026
2.2.0	Jun 5, 2026
2.0.0	May 20, 2026 (this version)
1.0.0	May 20, 2026
0.2.0	May 18, 2026
0.1.0	May 18, 2026

Download files

Source Distribution

interp_lab-2.0.0.tar.gz (789.5 kB)

Uploaded May 20, 2026 Source

Built Distribution

interp_lab-2.0.0-py3-none-any.whl (241.1 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file interp_lab-2.0.0.tar.gz.

File metadata

Download URL: interp_lab-2.0.0.tar.gz
Upload date: May 20, 2026
Size: 789.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for interp_lab-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`1cb057f27453822c64ad69d4cb98a5e499ed3bd25fcd3b69dc86bd047955de7e`
MD5	`45f0f6a950db9fd8c9b2a736919b81f0`
BLAKE2b-256	`c54ec23b40f2156cfc2541aa54e6aaa86600f2d24b49bc4e5dd081bc124d5e79`
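
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The sketch below assumes the file sits in the current directory under its published name; the helper names sha256_of and verify are made up for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for interp_lab-2.0.0.tar.gz in the table above.
EXPECTED_SHA256 = "1cb057f27453822c64ad69d4cb98a5e499ed3bd25fcd3b69dc86bd047955de7e"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


archive = Path("interp_lab-2.0.0.tar.gz")  # assumed local download location
if archive.exists():
    print("digest matches" if verify(archive) else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

For installs rather than manual downloads, pip's hash-checking mode (a requirements file with --hash=sha256:... entries, installed with --require-hashes) performs the same comparison automatically.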

Provenance

The following attestation bundles were made for interp_lab-2.0.0.tar.gz:

Publisher: publish.yml on asystemoffields/interp-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interp_lab-2.0.0.tar.gz
- Subject digest: 1cb057f27453822c64ad69d4cb98a5e499ed3bd25fcd3b69dc86bd047955de7e
- Sigstore transparency entry: 1587138902
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: asystemoffields/interp-lab@2b42316d33f703b21127b65476c483b6a9b86c3a
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/asystemoffields
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2b42316d33f703b21127b65476c483b6a9b86c3a
- Trigger Event: release

File details

Details for the file interp_lab-2.0.0-py3-none-any.whl.

File metadata

Download URL: interp_lab-2.0.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 241.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for interp_lab-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a67570d620515aa74ea0b6116398011887aaae8479a4507bf8eca3fffc893325`
MD5	`cd3e92e92c15b214dc118023d08410fe`
BLAKE2b-256	`f37c2502c6fb1c8ba8a94e3c5afd359b5e156df41a78f136e5bde3de05642e33`

Provenance

The following attestation bundles were made for interp_lab-2.0.0-py3-none-any.whl:

Publisher: publish.yml on asystemoffields/interp-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interp_lab-2.0.0-py3-none-any.whl
- Subject digest: a67570d620515aa74ea0b6116398011887aaae8479a4507bf8eca3fffc893325
- Sigstore transparency entry: 1587139258
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: asystemoffields/interp-lab@2b42316d33f703b21127b65476c483b6a9b86c3a
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/asystemoffields
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2b42316d33f703b21127b65476c483b6a9b86c3a
- Trigger Event: release

interp-lab 2.0.0
