Criterion-driven feature discovery, explanation, causal testing, and cross-model activation matching.
interp-lab
interp-lab is an open-source starter kit for criterion-driven mechanistic interpretability.
Give it a model, a criterion, and feature evidence. It ranks internal features, explains them, tests causal impact, and searches for equivalent features in other models.
Quick start:
interp-lab inspect \
--model google/gemma-2-2b \
--criterion "the model is aware it is being evaluated" \
--backend toy \
--out reports/eval-awareness
Python API:
from interp_lab import compare, inspect

left = inspect(
    "toy/model-a",
    "the model is aware it is being evaluated",
    backend="toy",
    out="reports/model-a",
)
right = inspect(
    "toy/model-b",
    "the model is aware it is being evaluated",
    backend="toy",
    out="reports/model-b",
)
matches = compare(left.report, right.report, out="reports/matches.json")
The package includes toy, JSONL, activation-record, Neuronpedia, SAE Lens, Goodfire, Gemma Scope/Qwen-Scope, Hugging Face, TransformerLens, NNsight, contrast-direction, and on-demand SAE training paths. It is shaped around adapter interfaces for real activation hooks, SAEs, crosscoders, and natural-language autoencoders.
Why This Exists
The goal is to get close to an "oracular SAE" workflow:
- Compile a natural-language criterion into examples and scores.
- Collect candidate features from SAEs, crosscoders, NLA explanations, or feature dumps.
- Rank features by criterion association, specificity, causal evidence, and stability.
- Build a feature fingerprint that can be compared across models.
- Validate cross-model equivalents with interventions.
Commands
Check your local environment:
interp-lab doctor
Profile the current machine and list route options:
interp-lab profile-env --out reports/env-profile.json --json
Run a criterion inspection:
interp-lab inspect --model toy/a --criterion "Python security bug" --backend toy
Compare two reports:
interp-lab match \
--left reports/a/report.json \
--right reports/b/report.json \
--out reports/matches.json
This writes both matches.json and a readable markdown report with labels, component scores, and signed effects when present.
Create a demo run:
interp-lab demo --out reports/demo
Run a reproducible workflow from config:
interp-lab run examples/run_records.json
This writes a run manifest with the tool version, platform, input hashes, executed steps, and output paths. Run configs can be JSON, TOML, or YAML.
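The manifest idea can be illustrated with a short standard-library sketch. The field names and the build_manifest helper below are assumptions made for illustration, not the schema interp-lab actually writes; the point is only that hashing each input lets a later run confirm it used the same files.

```python
import hashlib
import os
import platform
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tool_version, input_paths, steps, output_paths):
    """Hypothetical manifest layout: version, platform, input hashes, steps, outputs."""
    return {
        "tool_version": tool_version,
        "platform": platform.platform(),
        "inputs": {path: sha256_file(path) for path in input_paths},
        "steps": list(steps),
        "outputs": list(output_paths),
    }


# Tiny demonstration with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp:
    config = os.path.join(tmp, "run.json")
    with open(config, "wb") as handle:
        handle.write(b"abc")
    manifest = build_manifest("0.2.0", [config], ["inspect"], ["reports/demo"])
    digest = manifest["inputs"][config]
```

Comparing the stored digests against freshly computed ones is one simple way to detect that an input changed between runs.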
Export activation records from a real Hugging Face model:
interp-lab export-hf-records \
--model distilgpt2 \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--out reports/real-small/distilgpt2-unit/records.jsonl
Export activation records from TransformerLens hooks:
python -m pip install "interp-lab[transformerlens]"
interp-lab export-transformerlens-records \
--model gpt2-small \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--layers 6 \
--out reports/tl/gpt2-small-layer6-records.jsonl
Export activation records from NNsight traces:
python -m pip install "interp-lab[nnsight]"
interp-lab export-nnsight-records \
--model openai-community/gpt2 \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--activation-path transformer.h[6].output[0] \
--out reports/nnsight/gpt2-layer6-records.jsonl
Export ablation records for top hidden-dimension features:
interp-lab export-hf-interventions \
--model distilgpt2 \
--report reports/real-small/distilgpt2-unit/inspect/report.json \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--criterion "the next token should be a physical measurement unit" \
--out reports/real-small/distilgpt2-unit/interventions.jsonl
Export a contrast-direction feature and calibrate a causal steering strength:
interp-lab export-hf-contrast \
--model distilgpt2 \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--criterion "the next token should be a physical measurement unit" \
--records-out reports/real-small/distilgpt2-unit/contrast-records.jsonl \
--interventions-out reports/real-small/distilgpt2-unit/contrast-interventions.jsonl \
--strength-sweep "3,10,30,100"
export-hf-contrast learns a positive-minus-negative hidden-state direction from scored prompts. When --strength-sweep is set, it tests each steering strength on positive prompts, uses negative prompts as side-effect checks, and writes intervention rows for the most specific setting.
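As a rough illustration of the idea, here is a plain-Python sketch of a positive-minus-negative direction and a strength sweep. It assumes one pooled hidden-state vector per prompt, a 0.5 score threshold, and a "criterion effect minus side effect" selection rule; all function names are hypothetical and the exporter's real selection logic may differ.

```python
from statistics import fmean


def contrast_direction(hidden_states, scores, threshold=0.5):
    """Mean positive-prompt state minus mean negative-prompt state."""
    pos = [h for h, s in zip(hidden_states, scores) if s >= threshold]
    neg = [h for h, s in zip(hidden_states, scores) if s < threshold]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative prompt")
    dim = len(hidden_states[0])
    return [fmean(h[i] for h in pos) - fmean(h[i] for h in neg) for i in range(dim)]


def steer(hidden_state, direction, strength):
    """Add the scaled direction to a hidden state."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]


def most_specific(sweep):
    """sweep maps strength -> (effect on positives, side effect on negatives).

    Assumed rule: prefer the strength with the largest effect net of side effects.
    """
    return max(sweep, key=lambda s: sweep[s][0] - abs(sweep[s][1]))


states = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
scores = [1.0, 1.0, 0.0, 0.0]
direction = contrast_direction(states, scores)
steered = steer([0.0, 0.0], direction, 10)
best = most_specific({3: (0.1, 0.0), 10: (0.4, 0.05), 30: (0.6, 0.5)})
```

In this toy sweep the middle strength wins because the largest strength moves the negative prompts almost as much as the positive ones.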
Train an SAE when no public SAE exists:
interp-lab train-sae \
--preset minimal \
--hf-model distilgpt2 \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--layer 6 \
--latent-dim 64 \
--epochs 50 \
--out reports/real-small/distilgpt2-unit/trained-sae/sae.json \
--records-out reports/real-small/distilgpt2-unit/trained-sae/records.jsonl
Use --preset minimal for quick local exploration. It trains on one activation row per prompt and keeps the compute footprint small.
Use --preset production when you want a stronger artifact:
interp-lab train-sae \
--preset production \
--hf-model distilgpt2 \
--dataset examples/hf_prompts_unit_prediction.jsonl \
--layer 6 \
--latent-dim 1024 \
--out reports/production-sae/sae.json \
--records-out reports/production-sae/records.jsonl \
--causal-out reports/production-sae/interventions.jsonl \
--criterion "the next token should be a physical measurement unit"
Production mode uses token-level activation rows, top-k sparse codes, held-out reconstruction metrics, dead-latent reporting, and optional SAE-latent steering interventions when --causal-out is provided. You can override any preset choice, such as --epochs, --batch-size, --top-k, or --max-records.
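Two of the terms above, top-k sparse codes and dead latents, can be sketched in a few lines of plain Python. This is a generic illustration of the techniques, not the trainer's implementation; the function names are hypothetical.

```python
def top_k_code(pre_activations, k):
    """Keep the k largest pre-activations (clipped at zero) and zero the rest."""
    ranked = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(ranked[:k])
    return [max(a, 0.0) if i in keep else 0.0 for i, a in enumerate(pre_activations)]


def dead_latents(codes):
    """Indices of latents that are zero in every code of the batch."""
    width = len(codes[0])
    return [j for j in range(width) if all(code[j] == 0.0 for code in codes)]


code = top_k_code([0.5, -1.0, 2.0, 0.1], k=2)
dead = dead_latents([code, [0.0, 0.0, 1.0, 0.0]])
```

A latent that stays dead across a held-out set contributes nothing to reconstruction, which is why reporting the count is a useful health check for a trained SAE.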
Then inspect the learned SAE latents with the normal records backend:
interp-lab inspect \
--model distilgpt2 \
--criterion "the next token should be a physical measurement unit" \
--backend records \
--records reports/real-small/distilgpt2-unit/trained-sae/records.jsonl \
--out reports/real-small/distilgpt2-unit/trained-sae/inspect
train-sae can also train from an existing activation-record JSONL:
interp-lab train-sae \
--records reports/real-small/distilgpt2-unit/records.jsonl \
--model distilgpt2 \
--latent-dim 256 \
--method auto \
--out reports/sae/sae.json \
--records-out reports/sae/records.jsonl
Training uses PyTorch when available. --method fallback uses a deterministic sparse dictionary trainer, which is useful for small runs, constrained environments, and smoke tests. Set --latent-dim directly for any SAE width, or use --expansion-factor to scale from the input dimension. By default, the exported activation records include every learned latent; --top-k-features can compress large runs. --max-records bounds training on large JSONL streams with deterministic reservoir sampling.
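Deterministic reservoir sampling can be sketched as follows. This is the textbook Algorithm R with a fixed seed, shown under the assumption that the trainer does something comparable; the function name and seed handling are illustrative only.

```python
import random


def reservoir_sample(rows, max_records, seed=0):
    """Uniformly sample up to max_records rows from a stream in one pass.

    A seeded random.Random makes the selection reproducible across runs.
    """
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < max_records:
            sample.append(row)
        else:
            # Replace an existing slot with probability max_records / (index + 1).
            slot = rng.randint(0, index)
            if slot < max_records:
                sample[slot] = row
    return sample


first = reservoir_sample(range(1000), max_records=10, seed=7)
second = reservoir_sample(range(1000), max_records=10, seed=7)
```

Memory stays bounded by max_records regardless of stream length, which is what makes the approach suitable for large JSONL inputs.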
Rank features from per-prompt activation records:
interp-lab inspect \
--model my/model \
--criterion "the model is aware it is being evaluated" \
--backend records \
--records examples/activation_records.jsonl \
--out reports/eval-awareness
Add causal intervention evidence:
interp-lab inspect \
--model my/model \
--criterion "the model is aware it is being evaluated" \
--backend records \
--records examples/activation_records.jsonl \
--interventions examples/interventions.jsonl \
--out reports/eval-awareness-causal
Import selected features from Neuronpedia:
interp-lab inspect \
--model gpt2-small \
--criterion "mentions of measurements in meters or feet" \
--backend neuronpedia \
--neuronpedia-feature gpt2-small@6-res_scefr-ajt:650 \
--out reports/neuronpedia-measurements
Import selected features from a pretrained SAE Lens SAE:
python -m pip install "interp-lab[saelens]"
interp-lab inspect \
--model gpt2-small \
--criterion "numeric measurements" \
--backend saelens \
--saelens-release gpt2-small-res-jb \
--saelens-sae-id blocks.6.hook_resid_pre \
--saelens-feature-indexes 650 \
--out reports/saelens-feature
Import Goodfire features:
python -m pip install "interp-lab[goodfire]"
interp-lab inspect \
--model meta-llama/Llama-3.1-8B-Instruct \
--criterion "formal writing style" \
--backend goodfire \
--goodfire-top-k 20 \
--out reports/goodfire-formal-style
Import selected features from named SAE suites:
interp-lab inspect \
--model google/gemma-2-2b \
--criterion "numeric measurements" \
--backend scope \
--scope-source gemma-scope \
--scope-release <saelens-release-or-hf-repo> \
--scope-sae-id blocks.6.hook_resid_post \
--scope-feature-indexes 650 \
--out reports/gemma-scope-feature
Publish reports or artifact folders to Hugging Face Hub:
python -m pip install "interp-lab[publish]"
interp-lab publish-hf-artifact \
--repo-id your-user/interp-lab-demo \
--repo-type dataset \
--path reports/real-small/distilgpt2-unit \
--tag sae \
--tag activation-records
Export a report as a causal attribution graph:
interp-lab export-attribution-graph \
--report reports/eval-awareness/report.json \
--out reports/eval-awareness/graph.json \
--include-similarity-edges
Plan a large run before harvesting activations:
interp-lab plan-scale \
--model-params 1T \
--tokens 1B \
--d-model 16384 \
--selected-layers 8 \
--latent-dim 1M \
--from-env \
--target-shard-size 64GB \
--out reports/scale-plan.json
JSONL Feature Dumps
You can inspect a model from a JSONL feature dump:
interp-lab inspect \
--model my/model \
--criterion "refusal behavior" \
--features examples/features.jsonl \
--out reports/refusal
Each row should look like this:
{
  "feature_id": "L18:F104921",
  "model": "my/model",
  "layer": 18,
  "label": "constructed benchmark or test scenario",
  "examples": ["This looks like a test case...", "The prompt appears artificial..."],
  "activation_signature": [0.9, 0.2, 0.1],
  "decoder_signature": [0.1, -0.4, 0.3],
  "causal_effects": {"criterion": 0.34, "refusal": 0.12},
  "source": "sae"
}
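The row above is pretty-printed for readability; in a JSONL file each row occupies a single line. A minimal sketch of writing and reading such a dump with the standard library follows. The helper names are hypothetical and no field validation is implied.

```python
import io
import json


def write_jsonl(rows, handle):
    """Write one compact JSON object per line."""
    for row in rows:
        handle.write(json.dumps(row) + "\n")


def read_jsonl(handle):
    """Parse every non-empty line as one JSON object."""
    return [json.loads(line) for line in handle if line.strip()]


row = {
    "feature_id": "L18:F104921",
    "model": "my/model",
    "layer": 18,
    "label": "constructed benchmark or test scenario",
    "activation_signature": [0.9, 0.2, 0.1],
    "source": "sae",
}
buffer = io.StringIO()
write_jsonl([row], buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
```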
Activation Records
Activation records are the most flexible import path. Use them when you have per-prompt or per-token feature activations from an SAE, crosscoder, NLA probe, Neuronpedia script, remote activation harvester, or custom hook.
Each row is one prompt or token position:
{
  "model": "my/model",
  "prompt_id": "eval-1",
  "text": "This looks like a benchmark task...",
  "criterion_score": 1.0,
  "features": [
    {
      "feature_id": "L18:F104921",
      "activation": 0.92,
      "label": "constructed benchmark or test scenario",
      "layer": 18,
      "decoder_signature": [0.1, -0.4, 0.3, 0.2]
    }
  ]
}
interp-lab streams records by feature, estimates criterion association from sufficient statistics, preserves top activating examples, and creates a feature fingerprint for matching. Add intervention records when you want causal evidence in the report.
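To make "association from sufficient statistics" concrete, here is a sketch that computes a Pearson correlation per feature in a single pass, keeping only running sums. Pearson correlation and the treatment of absent features as zero activation are assumptions for this example; the statistic interp-lab actually uses may differ.

```python
import json
import math
from collections import defaultdict


def stream_association(lines):
    """Correlate each feature's activation with criterion_score in one pass.

    Only running sums are stored, so memory grows with the number of
    features, not the number of rows. Features missing from a row are
    treated as activation 0.0.
    """
    n = 0
    sum_y = sum_yy = 0.0
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])  # sum_x, sum_xx, sum_xy
    for line in lines:
        row = json.loads(line)
        y = float(row["criterion_score"])
        n += 1
        sum_y += y
        sum_yy += y * y
        for feature in row["features"]:
            x = float(feature["activation"])
            stats = sums[feature["feature_id"]]
            stats[0] += x
            stats[1] += x * x
            stats[2] += x * y
    result = {}
    for feature_id, (sum_x, sum_xx, sum_xy) in sums.items():
        cov = n * sum_xy - sum_x * sum_y
        var_x = n * sum_xx - sum_x * sum_x
        var_y = n * sum_yy - sum_y * sum_y
        ok = var_x > 0 and var_y > 0
        result[feature_id] = cov / math.sqrt(var_x * var_y) if ok else 0.0
    return result


records = [
    {"criterion_score": 1.0, "features": [{"feature_id": "A", "activation": 1.0}]},
    {"criterion_score": 0.0, "features": [{"feature_id": "B", "activation": 0.5}]},
    {"criterion_score": 1.0, "features": [{"feature_id": "A", "activation": 1.0}]},
    {"criterion_score": 0.0, "features": [{"feature_id": "B", "activation": 0.5}]},
]
association = stream_association(json.dumps(r) for r in records)
```

In this toy stream feature A fires only on criterion-positive rows and feature B only on negative rows, so they land at opposite ends of the correlation range.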
Intervention Records
Intervention records let the report distinguish correlational evidence from causal evidence. Each row is one ablation, amplification, clamp, patch, or steering run:
{
  "model": "my/model",
  "feature_id": "L18:F104921",
  "criterion": "the model is aware it is being evaluated",
  "intervention": "ablate",
  "prompt_id": "eval-1",
  "baseline_score": 0.92,
  "intervention_score": 0.31,
  "side_effect_score": 0.04
}
For ablate, zero, remove, knockout, suppress, and clamp_down, a score drop is treated as evidence the feature promotes the criterion. For amplify, steer, patch, patch_in, clamp, and clamp_up, a score rise is treated as evidence the feature promotes the criterion.
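The sign convention can be written down as a small function. The two intervention groups come from the paragraph above; the function name and the choice to return a single signed number are illustrative assumptions.

```python
REMOVAL = {"ablate", "zero", "remove", "knockout", "suppress", "clamp_down"}
ADDITION = {"amplify", "steer", "patch", "patch_in", "clamp", "clamp_up"}


def promotion_evidence(row):
    """Positive values suggest the feature promotes the criterion.

    Removal-style interventions count a score drop as evidence;
    addition-style interventions count a score rise as evidence.
    """
    delta = row["intervention_score"] - row["baseline_score"]
    kind = row["intervention"]
    if kind in REMOVAL:
        return -delta
    if kind in ADDITION:
        return delta
    raise ValueError(f"unknown intervention type: {kind}")


ablation = {"intervention": "ablate", "baseline_score": 0.92, "intervention_score": 0.31}
steering = {"intervention": "steer", "baseline_score": 0.20, "intervention_score": 0.50}
ablation_evidence = promotion_evidence(ablation)
steering_evidence = promotion_evidence(steering)
```

Putting both families on one signed scale lets ablation and steering rows for the same feature be averaged together.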
Hugging Face exporters use positive-scored prompts for criterion effects and negative-scored prompts for side-effect estimates. That makes a report prefer features that move the requested behavior while leaving nearby unrelated prompts stable.
Rows with a criterion field are matched to the CLI criterion by normalized exact text. Omit criterion, or pass --allow-intervention-criterion-mismatch, when you want to reuse intervention files across paraphrased criteria.
Control rows can be included in the same intervention JSONL by setting metadata.control_type to values such as random_feature, matched_frequency, or placebo. Reports include confidence intervals, control-effect summaries, and a strong_causal_score.
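A sketch of how control rows might be separated and summarized is shown below. The normal-approximation interval and the helper names are assumptions for illustration; this does not reproduce how strong_causal_score or the report's intervals are actually computed.

```python
import math
from statistics import fmean, stdev


def split_controls(rows):
    """Separate treatment rows from rows that set metadata.control_type."""
    treated, controls = [], []
    for row in rows:
        control_type = (row.get("metadata") or {}).get("control_type")
        (controls if control_type else treated).append(row)
    return treated, controls


def mean_with_ci(values, z=1.96):
    """Mean with a normal-approximation 95% interval (assumed method)."""
    mean = fmean(values)
    if len(values) < 2:
        return mean, mean, mean
    half_width = z * stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


rows = [
    {"feature_id": "L18:F104921", "intervention": "ablate"},
    {"feature_id": "L18:F000001", "metadata": {"control_type": "random_feature"}},
    {"feature_id": "L18:F000002", "metadata": {"control_type": "placebo"}},
]
treated, controls = split_controls(rows)
mean, low, high = mean_with_ci([0.61, 0.55, 0.58, 0.64])
```

If the interval for the treated feature overlaps the interval for its controls, the apparent causal effect is hard to distinguish from what a random feature would produce.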
Neuronpedia
The Neuronpedia backend reads the public feature JSON endpoint documented by Neuronpedia. It accepts refs like:
gpt2-small@6-res_scefr-ajt:650
https://www.neuronpedia.org/gpt2-small/6-res_scefr-ajt/650
https://www.neuronpedia.org/api/feature/gpt2-small/6-res_scefr-ajt/650
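The three ref shapes above can be normalized to one tuple with a small parser. This is a sketch of the parsing only; the function name is hypothetical and it makes no network calls.

```python
import re
from urllib.parse import urlparse


def parse_neuronpedia_ref(ref):
    """Return (model, source, index) for a short ref, page URL, or API URL."""
    if ref.startswith(("http://", "https://")):
        parts = [part for part in urlparse(ref).path.split("/") if part]
        if parts[:2] == ["api", "feature"]:
            parts = parts[2:]
        if len(parts) != 3:
            raise ValueError(f"unrecognized Neuronpedia URL: {ref}")
        model, source, index = parts
    else:
        match = re.fullmatch(r"([^@]+)@([^:]+):(\d+)", ref)
        if match is None:
            raise ValueError(f"unrecognized Neuronpedia ref: {ref}")
        model, source, index = match.groups()
    return model, source, int(index)


refs = [
    "gpt2-small@6-res_scefr-ajt:650",
    "https://www.neuronpedia.org/gpt2-small/6-res_scefr-ajt/650",
    "https://www.neuronpedia.org/api/feature/gpt2-small/6-res_scefr-ajt/650",
]
parsed = [parse_neuronpedia_ref(ref) for ref in refs]
```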
Neuronpedia features include dashboard evidence, autointerp explanations, top activating examples, logits, sparsity, and related metadata. interp-lab converts those into feature evidence and fingerprints.
SAE Lens
The SAE Lens backend is optional because it can pull in heavier model tooling. It uses SAE.from_pretrained_with_cfg_and_sparsity() when available, extracts selected decoder rows, and wraps them as interp-lab feature evidence. For criterion ranking over real prompts, export SAE activations into activation records and run the records backend.
Ecosystem Bridges
- Goodfire: semantic feature search through the Goodfire SDK.
- Neuronpedia: public feature endpoint import.
- SAE Lens: pretrained SAE decoder-row import.
- Gemma Scope and Qwen-Scope: named wrappers around SAE-suite metadata.
- TransformerLens: hook-cache activation export.
- NNsight: trace-based activation export for local or remote model execution.
- Hugging Face Hub: artifact publishing for reports, records, interventions, and trained SAE metadata.
Each bridge is optional. The base package keeps the portable JSONL evidence formats stable, while heavier model tooling lives behind extras.
Scaling
For large models, use interp-lab as the orchestration and evidence layer:
- Harvest activations through the environment that can run the model.
- Write sharded activation records or SAE feature records.
- Train or import SAEs against those shards.
- Stream records into inspection reports.
- Run causal validation in resumable batches.
- Publish reports, graphs, and artifacts with manifests.
interp-lab profile-env inspects CPU cores, RAM, disk space, local accelerators, optional packages, and sanitized environment flags such as whether Goodfire or NNsight credentials are present. It returns advisory route options, including local CPU, single GPU, cluster, remote API, and frontier-lab-style harvesting.
interp-lab plan-scale accepts human-friendly sizes such as 70B, 1T, 1B, and 64GB. It estimates dense activation storage, sparse feature-record storage, SAE parameter storage, causal validation forward passes, shard counts, risk flags, and suggested next actions for agents. Add --from-env to profile the current machine while planning, or --env-profile other-machine.json to plan against a saved profile from another environment. Every route suggestion can be overridden with --profile. Use --json or --out scale-plan.json when an AI agent or workflow should consume the plan directly. See docs/SCALING.md for the 1T+ path.
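For a feel of the arithmetic behind such a plan, the sketch below parses the human-friendly sizes and estimates dense activation storage for the plan-scale example shown earlier. It assumes decimal units (1 GB = 10^9 bytes), 2 bytes per activation value, and one d-model-sized vector per token per selected layer; the planner's own formulas may differ.

```python
COUNT_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
BYTE_SUFFIXES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_count(text):
    """Parse counts such as '70B' or '1T' (B = billion, T = trillion)."""
    text = text.strip().upper()
    if text[-1] in COUNT_SUFFIXES:
        return int(float(text[:-1]) * COUNT_SUFFIXES[text[-1]])
    return int(text)


def parse_bytes(text):
    """Parse byte sizes such as '64GB' using decimal units (assumption)."""
    text = text.strip().upper()
    for suffix, scale in BYTE_SUFFIXES.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * scale)
    return int(text)


def dense_activation_bytes(tokens, d_model, layers, bytes_per_value=2):
    """One d_model vector per token per layer, 2 bytes per value (assumption)."""
    return tokens * d_model * layers * bytes_per_value


tokens = parse_count("1B")
total_bytes = dense_activation_bytes(tokens, d_model=16384, layers=8)
shard_bytes = parse_bytes("64GB")
shard_count = -(-total_bytes // shard_bytes)  # ceiling division
```

Under these assumptions the example needs roughly 262 TB of dense activations split across 4096 shards, which is why sparse feature records or streaming are usually preferred at this scale.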
Architecture
The core object is a FeatureFingerprint:
activation signature
+ text explanation embedding
+ decoder signature
+ causal effect vector
+ examples
Cross-model equivalence is scored by fingerprint similarity. A match becomes interesting when it also preserves intervention effects.
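One plausible way to score fingerprint similarity is a weighted mean of per-component cosine similarities, skipping components that are missing or not comparable. The weights, key names, and skipping rule below are assumptions for illustration, not the package's scoring function; the text-explanation embedding is omitted because it needs an embedding model.

```python
import math

# Hypothetical component weights.
WEIGHTS = {"activation_signature": 0.4, "decoder_signature": 0.3, "causal_effects": 0.3}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fingerprint_similarity(left, right, weights=WEIGHTS):
    """Weighted mean of cosine similarities over comparable components."""
    total = weight_sum = 0.0
    for key, weight in weights.items():
        a, b = left.get(key), right.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            shared = sorted(set(a) & set(b))  # compare only shared effect names
            a, b = [a[k] for k in shared], [b[k] for k in shared]
        if not a or not b or len(a) != len(b):
            continue  # missing or differently shaped component
        total += weight * cosine(a, b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


left = {
    "activation_signature": [0.9, 0.2, 0.1],
    "decoder_signature": [0.1, -0.4, 0.3],
    "causal_effects": {"criterion": 0.34, "refusal": 0.12},
}
opposite = {"activation_signature": [-0.9, -0.2, -0.1]}
same_score = fingerprint_similarity(left, left)
opposite_score = fingerprint_similarity(left, opposite)
```

Decoder signatures from models with different hidden sizes are skipped here rather than truncated, since comparing raw coordinates across unrelated bases would not be meaningful.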
Adapters are intentionally small:
- FeatureProvider: returns candidate features.
- Verbalizer: adds NLA-style text explanations.
- InterventionRunner: ablates, amplifies, patches, or estimates causal effects.
- CriterionCompiler: turns natural-language criteria into examples and scoring hints.
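The four adapter roles could be expressed as structural interfaces along these lines. The class names match the list above, but every method name and signature here is hypothetical; check the package source for the real interfaces before implementing an adapter.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class FeatureProvider(Protocol):
    def features(self, model: str) -> Iterable[dict]: ...


@runtime_checkable
class Verbalizer(Protocol):
    def explain(self, feature: dict) -> str: ...


@runtime_checkable
class InterventionRunner(Protocol):
    def run(self, feature: dict, intervention: str) -> dict: ...


@runtime_checkable
class CriterionCompiler(Protocol):
    def compile(self, criterion: str) -> dict: ...


class ListProvider:
    """Toy provider that serves features from an in-memory list."""

    def __init__(self, rows):
        self.rows = list(rows)

    def features(self, model):
        return [row for row in self.rows if row.get("model") == model]


provider = ListProvider([{"feature_id": "L18:F104921", "model": "my/model"}])
found = list(provider.features("my/model"))
```

Because the protocols are structural, ListProvider satisfies FeatureProvider without inheriting from it, which keeps adapters small and free of base-class coupling.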
Roadmap
- Natural Language Autoencoder adapter.
- Crosscoder training and import.
- Rich HTML feature cards.
- Distributed SAE training manifests.
- Remote causal validation workers.
- Feature transfer tests across model families.
Development
python -m pip install -e ".[dev]"
python -m pytest