
Lightweight workbench for cross-architecture mechanistic interpretability experiments on small models


archscope

Mechanistic interpretability experiments across architectures — Transformers, SSMs/Mamba, recurrent models, and hybrids.

CI · Python 3.10+ · License: Apache 2.0

What archscope is

archscope is a small-model interpretability workbench. It's designed for quick, reproducible experiments across model families — not for large-scale SAE training, production model auditing, or replacing mature Transformer-specific tools.

Use it when you want to ask:

  • Can I extract comparable activations from different architectures?
  • Do linear probes transfer across model families?
  • Do induction-like behaviors appear outside attention?
  • Did a fine-tuned model drift in specific layers?
  • Do dense or rank-1 SAEs reconstruct this model family better at this layer?

It is not: a competitor to transformer_lens or nnsight (both are broader and more mature), a production audit tool, or a SaaS. It's a small, hackable workbench.

import archscope as mi

# One call → HuggingFace model + tokenizer + the right backend
model, tok, backend = mi.load_model("state-spaces/mamba-130m-hf", arch="mamba")

# Extract Mamba's recurrent SSM state h_t (in addition to residual stream)
ssm = backend.extract(tok("text", return_tensors="pt"), layers=["layer_12.ssm_state"])[0]
# Shape: (B, intermediate_size, ssm_state_size) = (B, 1536, 16) for mamba-130m

load_model handles pad_token setup, model.eval(), and backend auto-detection. If you'd rather drive transformers yourself, every method also accepts backend_hint=....
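
For reference, a minimal sketch of the manual path that load_model replaces (the real internals may differ):

from transformers import AutoModelForCausalLM, AutoTokenizer
import archscope as mi

# Manual equivalent of mi.load_model (sketch; the real internals may differ):
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
tok = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
if tok.pad_token is None:
    tok.pad_token = tok.eos_token      # pad_token setup
model.eval()                           # inference mode
backend = mi.backends.Backend.for_model(model, hint="mamba")   # backend selection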


What's inside

Core mech-interp methods

| Module | What it does | Source |
|---|---|---|
| probes | Linear/MLP probes on hidden states | Drop the Act (arXiv:2605.11467) |
| sae | Dense + rank-1 factored sparse autoencoders | WriteSAE (arXiv:2605.12770) |
| neurons | Top-K contrastive neuron modulation | Targeted Neuron Modulation (arXiv:2605.12290) |
| attribute | Activation patching + DIM decomposition | Multi-Agent Sycophancy (arXiv:2605.12991) |
| circuits | Induction, copy, and attention-concentration detectors | Olsson et al., 2022 |
| lens | Logit lens + tuned lens | Belrose et al., 2023 |
| diff | Model diff: base vs fine-tuned, find what changed | this library |

Experiment infrastructure

| Module | What it does |
|---|---|
| backends | Unified extraction API across architectures |
| transfer | Cross-arch probe transfer via paired-activation linear alignment |
| bench | InterpProfile — a standardized, comparable profile (mi.bench.benchmark()) |
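
The alignment step behind transfer is conceptually simple. A minimal sketch of paired-activation linear alignment in plain torch (the library's actual fitting code may differ):

import torch

# Sketch of paired-activation linear alignment (not the library's exact code):
# given activations A, B from two models on the same inputs, fit W with A @ W ≈ B.
A = torch.randn(256, 768)    # e.g. source-arch residuals, (n_examples, d_src)
B = torch.randn(256, 1536)   # e.g. target-arch residuals, (n_examples, d_tgt)
W = torch.linalg.lstsq(A, B).solution   # least-squares map, shape (768, 1536)

# A probe trained in the target space can then be scored on A @ W.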

Backends

| Backend | Auto-detected model_type | What you get |
|---|---|---|
| transformer | llama, mistral, qwen2, qwen3, gpt2, gpt_neox (Pythia), gpt_neo, gptj, falcon, mpt, bloom, opt, phi, phi3, gemma, gemma2, starcoder2 | residual stream per layer |
| mamba | mamba, mamba2 | residual + explicit .ssm_state (recurrent h_t) |
| kazdov | — (pass hint="kazdov") | residual per custom block |
| recurrent | — (pass hint="recurrent", subclass for full extract) | hidden state per layer |

If Backend.for_model(model) is called on a model whose config.model_type isn't in the autodetect list, it raises a clear ValueError rather than silently picking a backend. Pass hint="..." explicitly for anything outside the list, or register a new backend via Backend.register("name").
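
A sketch of what registration could look like (the decorator form and the required methods are assumptions; see the backends module for the real interface):

from archscope.backends import Backend

# Sketch: registering a custom backend. Decorator form and subclass hooks
# are assumptions, not the documented API.
@Backend.register("my_rnn")
class MyRNNBackend(Backend):
    def extract(self, inputs, layers):
        # Return per-layer hidden states for the custom architecture.
        ...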

Method × backend support

Not every method works on every architecture. The cross-product:

| Method | transformer | mamba | kazdov | recurrent |
|---|---|---|---|---|
| probes.fit_probe | ✅ | ✅ | ✅ | ✅ |
| sae.fit_sae (dense / rank-1) | ✅ | ✅ | ✅ | ✅ |
| neurons.find_neurons | ✅ | ✅ | ✅ | ✅ |
| attribute.activation_patch | ✅ | ✅ residual only | ✅ | ⚠️ subclass needed |
| attribute.dim_decompose | ✅ | ❌ no attention/MLP submods | ❌ | ❌ |
| circuits.* (behavioural) | ✅ | ✅ | ✅ | ✅ |
| lens.logit_lens | ✅ | ⚠️ degrades with depth — use TunedLens | ⚠️ | ✅ |
| lens.TunedLens.fit | ✅ | ✅ | ✅ | ⚠️ |
| diff.compare | ✅ | ✅ | ✅ | ✅ |
| transfer.evaluate_transfer | ✅ ↔ any | ✅ ↔ any | ✅ ↔ any | ✅ ↔ any |
| bench.benchmark | ✅ | ✅ | ✅ | partial |

❌ entries raise a clear ValueError rather than silently degrading.
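
For example (a sketch; the exact dim_decompose call signature is an assumption):

import archscope as mi

# Sketch: a ❌ cell fails fast instead of silently degrading.
try:
    mi.attribute.dim_decompose(mamba_model, tk(["some text"]), backend_hint="mamba")
except ValueError as e:
    print(e)   # explains that Mamba exposes no attention/MLP submodules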


Install

pip install archscope   # once on PyPI
# or:
git clone https://github.com/OriginalKazdov/archscope.git
cd archscope && pip install -e .

For Mamba on CPU you don't need mamba-ssm — HF's slow path works. On CUDA install mamba-ssm for the fast path.


Quick examples

Train a probe on any architecture

import archscope as mi
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tok   = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
tok.pad_token = tok.pad_token or tok.eos_token  # Pythia ships without a pad token
tk = lambda txts: tok(txts, return_tensors="pt", padding=True, truncation=True)

probe = mi.probes.fit_probe(
    model,
    inputs_pos=tk(["I love this", "Wonderful!", "Amazing"]),
    inputs_neg=tk(["I hate this", "Awful", "Terrible"]),
    layer_name="layer_5.residual",
    backend_hint="transformer",
)
print(probe.metrics)   # {'train_auroc': 1.0, ...}

Extract Mamba's SSM recurrent state

backend = mi.backends.Backend.for_model(mamba_model, hint="mamba")
rec = backend.extract(tk("Hello world"), layers=["layer_12.ssm_state"])[0]
# rec.activations.shape == (B, intermediate_size, ssm_state_size)
# This is the actual recurrent memory h_t of Mamba — exposed via the same
# extraction API used for Transformer residual streams.
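
A quick input-dependence check on that state, in the spirit of the SSM-var column reported under Findings (a sketch; the bench metric's exact definition may differ):

import torch

states = []
for text in ["Hello world", "The quick brown fox", "2 + 2 = 4"]:
    r = backend.extract(tk(text), layers=["layer_12.ssm_state"])[0]
    states.append(r.activations.flatten())   # final h_t has no sequence dim

S = torch.stack(states)                       # (n_inputs, intermediate * state)
ratio = (S.var(dim=0).mean() / S.var()).item()
print(f"input-dependent variance ratio ~ {ratio:.2f}")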

Logit lens / tuned lens — see what each layer "thinks"

result = mi.lens.logit_lens(
    model, tok,
    prompt="The capital of France is",
    target_token=" Paris",
    backend_hint="transformer",
)
print(result.to_markdown())

# Tuned lens — learned per-layer projections (Belrose et al 2023):
tl = mi.lens.TunedLens.fit(model, tok, calibration_texts, backend_hint="transformer")
tl.predict(model, tok, "...", backend_hint="transformer")
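
Under the hood, naive logit lens reuses the model's final unembedding at every layer. A sketch for the Pythia/GPT-NeoX model loaded earlier (module names vary by architecture):

import torch

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states

target_id = tok(" Paris", add_special_tokens=False).input_ids[0]
for i, h in enumerate(hs):
    # Project each layer's residual through the final norm + unembedding.
    logits = model.embed_out(model.gpt_neox.final_layer_norm(h[:, -1]))
    rank = int((logits[0] > logits[0, target_id]).sum()) + 1
    print(f"layer {i:2d}: target rank {rank}")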

Model Diff — what did fine-tuning change?

from archscope.diff import compare

result = compare(
    base_model, fine_tuned_model, tokenizer,
    calibration_texts=texts,
    backend_hint="transformer",
)
print(result.to_markdown())
# Per-layer residual drift, top shifted neurons, circuit deltas.

Detect circuits cross-arch

scores = mi.circuits.run_all_circuits(model, tokenizer=tok)
print(scores["induction_head"].relative)   # × chance
print(scores["copy_circuit"].score)        # accuracy

InterpBench — standardized model profile

profile = mi.bench.benchmark(
    "EleutherAI/pythia-160m", model, tok,
    backend_hint="transformer", arch_family="transformer",
    tokenize_fn=tk,
)
print(mi.bench.profile_to_markdown(profile))

CLI:

archscope info
archscope bench EleutherAI/pythia-160m --arch transformer --out pythia.json
archscope bench state-spaces/mamba-130m-hf --arch mamba

Findings — running archscope on a mini-zoo of 7 small models

Each model was profiled with bench.benchmark() (probes + circuits + dense vs rank-1 SAE); ~10 min total compute on CPU.

Reproduce

python scripts/reproduce_mini_zoo.py
# → _research/mini_zoo_leaderboard.json
# → _research/mini_zoo_leaderboard.md

Skip specific models with --skip Mamba-370m if memory is tight. Kazdov-α is included only if the local checkpoint is available.

| Model | Arch | Params | Induction (× chance) | SAE-dense | SAE-rank1 | SSM var |
|---|---|---|---|---|---|---|
| Pythia-160m | transformer | 162M | 490× | 0.019 | 0.025 | — |
| Pythia-410m | transformer | 405M | 3,261× | 0.075 | 0.135 | — |
| GPT-2 | transformer | 124M | 6,393× | 5.731 | 0.608 | — |
| Mamba-130m | SSM | 129M | 6,378× | 0.048 | 0.032 | 0.54 |
| Mamba-370m | SSM | 372M | 7,730× | 0.022 | 0.027 | 0.73 |
| Qwen2.5-0.5B | transformer | 494M | 17,637× | 0.092 | 0.068 | — |
| kazdov-α | hybrid | 98M | 2,700× | 0.043 | 0.004 | — |

Open questions raised by this run (single-seed observations, not formal claims):

  • Does induction-like behavior require attention heads? Mamba — which has no attention mechanism — scores 6,378–7,730× chance on our behavioral induction test, comparable to or above similarly sized Transformers. The test is behavioral (output-based), so it doesn't presume any specific mechanism. What in SSMs implements this behavior?
  • Why does naive logit lens degrade with depth on Mamba? Applying each model's own lm_head to its intermediate residuals surfaces the target with depth on Pythia (target rank 5117 → 77 across 12 layers on "capital of France is Paris"). The same procedure on Mamba moves the target away from top-1 (rank 197 → 1049 across 24 layers). Does this hold across more SSM checkpoints? Is tuned-lens enough to fix it?
  • Is rank-1 SAE preference architecture-driven or layer-driven? In this run, GPT-2, both Mambas, and kazdov-α reconstructed better with rank-1 factored SAEs at the tested mid-layer; both Pythias preferred dense; Qwen was marginal. Suggestive but needs layer sweeps + multiple seeds before claiming a pattern.
  • How much do training recipe, tokenizer, and data affect induction-like behavior? Qwen2.5-0.5B shows 17,637× induction — 5.4× higher than Pythia-410m at similar size. Plausibly attributable to data curation + training stability since 2023, but we haven't isolated the cause.
  • Does Mamba's SSM-state utilization scale with model size? In this run, the input-dependent variance ratio rose 0.54 (Mamba-130m) → 0.73 (Mamba-370m). Does this trend hold across more checkpoints?

These aren't published findings — they're observations from a single mini-zoo run. Methodological corrections welcome.

Metrics caveats

  • Induction score is behavioral (output-based), not proof of a specific circuit. It tells you the model copies A→B associations in-context; it doesn't tell you how. A minimal version of such a test is sketched after this list.
  • SAE reconstruction error is measured on a small sample of mid-layer activations. Lower is better. Numbers are not comparable across layers with different residual magnitudes (e.g., Pythia L11 has very large residuals which dominate dense SAE recon).
  • SSM-state variance ratio is descriptive — it tells you whether the state changes meaningfully across inputs, not whether the state is causally used downstream.
  • Logit lens results are diagnostic, not a guarantee of representational alignment. Naive logit lens applies the final lm_head to intermediate residuals — when that fails, it just means the residuals aren't in the final-layer vocab space (e.g., Mamba). TunedLens is the fix.
  • All probes/SAEs/circuit tests in InterpBench are single-seed. Treat differences <2× as noise.
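
For reference, a minimal behavioral induction test of the kind circuits runs (a sketch; the library's version differs in details):

import torch

torch.manual_seed(0)
seq = torch.randint(100, tok.vocab_size, (1, 50))   # random token sequence
inputs = torch.cat([seq, seq], dim=1)               # repeated once

with torch.no_grad():
    logits = model(input_ids=inputs).logits

# On the second pass every token is predictable from its first occurrence.
preds = logits[0, 49:-1].argmax(-1)
acc = (preds == inputs[0, 50:]).float().mean().item()
print(f"induction accuracy {acc:.2f}  (~{acc * tok.vocab_size:.0f}x chance)")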

Honest limits

archscope is a v0.2 release. What it does well: cross-architecture mech-interp primitives behind a unified API, validated on multiple model families, with reproducible mini-zoo observations. What it doesn't do yet:

  • No causal scrubbing (gold-standard circuit verification)
  • No interactive notebook viz (matplotlib helpers are TBD)
  • Circuit detection is limited to induction / copy / attention-concentration — no IOI, name-mover, or successor heads yet
  • Mamba-2 backend support is partial (Mamba-1 fully supported)
  • No pretrained SAE collection (you train your own per layer)
  • Probe transfer assumes same-tokenizer paired data

See CONTRIBUTING.md for what we welcome (new backends, new circuit detectors, viz helpers).

For established Transformer-centric workflows, prefer transformer_lens or nnsight; both are broader and more mature. archscope focuses on lightweight cross-architecture experiments and small / non-standard model workflows.


Citation

@misc{dovzak2026archscope,
  title  = {archscope: Cross-architecture mechanistic interpretability experiments},
  author = {Juan Cruz Dovzak},
  year   = {2026},
  url    = {https://github.com/OriginalKazdov/archscope}
}

Source papers reimplemented or wrapped:

  • WriteSAE — arXiv:2605.12770
  • Drop the Act / ProFIL — arXiv:2605.11467
  • Targeted Neuron Modulation — arXiv:2605.12290
  • Multi-Agent Sycophancy — arXiv:2605.12991
  • Tuned lens — Belrose et al., 2023
  • Induction heads — Olsson et al., 2022

Troubleshooting

"The fast path is not available because ..." (Mamba on CPU)

Normal. Mamba falls back to a slow pure-PyTorch path that works correctly (~30 s per benchmark vs ~1 s on CUDA). Run pip install mamba-ssm causal-conv1d only on CUDA machines.

Custom backend not auto-detected

Pass Backend.for_model(model, hint="my_backend") explicitly. Auto-detection uses config.model_type.

RuntimeError: Trying to backward through the graph a second time

Activations from Backend.extract() carry the autograd graph by default. Call .detach() before reusing, or extract inside torch.no_grad(). The high-level probes.fit_probe() does this for you.
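
Two safe patterns (a sketch, using the .activations field shown in the Quick examples):

import torch

# Pattern 1: never build the graph in the first place.
with torch.no_grad():
    res = backend.extract(tk(["some text"]), layers=["layer_5.residual"])[0]

# Pattern 2: detach before reusing activations in your own training loop.
res = backend.extract(tk(["some text"]), layers=["layer_5.residual"])[0]
feats = res.activations.detach()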


Roadmap (post-0.2.0)

  • Multi-token circuit detection: IOI, name-mover, successor heads
  • Mamba-2 backend with same .ssm_state API
  • Cross-arch SAE feature alignment (extend transfer.py from probes to features)
  • Pretrained SAE collection for common small models
  • Plotly/matplotlib viz helpers
  • HuggingFace Space demo

PRs welcome — see CONTRIBUTING.md.


License

Apache-2.0
