Circuit-level regression testing for AI systems

  ██      ██    ██████    ████████    ████████    ██████████  ██      ██
  ██      ██  ██      ██  ██      ██  ██      ██  ██          ████    ██
  ██      ██  ██      ██  ██      ██  ██      ██  ██          ██  ██  ██
  ██  ██  ██  ██████████  ████████    ██      ██  ████████    ██  ██  ██
  ██  ██  ██  ██      ██  ██  ██      ██      ██  ██          ██    ████
  ████  ████  ██      ██  ██    ██    ██      ██  ██          ██      ██
  ██      ██  ██      ██  ██      ██  ████████    ██████████  ██      ██

The problem

Behavioral evals (accuracy, BLEU, LLM-as-judge) only see input→output. A model can pass every one of them while the mechanism underneath silently shifts to something brittle — a shortcut, a spurious feature, a circuit that happens to produce the right answer for the wrong reason (feature absorption / shortcut learning). Nothing in a standard eval suite tests whether the internal computation itself is still doing what you think it's doing, so this kind of drift ships unnoticed until it fails somewhere an eval didn't cover.

How Warden solves it

Warden lets you write declarative assertions about a model's internal mechanism — not just its output — and run them like tests: "is this behavior mechanistically necessary and sufficient in this circuit, and hasn't the mechanism silently drifted?"

It uses sparse autoencoders to identify the features involved in a behavior, and causal patching (ablation / activation-patching) to measure whether that circuit is actually driving the behavior, rather than merely correlated with it. The heavy numerics run in a Rust core (PyO3); a Python layer handles orchestration, the DSL, and the CLI — so contracts read like tests and slot into a normal pytest run or CI pipeline.
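
To make the ablation idea concrete, here is a minimal sketch in plain Python. It is not Warden's implementation: the toy linear "model", the function names, and the exact score definitions (necessity as the fractional drop when the circuit is zeroed, sufficiency as the fraction retained when everything else is zeroed) are assumptions chosen for illustration. What Warden actually computes is described in docs/contracts.md.

```python
from typing import Callable, List, Set

Metric = Callable[[List[float]], float]


def ablate(acts: List[float], drop: Set[int]) -> List[float]:
    """Return a copy of the feature activations with the given indices zeroed."""
    return [0.0 if i in drop else a for i, a in enumerate(acts)]


def necessity(metric: Metric, acts: List[float], circuit: Set[int]) -> float:
    """Fraction of the behavior lost when the circuit's features are ablated."""
    base = metric(acts)
    return (base - metric(ablate(acts, circuit))) / base


def sufficiency(metric: Metric, acts: List[float], circuit: Set[int]) -> float:
    """Fraction of the behavior kept when only the circuit's features remain."""
    base = metric(acts)
    others = set(range(len(acts))) - set(circuit)
    return metric(ablate(acts, others)) / base


# Hypothetical toy "model": the behavior metric is a weighted sum of features.
WEIGHTS = [2.0, 1.0, 0.5, 0.5]


def toy_metric(acts: List[float]) -> float:
    return sum(w * a for w, a in zip(WEIGHTS, acts))


acts = [1.0, 1.0, 1.0, 1.0]
circuit = {0, 1}
print(necessity(toy_metric, acts, circuit))    # 0.75
print(sufficiency(toy_metric, acts, circuit))  # 0.75
```

In this linear toy the two scores always sum to one. In a real network they need not, because ablating features changes how the remaining computation behaves.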

Full docs, one per component, with all the "why" and the real numbers behind each claim: docs/. This README is the quickstart; docs/ is where the depth lives.


What's here

Everything below is implemented and verified against real GPT-2 small — not mocked, not synthetic where it mattered:

capability             what                                                                        docs
Circuit testing        circuit discovery, necessity/sufficiency, contract DSL, pytest plugin, CLI  contracts.md
Drift detection        drift assertions, HTML reports, GitHub Action                               drift.md
Production monitoring  plugin SDK, warden sample, OpenTelemetry/Prometheus                         production.md
Self-serve SAEs        warden train-sae, hand-derived backprop in Rust                             sae-training.md

Two deliberate simplifications (each doc above explains why):

  • Circuits are flat top-k SAE feature lists, not full attribution graphs. Drift is a Jaccard-distance proxy, not graph-edit-distance.
  • warden sample re-checks fixed contracts on an interval rather than mining circuits from unlabeled live traffic (which discover_circuit's methodology doesn't support).
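
The first simplification can be sketched in a few lines. This is an illustration only, assuming a circuit is the set of the k highest-scoring SAE feature indices; the function names and the scoring dictionaries are hypothetical, not Warden's API.

```python
from typing import Dict, Iterable, Set


def top_k_features(scores: Dict[int, float], k: int) -> Set[int]:
    """Pick the k feature indices with the highest score (a flat 'circuit')."""
    ranked = sorted(scores, key=lambda f: scores[f], reverse=True)
    return set(ranked[:k])


def jaccard_distance(a: Iterable[int], b: Iterable[int]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets, 1.0 means disjoint."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty circuits are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical per-feature scores for a baseline and a fine-tuned checkpoint.
baseline = top_k_features({1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.1}, k=4)
candidate = top_k_features({3: 0.9, 4: 0.8, 5: 0.7, 6: 0.6, 1: 0.1}, k=4)
print(jaccard_distance(baseline, candidate))  # 2 shared of 6 total, about 0.667
```

A set-based distance like this ignores edges and ordering between features, which is why it is a proxy rather than a graph-edit-distance.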

Install

uv venv .venv && source .venv/bin/activate
uv pip install maturin
maturin develop --release
uv pip install -e ".[dev]"

Note: --release is important — debug builds are ~10-20x slower for training.


Quickstart

python examples/demo.py
# or:
warden run examples/ioi_circuit.warden.yaml --json-out report.json --html-out report.html
warden report report.json

A contract (*.warden.yaml) declares a model, a layer, an SAE, an eval set, and assertions:

name: ioi_name_mover_circuit
model: gpt2
layer: 9
sae:
  repo_id: jbloom/GPT2-Small-SAEs-Reformatted
  subfolder: blocks.9.hook_resid_pre
eval_set: ioi_eval.jsonl
assertions:
  - type: necessity
    min_score: 0.3
  - type: sufficiency
    min_score: 0.08

Contracts run real model forward passes, so both the pytest plugin and @warden.contract-decorated functions are skipped by default; pytest --warden opts in.

Full DSL reference, the Python decorator form, and what necessity/sufficiency actually compute: docs/contracts.md.
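
As a rough picture of how threshold assertions turn measured scores into a pass/fail report, here is a small sketch. It mirrors the YAML contract above as a Python dict and reuses the scores reported under Results; the function names are hypothetical and only the min_score key from the example contract is handled. Warden's real evaluation logic lives in the package and docs/contracts.md.

```python
from typing import Dict, List, Tuple

# The example contract's name and assertions, written as plain Python data.
contract = {
    "name": "ioi_name_mover_circuit",
    "assertions": [
        {"type": "necessity", "min_score": 0.3},
        {"type": "sufficiency", "min_score": 0.08},
    ],
}


def check_assertion(assertion: Dict, score: float) -> Tuple[bool, str]:
    """Compare one measured score against its min_score threshold."""
    ok = score >= assertion["min_score"]
    status = "PASS" if ok else "FAIL"
    line = f"  [{status}] {assertion['type']}={score:.3f} (min {assertion['min_score']})"
    return ok, line


def evaluate_contract(contract: Dict, scores: Dict[str, float]) -> Tuple[bool, List[str]]:
    """A contract passes only if every one of its assertions passes."""
    results = [check_assertion(a, scores[a["type"]]) for a in contract["assertions"]]
    passed = all(ok for ok, _ in results)
    header = f"contract '{contract['name']}': {'PASS' if passed else 'FAIL'}"
    return passed, [header] + [line for _, line in results]


passed, report = evaluate_contract(contract, {"necessity": 0.390, "sufficiency": 0.134})
print("\n".join(report))
```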


Features

  • Necessity & sufficiency: ablate or activation-patch a discovered circuit and measure the causal effect on a real behavior, not just correlation. → docs/contracts.md

  • Drift detection: compare a checkpoint's circuit against a baseline's; demoed catching a real, self-inflicted regression from a fine-tune. → docs/drift.md

  • Plugin SDK: swap in your own model adapter or SAE loader, via a registry or a real Python entry point, no fork required. → docs/production.md

  • Production monitoring: warden sample re-checks contracts against a live checkpoint path on an interval, exporting real Prometheus/OTel metrics. → docs/production.md

  • Self-serve SAE training: warden train-sae for layers with no public dictionary; hand-derived forward/backward pass in Rust (no autodiff there), verified by numerical gradient checking. → docs/sae-training.md

  • HTML reports & GitHub Action: a self-contained report for human review, and a reusable composite action to block merges on regressions. → docs/drift.md, docs/production.md
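
The numerical gradient checking mentioned for the SAE trainer can be illustrated with a tiny pure-Python sketch. This is not Warden's Rust code: the loss (squared reconstruction error plus an L1 penalty on ReLU activations), the matrix shapes, and all names are assumptions picked to show the technique, which is to compare a hand-derived gradient against central finite differences.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def forward(W: Matrix, D: Matrix, x: List[float], l1: float) -> Tuple[float, List[float], List[float]]:
    """h = relu(W x), x_hat = D h, loss = ||x_hat - x||^2 + l1 * sum(h)."""
    pre = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    h = [max(p, 0.0) for p in pre]
    x_hat = [sum(d * hi for d, hi in zip(row, h)) for row in D]
    loss = sum((a - b) ** 2 for a, b in zip(x_hat, x)) + l1 * sum(h)
    return loss, pre, x_hat


def analytic_grad_W(W: Matrix, D: Matrix, x: List[float], l1: float) -> Matrix:
    """Hand-derived dLoss/dW via the chain rule through D, the L1 term, and ReLU."""
    _, pre, x_hat = forward(W, D, x, l1)
    err = [2.0 * (a - b) for a, b in zip(x_hat, x)]  # dLoss/dx_hat
    grad = []
    for i, p in enumerate(pre):
        dh = sum(D[j][i] * err[j] for j in range(len(x))) + l1  # dLoss/dh_i
        dpre = dh if p > 0.0 else 0.0  # ReLU passes gradient only where active
        grad.append([dpre * xj for xj in x])
    return grad


def numeric_grad_W(W: Matrix, D: Matrix, x: List[float], l1: float, eps: float = 1e-5) -> Matrix:
    """Central finite differences, perturbing one encoder weight at a time."""
    grad = [[0.0] * len(x) for _ in W]
    for i in range(len(W)):
        for j in range(len(x)):
            original = W[i][j]
            W[i][j] = original + eps
            loss_plus = forward(W, D, x, l1)[0]
            W[i][j] = original - eps
            loss_minus = forward(W, D, x, l1)[0]
            W[i][j] = original  # restore the weight before moving on
            grad[i][j] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


W = [[0.5, -0.2], [0.1, 0.4], [-0.5, -0.2]]  # third unit is inactive for this x
D = [[0.3, -0.1, 0.7], [0.2, 0.6, -0.4]]
x = [1.0, 2.0]
gap = max_abs_diff(analytic_grad_W(W, D, x, 0.01), numeric_grad_W(W, D, x, 0.01))
print(f"max |analytic - numeric| = {gap:.2e}")
```

The inputs are chosen so that no pre-activation sits near zero, since finite differences are unreliable at the ReLU kink.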


Results

Real numbers from real runs — not illustrative. Full detail, including two real bugs found (and fixed) by actually running things twice, and an honest report of where the self-serve SAE trainer currently falls short: docs/results.md.

contract 'ioi_name_mover_circuit': PASS
  [PASS] necessity=0.390 (min 0.3)
  [PASS] sufficiency=0.134 (min 0.08)

A real fine-tune-induced regression, caught:

contract 'ioi_circuit_drift_check': FAIL
  [FAIL] drift=0.824 (max 0.5)

A real Prometheus scrape:

warden_necessity{contract="demo", ...} 0.39
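
For readers unfamiliar with the Prometheus text exposition format, the sketch below renders a gauge sample as one such line. It is a hand-rolled illustration with hypothetical function names, not Warden's exporter, and it uses only the contract label because the other labels are elided in the scrape above.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    if not labels:
        return f"{name} {value}"
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}"


print(render_sample("warden_necessity", {"contract": "demo"}, 0.39))
```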

Development

cargo test                       # Rust unit tests (pure ndarray math + gradient checking, no Python needed)
maturin develop --release         # rebuild the extension after Rust changes
pytest -m "not integration"       # fast, fully offline Python tests
pytest                            # + integration tests: real GPT-2 + SAE download/forward passes
pytest --warden                   # also runs *.warden.yaml / @warden.contract items directly

-m "not integration" controls which of this repo's own tests touch the network or a real model; --warden controls whether contracts written by a Warden user run when their project's pytest executes. These are two independent gates for two different things. CI (.github/workflows/ci.yml) runs only cargo test and pytest -m "not integration", so it needs no network.

See docs/README.md for the full documentation index.
