Sound behavior-equivalence verification for refactors, using your own tests for inputs

Selfsame

License: MIT

Sound behavior-equivalence verification for refactors. Selfsame checks that a refactor didn't change behavior, using your project's own tests for inputs: it captures real call arguments while your tests (or app) run, replays both versions of the code in isolated subprocesses, and compares the results structurally.

Guarantee: zero false confidence. On the inputs it replays, Selfsame never reports equivalent when behavior actually differs, or divergent when it doesn't. When it can't be sure, it refuses (unverifiable / unsupported) rather than guess.

Install

pip install selfsame        # or: pipx install selfsame  ·  uv tool install selfsame

Pure standard library, with no runtime dependencies. Installs the selfsame command; probe is kept as an alias.

Quickstart

# Did my working-tree refactor change behavior vs main? (inputs come from your tests)
selfsame verify --base main --modules mypkg -- pytest -q

# CI mode: only the functions changed in this PR; non-zero exit if any diverged
selfsame verify --base main --modules mypkg --changed-only -- pytest -q

Commands

command             what it does
selfsame verify     capture inputs from your tests, replay base vs head, per-function/method verdict (+ CI exit code)
selfsame check      generate inputs and check two files or two git refs
selfsame capture    record real call arguments from any test or app command
selfsame replay     replay captured arguments across two git refs
selfsame attach     on-demand capture flush from a running, hook-enabled process
selfsame demo       run the built-in corpus end-to-end

Each verdict is one of: equivalent (trustworthy pass), divergent (shows the input + before→after), unverifiable (nondeterministic / uncontrolled I/O — with cause), or unsupported (no input strategy). Everything also works as python -m probe.<cmd>.

How it works (the demo)

selfsame demo        # or: python3 run_probe.py

The demo runs the engine against a hand-built corpus (units/). It's pure stdlib and re-execs once to fix PYTHONHASHSEED=0 so hash/set ordering is controlled for the whole run.
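A minimal sketch of the "re-exec once to fix the hash seed" idea, assuming nothing about how run_probe.py actually does it. The helper names env_with_fixed_hash_seed and ensure_fixed_hash_seed are hypothetical. PYTHONHASHSEED is read at interpreter startup, which is why setting it from inside a running process requires starting a new one.

```python
import os
import sys


def env_with_fixed_hash_seed(environ):
    """Return (new_env, needs_reexec) for a run that wants PYTHONHASHSEED=0."""
    new_env = dict(environ)
    if new_env.get("PYTHONHASHSEED") == "0":
        return new_env, False
    new_env["PYTHONHASHSEED"] = "0"
    return new_env, True


def ensure_fixed_hash_seed():
    """Re-exec the current script once if the hash seed is not yet fixed.

    Hypothetical helper: call it first thing in an entry point.
    """
    new_env, needs_reexec = env_with_fixed_hash_seed(os.environ)
    if needs_reexec:
        # Replaces this process. The second run already sees the seed, so
        # needs_reexec is False and execution continues normally.
        os.execve(sys.executable, [sys.executable, *sys.argv], new_env)
```

Because the second run finds the variable already set, the re-exec happens at most once.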

Check a real refactor

The demo above runs the engine against a hand-built corpus. To point it at actual code — two versions of a module — use probe.check. It extracts the top-level functions present in both versions, pairs the ones whose signatures are unchanged, and checks each in an isolated subprocess:
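The pairing step described above can be sketched with the standard ast module. This is an illustration under assumptions, not the code in probe/extract.py: the function names are hypothetical, and "signature unchanged" is approximated here as "identical parameter list" (names, annotations, defaults), ignoring the return annotation.

```python
import ast


def top_level_signatures(source):
    """Map each top-level function name to a comparable dump of its parameters."""
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node.args)  # dump omits line numbers by default
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def pair_functions(before_src, after_src):
    """Names defined in both versions whose parameter lists are identical."""
    before = top_level_signatures(before_src)
    after = top_level_signatures(after_src)
    return sorted(
        name for name in before.keys() & after.keys()
        if before[name] == after[name]
    )


BEFORE = """
def add(a: int, b: int) -> int:
    return a + b

def scale(x: float, factor: float = 2.0) -> float:
    return x * factor

def removed(x):
    return x
"""

AFTER = """
def add(a: int, b: int) -> int:
    total = a + b
    return total

def scale(x: float, factor: float = 2.0, clamp: bool = False) -> float:
    return x * factor
"""

print(pair_functions(BEFORE, AFTER))  # ['add']
```

Here scale is skipped because its parameter list changed, and removed is skipped because it exists in only one version.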

# two files on disk
python3 -m probe.check before.py after.py

# two git refs + a path in the repo
python3 -m probe.check --git main HEAD app/calc.py

Try it on the bundled example (an equivalent refactor, one real bug, an entropy-using function, an unannotated function, and a signature change):

python3 -m probe.check examples/calc_before.py examples/calc_after.py

Each matched function gets one verdict: equivalent (trustworthy pass), divergent (shows the input + before→after), unverifiable (nondeterministic, with cause), or unsupported (no input-generation strategy for its types). The command exits non-zero when any divergence is caught, so it can gate CI. This is the narrow-but-real slice: it works today on deterministic functions with type-hinted, generatable parameters — not yet on stateful classes, I/O against live systems, or cross-file refactors.

Verify a refactor with the repo's own tests (the main path)

probe.check generates inputs, which fails on real code that is untyped or in a package (see experiments/FINDINGS.md). The capture-replay path instead records real arguments from an existing test run and replays both versions, loaded package-aware from git worktrees. No type hints required; relative imports work; methods on classes are supported (the receiver self is captured and rebuilt against each version).
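To make the capture-replay idea concrete, here is a small in-process sketch. It assumes nothing about Selfsame's internals: the real tool injects a hook into spawned processes and replays in subprocesses from git worktrees, whereas this example uses a plain wrapper and two hypothetical versions of one function (mean_base, mean_head).

```python
import copy
import functools


def mean_base(xs):
    return sum(xs) / len(xs)


def mean_head(xs):
    # The "refactor" quietly changes behavior for an empty list.
    if not xs:
        return 0.0
    return sum(xs) / len(xs)


def recording(fn, log):
    """Wrap fn so every call appends a deep copy of its arguments to log."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.append((copy.deepcopy(args), copy.deepcopy(kwargs)))
        return fn(*args, **kwargs)
    return wrapper


def observe(fn, args, kwargs):
    """Outcome of one call: the return value or the exception type."""
    try:
        return ("returned", fn(*copy.deepcopy(args), **copy.deepcopy(kwargs)))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def replay(base, head, log):
    """Return (args, base_outcome, head_outcome) for each captured call that differs."""
    divergences = []
    for args, kwargs in log:
        before, after = observe(base, args, kwargs), observe(head, args, kwargs)
        if before != after:
            divergences.append((args, before, after))
    return divergences


# Stand-in for a test run: the tests exercise the base version under capture.
log = []
mean = recording(mean_base, log)
mean([1, 2, 3])
try:
    mean([])
except ZeroDivisionError:
    pass

divergences = replay(mean_base, mean_head, log)
print(divergences)
```

The inputs need no type hints because they are whatever the tests actually passed. The empty-list call is the only one reported, with ZeroDivisionError before and 0.0 after.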

One command — run it from the repo root:

# "did my working-tree refactor change behavior vs main?"
python3 -m probe.verify --base main --modules mypkg -- pytest -q

# any test runner works (capture is injected into every spawned process)
python3 -m probe.verify --base v1.2 --head HEAD --modules mypkg -- python -m unittest

It captures inputs while the tests run, replays both versions, prints a per-function/method verdict, and exits non-zero if any divergence is caught (drop it in CI). --head defaults to your current working tree.

Because the probe runs the target's code and tests, it must use a Python the target supports. Pass --python /path/to/pythonX.Y to run the tests and replay workers under that interpreter; the repo's requires-python is checked and a mismatch is reported loudly instead of silently capturing nothing:

python3 -m probe.verify --base main --modules cachetools \
        --python /path/to/py310/bin/python -- python -m pytest -q

Per-function replay runs in parallel; a function whose replay exceeds PROBE_WORKER_TIMEOUT is reported timeout (not-comparable), never a false pass.
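One way to get "timeout is its own outcome, never a pass" is to run each replay in a child interpreter with a deadline. The sketch below is an assumption-laden illustration: run_worker is a hypothetical helper, and how PROBE_WORKER_TIMEOUT is read or what unit it uses is not specified here, so the limit is passed in as plain seconds.

```python
import subprocess
import sys


def run_worker(code, timeout_secs):
    """Run code in a fresh interpreter.

    Returns ("ok", stdout), ("crashed", stderr) or ("timeout", None).
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_secs,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # No result was observed, so there is nothing to compare.
        return ("timeout", None)
    if done.returncode != 0:
        return ("crashed", done.stderr)
    return ("ok", done.stdout.strip())
```

A caller would treat anything other than "ok" on either side as not comparable, rather than falling back to an equivalent verdict.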

For CI on a PR, add --changed-only to check just the functions whose body changed between base and head (the rest are unchanged and uninteresting):

python3 -m probe.verify --base main --modules mypkg --changed-only -- pytest -q
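As a rough illustration of what "functions whose body changed" can mean, the sketch below compares the parsed bodies of top-level functions, so comment and whitespace edits do not count as changes. This is an assumption about one reasonable definition, not Selfsame's actual rule; changed_functions and the sample sources are hypothetical, and methods are ignored for brevity.

```python
import ast


def function_bodies(source):
    """Map each top-level function name to a dump of its body statements."""
    tree = ast.parse(source)
    return {
        node.name: "\n".join(ast.dump(stmt) for stmt in node.body)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def changed_functions(base_src, head_src):
    """Names present in both versions whose parsed body differs."""
    base, head = function_bodies(base_src), function_bodies(head_src)
    return sorted(n for n in base.keys() & head.keys() if base[n] != head[n])


BASE = """
def area(w, h):
    return w * h

def perimeter(w, h):
    return 2 * (w + h)
"""

HEAD = """
def area(w, h):
    # comment and spacing only
    return w   *   h

def perimeter(w, h):
    return 2 * w + 2 * h
"""

print(changed_functions(BASE, HEAD))  # ['perimeter']
```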

Inputs from a real app, not just tests

The capture command after -- can be anything that runs your code — a script, an integration harness, or a server — so inputs aren't limited to your test suite:

# capture real call arguments from an actual app run
probe capture --modules mypkg --out caps.pkl -- python -m myapp run-some-workload
probe replay /path/to/repo main HEAD caps.pkl

For a long-running process (a server you exercise by hand), the hook flushes captures every few seconds (PROBE_CAPTURE_FLUSH_SECS), so an abrupt SIGTERM/SIGKILL still leaves a usable capture file.
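A sketch of why a periodic flush survives an abrupt kill, assuming a write-then-rename strategy. This is not necessarily how the capture hook works: CaptureLog is a hypothetical class, and the interval is passed in directly rather than read from PROBE_CAPTURE_FLUSH_SECS.

```python
import os
import pickle
import tempfile
import time


class CaptureLog:
    """Hypothetical capture buffer that persists itself every flush_secs."""

    def __init__(self, path, flush_secs):
        self.path = os.path.abspath(path)
        self.flush_secs = flush_secs
        self.records = []
        self._last_flush = time.monotonic()

    def record(self, name, args, kwargs):
        self.records.append((name, args, kwargs))
        if time.monotonic() - self._last_flush >= self.flush_secs:
            self.flush()

    def flush(self):
        # Write a temp file in the same directory, then rename it over the
        # target. os.replace is atomic on POSIX, so a kill in the middle of a
        # write leaves the previous complete file, not a truncated one.
        fd, tmp_path = tempfile.mkstemp(
            dir=os.path.dirname(self.path), suffix=".tmp"
        )
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(self.records, handle)
        os.replace(tmp_path, self.path)
        self._last_flush = time.monotonic()
```

With this shape, the most a SIGKILL can lose is the calls recorded since the last flush.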

You can also snapshot a running, hook-enabled process on demand without stopping it:

# start the process under capture with a known dump directory
probe capture --modules mypkg --capture-dir ./caps --out caps.pkl -- python -m myapp serve
# ...later, in another shell, dump its current captures (process keeps running):
probe attach <pid> --capture-dir ./caps      # writes caps/cap-<pid>.pkl

probe attach sends the hook's flush signal (default SIGUSR1, override with PROBE_CAPTURE_FLUSH_SIGNAL). This works only for processes started under the hook — it does not inject into an arbitrary unmodified process (that needs ptrace/gdb and is heavily restricted, especially on macOS under SIP; see experiments/FINDINGS.md §9).
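The signal-based flush can be pictured as follows. This is a Unix-only sketch under assumptions: install_flush_handler and request_flush are hypothetical names, not Selfsame's API. Only the cap-<pid>.pkl file name and the SIGUSR1 default come from the description above. It also shows why the target must have been started under the hook: without a handler installed, SIGUSR1's default action terminates the process.

```python
import os
import pickle
import signal


def install_flush_handler(records, capture_dir, signum=None):
    """Dump records to capture_dir/cap-<pid>.pkl whenever signum arrives."""
    if signum is None:
        signum = signal.SIGUSR1  # Unix only; not defined on Windows

    def handler(received_signum, frame):
        path = os.path.join(capture_dir, f"cap-{os.getpid()}.pkl")
        with open(path, "wb") as handle:
            pickle.dump(list(records), handle)

    # signal.signal may only be called from the main thread.
    signal.signal(signum, handler)


def request_flush(pid, signum=None):
    """What an attach-style command could do: ask pid to flush and keep running."""
    os.kill(pid, signal.SIGUSR1 if signum is None else signum)
```

In this sketch the process being captured calls install_flush_handler at startup, and a second shell calls request_flush with its pid.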

Capture and replay are also available separately (probe.capture --modules M --out caps.pkl -- <test cmd> then probe.replay <repo> <base> <head> caps.pkl).

Measured: on inflection (untyped history) this turns probe.check's 0% into 100% sound auto-verify (10 equivalent, 3 real behavior changes caught) across 20 real commits — because the inputs come from tests, not from guessing. Coverage then tracks test coverage; the soundness rules (refuse uncontrolled I/O / threads / nondeterminism / opaque returns) are unchanged.

What it does

For every unit (an original + a refactored function):

  1. Generates inputs from type hints (probe/generators.py) plus any unit-supplied seed inputs. (Hypothesis is the production upgrade.)
  2. Self-checks determinism (probe/harness.py): runs the original 3× per input under a controlled environment (frozen clock, seeded RNG, fixed hash seed, recorded I/O). If the three runs disagree, the unit is unverifiable, and the cause is classified (concurrency / uncontrolled-time / uncontrolled-entropy / unknown) by counting threads started and direct time/entropy calls. This is the negative control: stable code must never be flagged.
  3. Diffs the versions (only on deterministic units): runs original vs refactored on the same inputs and compares observed behavior — return value plus the ordered trace of external effects plus exceptions. Any mismatch is a caught behavioral divergence.
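Steps 2 and 3 can be sketched in a few lines. This is a simplified illustration, not the code in probe/harness.py: it omits the controlled environment (frozen clock, seeded RNG), the effect trace and cause classification, and all names here are hypothetical.

```python
import itertools


def observe(fn, args):
    """Observed behavior of one call: return value or exception."""
    try:
        return ("returned", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__, str(exc))


def verdict(original, refactored, inputs, runs=3):
    # Step 2: the original must agree with itself before any diff is trusted.
    for args in inputs:
        outcomes = [observe(original, args) for _ in range(runs)]
        if any(outcome != outcomes[0] for outcome in outcomes[1:]):
            return ("unverifiable", args)
    # Step 3: only deterministic units are diffed.
    for args in inputs:
        before, after = observe(original, args), observe(refactored, args)
        if before != after:
            return ("divergent", args, before, after)
    return ("equivalent",)


def first_n(items, n):
    return items[:n]


def first_n_refactored(items, n):
    # Differs when n exceeds len(items): raises instead of truncating.
    return [items[i] for i in range(n)]


_counter = itertools.count()


def ticket(prefix):
    # Hidden state makes repeated calls disagree.
    return f"{prefix}-{next(_counter)}"


inputs = [([1, 2, 3], 2), ([1, 2, 3], 5)]
print(verdict(first_n, first_n, inputs))             # A-vs-A negative control
print(verdict(first_n, first_n_refactored, inputs))  # caught divergence
print(verdict(ticket, ticket, [("a",)]))             # refused, not guessed
```

The A-vs-A run returns ("equivalent",), the refactor is reported divergent on ([1, 2, 3], 5), and ticket is reported unverifiable because its own three runs disagree.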

Effects (probe/effects.py) are injected and recorded: the trace of external calls is treated as part of behavior, so a refactor that changes which calls fire (or their order) is detected, not hidden.
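To show why the effect trace counts as behavior, here is a toy recorder. The class RecordingEffects and its call method are hypothetical and do not reflect the actual interface of probe/effects.py. The two functions return the same value but fire their external calls in a different order.

```python
class RecordingEffects:
    """Hypothetical effect shim: returns canned responses and records each call."""

    def __init__(self, responses=None):
        self.responses = responses or {}
        self.trace = []

    def call(self, name, *args):
        self.trace.append((name, args))
        return self.responses.get(name)


def save_then_notify(fx, user):
    fx.call("db.save", user)
    fx.call("email.send", user)
    return "ok"


def notify_then_save(fx, user):
    # Same return value, different order of external effects.
    fx.call("email.send", user)
    fx.call("db.save", user)
    return "ok"


def behavior(fn, *args):
    """Return value plus the ordered trace of effects."""
    fx = RecordingEffects()
    return (fn(fx, *args), fx.trace)


before = behavior(save_then_notify, "ada")
after = behavior(notify_then_save, "ada")
print(before[0] == after[0])  # True: return values match
print(before == after)        # False: the traces differ
```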

What the result means

The included corpus (units/) is a stratified stand-in for real LLM-refactored OSS code: pure, time/RNG, I/O, stateful, concurrent, plus three positive controls (refactors that deliberately change behavior and must be caught). Running it demonstrates the engine end-to-end and shows the controls firing:

  • concurrency units → flagged unverifiable (cause: concurrency)
  • positive controls → 3/3 caught (off-by-one, dropped zero-guard, changed default)
  • everything else → verified equivalent, zero false positives

The 89% coverage on this corpus is not a real-world estimate — the mix is deliberately tractable. The real number comes from swapping in real refactors (below).

To run the real probe

Replace the corpus with real material:

  1. Sample ~16 units from real OSS Python repos, stratified per the protocol (don't cherry-pick pure functions).
  2. Generate each refactored version with an actual LLM ("refactor, preserve behavior").
  3. Keep the three positive controls and the A-vs-A negative control.
  4. For real I/O, swap the deterministic Effects stubs for vcrpy-style record-replay against real recorded responses.
  5. Read the verdict; if it lands 40–65%, re-run a confirmation in a typed language (Go).

Files

run_probe.py                         entry point for the corpus demo (fixes hash seed)
probe/verify.py                      CLI: one-command verify via the repo's tests
probe/check.py                       CLI: check a real refactor (two files or git refs)
probe/capture.py                     CLI: capture real call args from any test command
probe/attach.py                      CLI: on-demand flush of a running hook-enabled process
probe/_capture_hook.py               capture hook injected into spawned processes
probe/replay.py                      CLI: replay captured args across two refs (worktrees)
probe/canonical.py                   JSON canonical value form (cross-process compare)
probe/extract.py                     pull + pair functions from two module versions
probe/_worker.py                     isolated per-unit subprocess worker
probe/_replay_worker.py              per-version replay subprocess (functions + methods)
probe/effects.py                     recorded, deterministic effect shims
probe/generators.py                  type-hint-driven input generation
probe/harness.py                     observe / self-check / diff / classify (the core)
probe/equality.py                    structural value equality (not repr)
probe/runner.py                      orchestration, metrics, thresholds, verdict
probe/model.py                       Unit dataclass
units/                               the stratified corpus + positive controls
examples/                            calc_before.py / calc_after.py for probe.check
tests/                               unit + end-to-end tests (python3 -m unittest discover -s tests)

Soundness (the verifier must never be confidently wrong)

A black-box checker can't prove equivalence (the input space is infinite), so the one thing it must never do is say "equivalent" when behavior actually differs. We measured this honestly: a stratified, not-cherry-picked corpus (experiments/, run python3 experiments/measure.py) scores each verdict against author ground truth.

                         naive tally         reality
before soundness work    75% "verifiable"    42% trustworthy, 33% confidently wrong
after soundness work     50% "verifiable"    50% trustworthy, 0% confidently wrong

The three fixes that closed the gap:

  • Uncontrolled I/O is refused, not certified. The harness counts real file and socket access at runtime, and a static scan flags functions that can reach the network/subprocess even when the sampled inputs don't. Either way → unverifiable (uncontrolled-io). (Code that routes I/O through the recorded Effects shim stays verifiable.)
  • Any thread use is unverifiable — even if the sampled runs happened to agree. A race that didn't manifest is not a guarantee.
  • Literals are mined from the code and fed back as inputs, so a bug hinging on a magic value (e.g. a parser that special-cases "on") is caught instead of missed by a fixed value pool.
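The literal-mining fix in the last bullet can be approximated with ast. This is a sketch under assumptions, not the project's implementation: mine_literals is a hypothetical name, and the choice to keep only str, bytes, int and float constants (excluding booleans) is made here for illustration.

```python
import ast


def mine_literals(source):
    """Collect string and numeric constants that appear in the source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Constant):
            continue
        value = node.value
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (str, bytes, int, float)) and not isinstance(value, bool):
            found.add(value)
    return found


SOURCE = '''
def parse_flag(text: str) -> bool:
    if text == "on":
        return True
    return text.lower() in ("yes", "1")
'''

print(sorted(mine_literals(SOURCE)))  # ['1', 'on', 'yes']
```

Feeding "on" back as an input is what lets a refactor that drops the special case show up as divergent. A fixed pool of generic strings would be unlikely to contain it.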

Residual honest gap: a function whose risky path is only reached by an input the generator never produces (e.g. a valid URL string) can still read "equivalent". This is the fundamental limit of example-based generation. Hypothesis / coverage-guided generation is the real fix, and the I/O static scan already covers the common cases.

Honesty notes (what this engine does and does not guarantee)

  • Equivalence is structural, not repr-based. Objects with only identity equality are compared by their __dict__/__slots__ state; floats handle nan/-0.0; an object we cannot introspect is reported not provably equal rather than guessed.
  • Determinism control is broad but bounded. The harness freezes the clock (time.*, time.*_ns, datetime.now/utcnow/today) and seeds entropy (random, os.urandom, random._urandom, uuid4/1, secrets). It cannot intercept from datetime import datetime (reference captured at import) or per-instance random.Random(...); those surface as an unverifiable verdict, never as silent false confidence.
  • Unsupported inputs are refused, not faked. If the generator has no strategy for a parameter's type, the unit is reported unsupported (counts against coverage) instead of fed a placeholder value.
  • Input generation is bounded, not exhaustive. The stdlib generator caps the number of input combinations per function, so an "equivalent" verdict means "equivalent on the inputs tried", not a proof. Hypothesis is the intended upgrade for real coverage.
  • Isolation is per-unit, not per-call. Each function is checked in its own subprocess (crashes and runaway loops are contained, side effects are kept out of the parent), but the function still runs several times within that process. True per-call sandboxing (containers) is future work.
  • The coverage % is corpus-relative. It is a property of this hand-built stand-in corpus, not a real-world estimate. See "To run the real probe".
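The first note, structural rather than repr-based equality, can be illustrated with a small comparator. It is a sketch, not the contents of probe/equality.py. Two choices are assumptions made here because the note does not spell them out: nan is treated as equal to nan, and -0.0 as different from 0.0. The function name is hypothetical, and cyclic structures are not handled.

```python
import math


def _state(obj):
    """Instance state from __dict__ or __slots__, or None if neither exists."""
    if hasattr(obj, "__dict__"):
        return dict(vars(obj))
    slots = getattr(type(obj), "__slots__", None)
    if slots is None:
        return None
    names = [slots] if isinstance(slots, str) else slots
    return {name: getattr(obj, name) for name in names if hasattr(obj, name)}


def structurally_equal(a, b):
    if a is b:
        return True
    if type(a) is not type(b):
        return False
    if isinstance(a, float):
        if math.isnan(a) or math.isnan(b):
            return math.isnan(a) and math.isnan(b)  # assumption: nan == nan
        # assumption: the sign of zero is part of the value
        return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(
            structurally_equal(x, y) for x, y in zip(a, b)
        )
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(
            structurally_equal(a[k], b[k]) for k in a
        )
    if type(a).__eq__ is object.__eq__:
        # Identity-only equality: compare instance state instead.
        state_a, state_b = _state(a), _state(b)
        if state_a is None or state_b is None:
            return False  # cannot introspect: not provably equal
        return structurally_equal(state_a, state_b)
    return a == b


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


print(structurally_equal(Point(1, 2.0), Point(1, 2.0)))  # True
print(Point(1, 2.0) == Point(1, 2.0))                    # False: identity only
```

Comparing repr output would fail on Point as well, since the default repr embeds the object's address.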
