Sound behavior-equivalence verification for refactors, using your own tests for inputs
Selfsame
Sound behavior-equivalence verification for refactors. Selfsame checks that a refactor didn't change behavior, using your project's own tests for inputs: it captures real call arguments while your tests (or app) run, replays both versions of the code in isolated subprocesses, and compares the results structurally.
Guarantee: zero false confidence. Selfsame never reports equivalent when behavior
actually differs, or divergent when it doesn't. When it can't be sure, it refuses (unverifiable/unsupported) rather than guess.
Install
pip install selfsame # or: pipx install selfsame · uv tool install selfsame
Pure standard library — no runtime dependencies. Installs the selfsame command
(probe is a kept alias).
Quickstart
# Did my working-tree refactor change behavior vs main? (inputs come from your tests)
selfsame verify --base main --modules mypkg -- pytest -q
# CI mode: only the functions changed in this PR; non-zero exit if any diverged
selfsame verify --base main --modules mypkg --changed-only -- pytest -q
Commands
| command | what it does |
|---|---|
| selfsame verify | capture inputs from your tests, replay base vs head, per-function/method verdict (+ CI exit code) |
| selfsame check | generate inputs and check two files or two git refs |
| selfsame capture | record real call arguments from any test or app command |
| selfsame replay | replay captured arguments across two git refs |
| selfsame attach | on-demand capture flush from a running, hook-enabled process |
| selfsame demo | run the built-in corpus end-to-end |
Each verdict is one of: equivalent (trustworthy pass), divergent (shows the
input + before→after), unverifiable (nondeterministic / uncontrolled I/O — with
cause), or unsupported (no input strategy). Everything also works as
python -m probe.<cmd>.
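As a rough illustration of how the four verdicts could map onto a CI exit code, here is a minimal sketch. The `Verdict` enum and `exit_code` helper are hypothetical names, not Selfsame's API, and the choice of `1` as the failing code is an assumption; the README only states that the exit is non-zero when a divergence is caught.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical enum mirroring the four verdict names used in this README.
    EQUIVALENT = "equivalent"
    DIVERGENT = "divergent"
    UNVERIFIABLE = "unverifiable"
    UNSUPPORTED = "unsupported"


def exit_code(verdicts):
    """Return 1 if any function diverged, else 0 (the value 1 is an assumption)."""
    return 1 if any(v is Verdict.DIVERGENT for v in verdicts.values()) else 0


results = {
    "mypkg.parse": Verdict.EQUIVALENT,
    "mypkg.load": Verdict.UNVERIFIABLE,
    "mypkg.total": Verdict.DIVERGENT,
}
```

In this sketch a refusal (unverifiable or unsupported) does not fail the build on its own; only a caught divergence does.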
Project
- Contributing: CONTRIBUTING.md · Releasing: RELEASING.md
- Changelog: CHANGELOG.md · Security: SECURITY.md
- Design rationale & validation: experiments/FINDINGS.md
- License: MIT
How it works (the demo)
selfsame demo # or: python3 run_probe.py
The demo runs the engine against a hand-built corpus (units/). It's pure stdlib
and re-execs once to fix PYTHONHASHSEED=0 so hash/set ordering is controlled for
the whole run.
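The re-exec is needed because hash randomization is decided when the interpreter starts, so setting the variable inside a running process has no effect on it. A minimal sketch of the idea follows; `hash_seed_env` and `ensure_fixed_hash_seed` are illustrative names, not the functions in run_probe.py.

```python
import os
import sys


def hash_seed_env(environ, seed="0"):
    """Return the environment to re-exec with, or None if the seed is already fixed."""
    if environ.get("PYTHONHASHSEED") == seed:
        return None
    env = dict(environ)
    env["PYTHONHASHSEED"] = seed
    return env


def ensure_fixed_hash_seed():
    """Re-exec the current script once so str/bytes hashing is reproducible."""
    env = hash_seed_env(os.environ)
    if env is not None:
        # Replaces the current process; the second start sees the fixed seed
        # and falls through without re-executing again.
        os.execve(sys.executable, [sys.executable] + sys.argv, env)
```

A script would call `ensure_fixed_hash_seed()` as its first statement. This sketch assumes the program was started as a plain script; a `python -m` entry point would need to rebuild its command line differently.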
Check a real refactor
The demo above runs the engine against a hand-built corpus. To point it at
actual code — two versions of a module — use probe.check. It extracts the
top-level functions present in both versions, pairs the ones whose signatures
are unchanged, and checks each in an isolated subprocess:
# two files on disk
python3 -m probe.check before.py after.py
# two git refs + a path in the repo
python3 -m probe.check --git main HEAD app/calc.py
Try it on the bundled example (an equivalent refactor, one real bug, an entropy-using function, an unannotated function, and a signature change):
python3 -m probe.check examples/calc_before.py examples/calc_after.py
Each matched function gets one verdict: equivalent (trustworthy pass),
divergent (shows the input + before→after), unverifiable (nondeterministic,
with cause), or unsupported (no input-generation strategy for its types). The
command exits non-zero when any divergence is caught, so it can gate CI. This is
the narrow-but-real slice: it works today on deterministic functions with
type-hinted, generatable parameters — not yet on stateful classes, I/O against
live systems, or cross-file refactors.
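The pairing step described above (top-level functions present in both versions, kept only when the signature is unchanged) can be sketched with the standard `ast` module. This is an illustration under assumptions, not the code in probe/extract.py: the helper names are hypothetical, and "signature" here means the argument list only (names, annotations, defaults), not the return annotation.

```python
import ast


def top_level_functions(source):
    """Map name -> function node for functions defined at module top level."""
    tree = ast.parse(source)
    return {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def pair_functions(before_src, after_src):
    """Return (paired, skipped) names among functions present in both versions."""
    before = top_level_functions(before_src)
    after = top_level_functions(after_src)
    paired, skipped = [], []
    for name in sorted(before.keys() & after.keys()):
        # ast.dump omits line/column info by default, so only structure counts.
        if ast.dump(before[name].args) == ast.dump(after[name].args):
            paired.append(name)
        else:
            skipped.append(name)
    return paired, skipped


BEFORE = (
    "def add(a: int, b: int) -> int:\n    return a + b\n\n"
    "def scale(x, k=2):\n    return x * k\n\n"
    "def gone():\n    pass\n"
)
AFTER = (
    "def add(a: int, b: int) -> int:\n    return b + a\n\n"
    "def scale(x, k=3):\n    return x * k\n"
)
```

Here `add` is paired (body changed, signature intact), `scale` is skipped because its default changed, and `gone` is ignored because it exists in only one version.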
Verify a refactor with the repo's own tests (the main path)
probe.check generates inputs, which fails on real code that is untyped or in a
package (see experiments/FINDINGS.md). The capture-replay path instead records
real arguments from an existing test run and replays both versions, loaded
package-aware from git worktrees. No type hints required; relative imports work;
methods on classes are supported (the receiver self is captured and rebuilt
against each version).
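To make the capture-replay idea concrete, here is a deliberately simplified, in-process sketch. It is not how Selfsame is implemented (the README describes an injected hook, git worktrees, and isolated subprocesses); `capturing`, `observe`, and `replay` are hypothetical names, and arguments that cannot be pickled are simply skipped.

```python
import functools
import pickle


def capturing(fn, log):
    """Wrap fn so every call's arguments are pickled into log before it runs."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            log.append(pickle.dumps((args, kwargs)))
        except Exception:
            pass  # assumption: unpicklable arguments are not captured
        return fn(*args, **kwargs)
    return wrapper


def observe(fn, blob):
    """Run fn on a fresh copy of the captured arguments; record return or exception type."""
    args, kwargs = pickle.loads(blob)
    try:
        return ("return", fn(*args, **kwargs))
    except Exception as exc:
        return ("raise", type(exc).__name__)


def replay(log, base_fn, head_fn):
    """Return one (args, kwargs, before, after) tuple per captured call that diverges."""
    divergences = []
    for blob in log:
        before, after = observe(base_fn, blob), observe(head_fn, blob)
        if before != after:
            args, kwargs = pickle.loads(blob)
            divergences.append((args, kwargs, before, after))
    return divergences


def mean_base(xs):
    return sum(xs) / len(xs)


def mean_head(xs):
    # "Refactor" that silently changes behavior on an empty list.
    return sum(xs) / max(len(xs), 1)


log = []
traced = capturing(mean_base, log)
traced([1, 2, 3])
try:
    traced([])  # a test that exercises the error path
except ZeroDivisionError:
    pass
divergences = replay(log, mean_base, mean_head)
```

The empty-list call is the one that exposes the change: the base version raises, the head version returns `0.0`. Each version gets a freshly unpickled copy of the arguments so a mutation by one cannot leak into the other.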
One command — run it from the repo root:
# "did my working-tree refactor change behavior vs main?"
python3 -m probe.verify --base main --modules mypkg -- pytest -q
# any test runner works (capture is injected into every spawned process)
python3 -m probe.verify --base v1.2 --head HEAD --modules mypkg -- python -m unittest
It captures inputs while the tests run, replays both versions, prints a
per-function/method verdict, and exits non-zero if any divergence is caught
(drop it in CI). --head defaults to your current working tree.
Because the probe runs the target's code and tests, it must use a Python the
target supports. Pass --python /path/to/pythonX.Y to run the tests and replay
workers under that interpreter; the repo's requires-python is checked and a
mismatch is reported loudly instead of silently capturing nothing:
python3 -m probe.verify --base main --modules cachetools \
--python /path/to/py310/bin/python -- python -m pytest -q
Per-function replay runs in parallel; a function whose replay exceeds
PROBE_WORKER_TIMEOUT is reported timeout (not-comparable), never a false pass.
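The "timeout is never a pass" rule can be sketched as follows. This is an assumption-laden illustration, not the replay worker itself: `run_with_timeout` and `run_all` are hypothetical helpers, each job is just a Python snippet run in a child interpreter, and the timeout is passed explicitly rather than read from PROBE_WORKER_TIMEOUT.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(code, timeout_secs):
    """Run a snippet in a child interpreter; report a timeout as its own status."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_secs,
        )
    except subprocess.TimeoutExpired:
        # The child is killed; there is no result to compare, so this can
        # never be counted as "equivalent".
        return ("timeout", None)
    if proc.returncode != 0:
        return ("error", proc.stderr.strip())
    return ("ok", proc.stdout.strip())


def run_all(jobs, timeout_secs):
    """Run every job in parallel; jobs maps a label to a code snippet."""
    with ThreadPoolExecutor() as pool:
        pairs = pool.map(
            lambda item: (item[0], run_with_timeout(item[1], timeout_secs)),
            jobs.items(),
        )
        return dict(pairs)
```

For example, `run_all({"fast": "print(1 + 1)", "slow": "import time; time.sleep(60)"}, 5)` would report the second job as a timeout while the first completes normally.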
For CI on a PR, add --changed-only to check just the functions whose body
changed between base and head (the rest are unchanged and uninteresting):
python3 -m probe.verify --base main --modules mypkg --changed-only -- pytest -q
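One plausible way to decide which function bodies changed is to compare their syntax trees rather than their text, so comments and formatting are ignored. The sketch below assumes that approach; the README does not say how `--changed-only` is implemented, and `function_bodies` / `changed_functions` are hypothetical names.

```python
import ast


def function_bodies(source):
    """Map qualified name -> a dump of the body for functions and methods."""
    bodies = {}

    def visit(nodes, prefix):
        for node in nodes:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                bodies[prefix + node.name] = "\n".join(
                    ast.dump(stmt) for stmt in node.body
                )
            elif isinstance(node, ast.ClassDef):
                visit(node.body, prefix + node.name + ".")

    visit(ast.parse(source).body, "")
    return bodies


def changed_functions(base_src, head_src):
    """Names present in both versions whose body AST differs."""
    base, head = function_bodies(base_src), function_bodies(head_src)
    return sorted(n for n in base.keys() & head.keys() if base[n] != head[n])


BASE = (
    "def f(x):\n    return x + 1\n\n"
    "class C:\n    def m(self):\n        return 1\n\n"
    "def g(x):\n    return x\n"
)
HEAD = (
    "def f(x):\n    # comment only\n    return x + 1\n\n"
    "class C:\n    def m(self):\n        return 2\n\n"
    "def g(x):\n    y = x\n    return y\n"
)
```

In this example `f` is not reported because only a comment was added. Note that a docstring edit would count as a body change in this sketch, since a docstring is an ordinary statement in the AST.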
Inputs from a real app, not just tests
The capture command after -- can be anything that runs your code — a script,
an integration harness, or a server — so inputs aren't limited to your test
suite:
# capture real call arguments from an actual app run
probe capture --modules mypkg --out caps.pkl -- python -m myapp run-some-workload
probe replay /path/to/repo main HEAD caps.pkl
For a long-running process (a server you exercise by hand), the hook flushes
captures every few seconds (PROBE_CAPTURE_FLUSH_SECS), so an abrupt SIGTERM/
SIGKILL still leaves a usable capture file.
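For a periodic flush to survive an abrupt kill, each flush has to leave a complete file behind. One common way to get that is to write to a temporary file and rename it over the target, sketched below. Whether Selfsame's hook does exactly this is not stated in the README; `flush_captures` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def flush_captures(captures, path):
    """Write captures to path so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(list(captures), fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace swaps the file in a single step on the same filesystem,
        # so the target is always either the old or the new complete snapshot.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A background timer calling this every few seconds means a kill between flushes loses at most the calls recorded since the last flush.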
You can also snapshot a running, hook-enabled process on demand without stopping it:
# start the process under capture with a known dump directory
probe capture --modules mypkg --capture-dir ./caps --out caps.pkl -- python -m myapp serve
# ...later, in another shell, dump its current captures (process keeps running):
probe attach <pid> --capture-dir ./caps # writes caps/cap-<pid>.pkl
probe attach sends the hook's flush signal (default SIGUSR1, override with
PROBE_CAPTURE_FLUSH_SIGNAL). This works only for processes started under the
hook — it does not inject into an arbitrary unmodified process (that needs
ptrace/gdb and is heavily restricted, especially on macOS under SIP; see
experiments/FINDINGS.md §9).
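The reason this only works for processes started under the hook is that the process itself must have registered a handler for the flush signal. A minimal POSIX-only sketch of that registration follows; `install_flush_handler` is a hypothetical name, and the sketch hard-codes SIGUSR1 instead of reading PROBE_CAPTURE_FLUSH_SIGNAL, whose value format is not documented here.

```python
import signal


def install_flush_handler(flush, signum=signal.SIGUSR1):
    """Call flush() whenever signum arrives (must run in the main thread)."""
    def handler(_signum, _frame):
        flush()

    signal.signal(signum, handler)
    return signum


# Demonstration: the process signals itself, standing in for `probe attach <pid>`.
flushed = []
install_flush_handler(lambda: flushed.append("flush"))
signal.raise_signal(signal.SIGUSR1)
```

Without the handler installed, the default action for SIGUSR1 terminates the process, which is why sending the signal to an arbitrary unmodified process is not a substitute.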
Capture and replay are also available separately (probe.capture --modules M --out caps.pkl -- <test cmd> then probe.replay <repo> <base> <head> caps.pkl).
Measured: on inflection (untyped history) this turns probe.check's 0% into
100% sound auto-verify (10 equivalent, 3 real behavior changes caught) across
20 real commits — because the inputs come from tests, not from guessing. Coverage
then tracks test coverage; the soundness rules (refuse uncontrolled I/O / threads
/ nondeterminism / opaque returns) are unchanged.
What it does
For every unit (an original + a refactored function):
- Generates inputs from type hints (probe/generators.py) plus any unit-supplied seed inputs. (Hypothesis is the production upgrade.)
- Self-checks determinism (probe/harness.py): runs the original 3× per input under a controlled environment (frozen clock, seeded RNG, fixed hash seed, recorded I/O). If the three runs disagree, the unit is unverifiable, and the cause is classified (concurrency / uncontrolled-time / uncontrolled-entropy / unknown) by counting threads started and direct time/entropy calls. This is the negative control: stable code must never be flagged.
- Diffs the versions (only on deterministic units): runs original vs refactored on the same inputs and compares observed behavior: return value plus the ordered trace of external effects plus exceptions. Any mismatch is a caught behavioral divergence.
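The self-check-then-diff flow in the list above can be sketched in a few lines. This is a simplified illustration, not probe/harness.py: the only "controlled environment" here is reseeding the `random` module, there is no cause classification, and all function names are hypothetical.

```python
import random


def run_controlled(fn, args, seed=0):
    """Observe one call with the module-level RNG reseeded first."""
    random.seed(seed)
    try:
        return ("return", fn(*args))
    except Exception as exc:
        return ("raise", type(exc).__name__)


def self_check(fn, inputs, runs=3):
    """True if fn gives the same observation on every run for every input."""
    for args in inputs:
        observations = [run_controlled(fn, args) for _ in range(runs)]
        if any(obs != observations[0] for obs in observations[1:]):
            return False
    return True


def check_unit(original, refactored, inputs):
    """Refuse nondeterministic units; otherwise diff the two versions."""
    if not self_check(original, inputs):
        return ("unverifiable", None)
    for args in inputs:
        before = run_controlled(original, args)
        after = run_controlled(refactored, args)
        if before != after:
            return ("divergent", (args, before, after))
    return ("equivalent", None)


def clamp_v1(x, lo, hi):
    return max(lo, min(x, hi))


def clamp_v2(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_buggy(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi - 1  # off-by-one positive control
    return x


def seeded_pick(xs):
    return random.choice(xs)  # deterministic once the RNG is seeded


def uses_os_entropy(x):
    return x + random.SystemRandom().random()  # not affected by random.seed


INPUTS = [(5, 0, 10), (-1, 0, 10), (11, 0, 10)]
```

`seeded_pick` passes the self-check because the seed is reset before each run, while `uses_os_entropy` draws from the operating system and is refused instead of being compared.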
Effects (probe/effects.py) are injected and recorded: the trace of external calls is
treated as part of behavior, so a refactor that changes which calls fire (or their
order) is detected, not hidden.
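To show why the effect trace has to be part of the comparison, here is a small sketch with a hypothetical `RecordedEffects` class. It is not the API of probe/effects.py; the `call` method, the canned responses, and the two `checkout` functions are assumptions made for illustration.

```python
class RecordedEffects:
    """Stand-in for external calls: returns canned responses and records order."""

    def __init__(self, responses):
        self.responses = dict(responses)
        self.trace = []

    def call(self, name, *args):
        self.trace.append((name, args))
        return self.responses.get(name)


def observe(fn, *args):
    """Return (result, ordered effect trace) for one call."""
    fx = RecordedEffects({"fetch_price": 10})
    result = fn(fx, *args)
    return result, fx.trace


def checkout_v1(fx, item):
    fx.call("log", item)
    return fx.call("fetch_price", item)


def checkout_v2(fx, item):
    price = fx.call("fetch_price", item)
    fx.call("log", item)  # same calls, different order
    return price


before = observe(checkout_v1, "apple")
after = observe(checkout_v2, "apple")
```

Both versions return `10`, so a comparison of return values alone would pass; comparing the `(result, trace)` pairs exposes the reordering.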
What the result means
The included corpus (units/) is a stratified stand-in for real LLM-refactored OSS
code: pure, time/RNG, I/O, stateful, concurrent, plus three positive controls (refactors
that deliberately change behavior and must be caught). Running it demonstrates the engine
end-to-end and shows the controls firing:
- concurrency units → flagged unverifiable (cause: concurrency)
- positive controls → 3/3 caught (off-by-one, dropped zero-guard, changed default)
- everything else → verified equivalent, zero false positives
The 89% coverage on this corpus is not a real-world estimate — the mix is deliberately tractable. The real number comes from swapping in real refactors (below).
To run the real probe
Replace the corpus with real material:
- Sample ~16 units from real OSS Python repos, stratified per the protocol (don't cherry-pick pure functions).
- Generate each refactored version with an actual LLM ("refactor, preserve behavior").
- Keep the three positive controls and the A-vs-A negative control.
- For real I/O, swap the deterministic Effects stubs for vcrpy-style record-replay against real recorded responses.
- Read the verdict; if it lands 40–65%, re-run a confirmation in a typed language (Go).
Files
run_probe.py              entry point for the corpus demo (fixes hash seed)
probe/verify.py           CLI: one-command verify via the repo's tests
probe/check.py            CLI: check a real refactor (two files or git refs)
probe/capture.py          CLI: capture real call args from any test command
probe/attach.py           CLI: on-demand flush of a running hook-enabled process
probe/_capture_hook.py    capture hook injected into spawned processes
probe/replay.py           CLI: replay captured args across two refs (worktrees)
probe/canonical.py        JSON canonical value form (cross-process compare)
probe/extract.py          pull + pair functions from two module versions
probe/_worker.py          isolated per-unit subprocess worker
probe/_replay_worker.py   per-version replay subprocess (functions + methods)
probe/effects.py          recorded, deterministic effect shims
probe/generators.py       type-hint-driven input generation
probe/harness.py          observe / self-check / diff / classify (the core)
probe/equality.py         structural value equality (not repr)
probe/runner.py           orchestration, metrics, thresholds, verdict
probe/model.py            Unit dataclass
units/                    the stratified corpus + positive controls
examples/                 calc_before.py / calc_after.py for probe.check
tests/                    unit + end-to-end tests (python3 -m unittest discover -s tests)
Soundness (the verifier must never be confidently wrong)
A black-box checker can't prove equivalence (the input space is infinite), so
the one thing it must never do is say "equivalent" when behavior actually
differs. We measured this honestly: a stratified, not-cherry-picked corpus
(experiments/, run python3 experiments/measure.py) scores each verdict
against author ground truth.
| | naive tally | reality |
|---|---|---|
| before soundness work | 75% "verifiable" | 42% trustworthy, 33% confidently wrong |
| after soundness work | 50% "verifiable" | 50% trustworthy, 0% confidently wrong |
The three fixes that closed the gap:
- Uncontrolled I/O is refused, not certified. The harness counts real file and socket access at runtime, and a static scan flags functions that can reach the network/subprocess even when the sampled inputs don't. Either way → unverifiable (uncontrolled-io). (Code that routes I/O through the recorded Effects shim stays verifiable.)
- Any thread use is unverifiable, even if the sampled runs happened to agree. A race that didn't manifest is not a guarantee.
- Literals are mined from the code and fed back as inputs, so a bug hinging on a magic value (e.g. a parser that special-cases "on") is caught instead of missed by a fixed value pool.
Residual honest gap: a function whose risky path is only reached by an input the generator never produces (e.g. a valid URL string) can still read "equivalent". This is the fundamental limit of example-based generation: Hypothesis / coverage-guided generation is the real fix, and the I/O static scan already covers the common cases.
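The literal-mining fix from the list above narrows this gap for magic values that appear in the source. A minimal sketch of the idea with the standard `ast` module is shown here; `mine_literals` is a hypothetical helper, not the generator's actual code, and it tags each value with its type name so that `1` and `1.0` stay distinct.

```python
import ast


def mine_literals(source):
    """Collect (type name, value) pairs for literals appearing in the code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Constant):
            continue
        value = node.value
        # bool is a subclass of int; skip it so True/False are not mined as 1/0.
        if isinstance(value, bool):
            continue
        if isinstance(value, (int, float, str, bytes)):
            found.add((type(value).__name__, value))
    return found


SRC = (
    "def parse_flag(text):\n"
    "    if text == \"on\":\n"
    "        return True\n"
    "    return text == \"1\"\n"
)
```

Feeding `"on"` and `"1"` back as candidate inputs for `text` exercises the special-cased branch that a fixed pool of strings would likely miss. Docstrings would also be collected by this sketch, which is harmless but noisy.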
Honesty notes (what this engine does and does not guarantee)
- Equivalence is structural, not repr-based. Objects with only identity equality are compared by their __dict__/__slots__ state; floats handle nan/-0.0; an object we cannot introspect is reported not provably equal rather than guessed.
- Determinism control is broad but bounded. The harness freezes the clock (time.*, time.*_ns, datetime.now/utcnow/today) and seeds entropy (random, os.urandom, random._urandom, uuid4/1, secrets). It cannot intercept from datetime import datetime (reference captured at import) or per-instance random.Random(...); those surface as an unverifiable verdict, never as silent false confidence.
- Unsupported inputs are refused, not faked. If the generator has no strategy for a parameter's type, the unit is reported unsupported (counts against coverage) instead of fed a placeholder value.
- Input generation is bounded, not exhaustive. The stdlib generator caps the number of input combinations per function, so an "equivalent" verdict means "equivalent on the inputs tried", not a proof. Hypothesis is the intended upgrade for real coverage.
- Isolation is per-unit, not per-call. Each function is checked in its own subprocess (crashes and runaway loops are contained, side effects are kept out of the parent), but the function still runs several times within that process. True per-call sandboxing (containers) is future work.
- The coverage % is corpus-relative. It is a property of this hand-built stand-in corpus, not a real-world estimate. See "To run the real probe".
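The first note, on structural equality, can be illustrated with a short sketch. It is not probe/equality.py: `structurally_equal` is a hypothetical function, it ignores `__slots__` and cyclic structures, and it assumes that "handling" nan and -0.0 means nan equals nan while 0.0 and -0.0 are treated as different.

```python
import math


def structurally_equal(a, b):
    """Compare two values by structure; refuse when state cannot be inspected."""
    if type(a) is not type(b):
        return False
    if isinstance(a, float):
        if math.isnan(a) or math.isnan(b):
            return math.isnan(a) and math.isnan(b)
        # Distinguish 0.0 from -0.0 by sign, which == alone does not.
        return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)
    if isinstance(a, (list, tuple)):
        return len(a) == len(b) and all(
            structurally_equal(x, y) for x, y in zip(a, b)
        )
    if isinstance(a, dict):
        return a.keys() == b.keys() and all(
            structurally_equal(a[k], b[k]) for k in a
        )
    if type(a).__eq__ is object.__eq__:
        # Identity-only equality: fall back to comparing instance state.
        if hasattr(a, "__dict__"):
            return structurally_equal(vars(a), vars(b))
        return False  # nothing to inspect, so not provably equal
    return a == b


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
```

Two `Point` instances with the same fields compare equal here even though `Point` defines no `__eq__`, while two bare `object()` instances are reported as not provably equal.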