Skip to main content

An SMT-based verifier for Python with a machine-checked trust base.

Project description

Touchstone

An SMT-based verifier for Python. Touchstone takes a function and a property and returns PROVED (it holds for all inputs), REFUTED (with a counterexample), or UNKNOWN (with a reason), by translating the code to Z3 rather than running it.

import touchstone as t

# state the property in Python, over the parameters and `result`
t.prove("def f(x):\n    return x + x\n", "result == 2 * x").status        # 'PROVED'

# or write the contract as decorators on the function itself
t.verify_contracts('''
@require("n >= 0")
@ensure("result == n")
def count(n):
    i = 0
    while i < n:
        i = i + 1
    return i
''').status                                                              # 'PROVED'

# or check two implementations agree on every input
t.verify_equiv("double", "f", "def f(a):\n    return a + a\n",
               "def g(a):\n    return 2 * a\n", {}).status                # 'PROVED'

# or verify a whole program, resolving imports across modules through the call graph
t.check_program({
    "lib":  "def recip(n):\n    return 100 // n\n",        # traps when n == 0
    "main": "import lib\ndef f(n):\n    if n == 0:\n        return 0\n    return lib.recip(n)\n",
}, "main", "f").status                                                   # 'PROVED' (the guard covers the callee)

Command line

The same verbs run from the shell, with the process exit status mirroring the verdict (0 PROVED, 1 REFUTED, 2 UNKNOWN) so they compose in CI:

touchstone verify count.py                       # the @require / @ensure contracts written in a file
touchstone prove  f.py --ensures 'result == x'   # a postcondition over the parameters and `result`
touchstone equiv  impl.py spec.py --func f       # two implementations agree on every input
touchstone check  d.py                           # trap freedom (and any asserts) for all inputs
touchstone infer  m.py                           # sound over-approximate types of a return and its locals

A refutation comes back with the counterexample and the path it took:

$ touchstone prove f.py --ensures 'result == x'
REFUTED  [property via verified VC generator (Rocq-extracted wpg)]
  counterexample: x=0
  trace:
    line 2: return x + 1    [x=0]
    => returns 1

What it covers

Functional equivalence and predicates; whole-function and interprocedural reasoning over control flow with multiple loops, arbitrary nesting, break and continue, statements after a loop, and any step direction; self-recursion, mutual recursion, and recursion over lists; whole-program verification across module boundaries, resolving from-imports, qualified m.f() calls, and aliases through the call graph so a callee's trap is seen at the caller; deductive and synthesized loop invariants; abstract interpretation (interval, zone, octagon, Karr, polyhedra, machine-integer); IEEE-754 floating point total over every double, with Inf and NaN as first-class inputs, exact floor division and modulo, and sound over-approximations of sin, cos, exp, and log; arrays with quantified specifications; termination of counted loops, iteration over containers, and data-dependent loops, and cost; exceptions; rely-guarantee concurrency for all schedules and all depths; and separation logic with the frame rule, the magic wand, and inductive heap predicates.

Values carry their real types through the symbolic core; the heap models object identity, aliasing, and mutation; a sequence index is verified against the container's length and a dict key against the keys provably present (from a membership guard, an iteration, or a prior store), so a guarded access is proved safe and an unguarded one refuted with the out-of-range or missing-key witness; and these, with None in arithmetic, type mismatches, and division by zero, are traps that refute a totality claim. Fixed-width wraparound is checked alongside every integer proof. A property can be stated in Python over the parameters and result (prove), written as @require / @ensure decorators on the function (verify_contracts, in the style of the contracts and icontract packages), mined from the code's own assertions (check), or written directly as a Z3 predicate; a counterexample comes back with the execution trace and the path taken (explain).

SUBSET.md specifies the modeled subset precisely, in three tiers of trust, and enumerates the constructs that return UNKNOWN (async, metaclasses, non-identity decorators, dynamic reflection, and the rest), so it is clear in advance where a verdict will be withheld.

Soundness

Every construct is either encoded soundly or returned as UNKNOWN with a reason, so an unsupported feature is never silently skipped or assumed away. A PROVED is confirmed by a second independent solver (cvc5) and withheld unless both agree, runs under a deterministic resource bound so identical input yields an identical verdict on every machine, and carries a reproducibility certificate. proof_bundle exports that certificate as a re-checkable bundle -- the discharged SMT-LIB refutation queries, the solver versions, the deterministic configuration, and a content hash -- and recheck_bundle (or any SMT solver) re-verifies it independently, trusting the re-run solver rather than the engine that produced it.

prove, verify, and check establish partial correctness: a PROVED means that on every input meeting the precondition the function returns without a trap and the postcondition holds, not that it terminates -- termination is a separate obligation (verify_total, or check --total for trap freedom plus termination). When a true property exceeds the default deterministic bound and comes back UNKNOWN, a larger budget is available (touchstone ... --budget high), so that incompleteness is not invisible.

The trust base is machine-checked in Rocq (proofs/): the operational semantics of the modeled subset; the verification-condition generator over it, proven sound and complete for straight-line assignment and conditionals and sound for while-loops carrying a syntactic invariant, trap-aware so a reachable division or modulo by zero leaves the condition undischarged; a fixed-width two's-complement integer model proven to agree with unbounded arithmetic exactly when no operation overflows; the SMT-LIB division and modulo encoding proven to refine the theory for every conforming solver; the abstract-domain transfers (extracted to OCaml and run as the engine's operators); the translation as a semantics-preserving functor; and the end-to-end theorem that a discharged verification condition implies the property under the program's semantics. SMTCoq re-checks each integer obligation's certificate inside Coq's kernel.

Both the straight-line and the loop verification-condition generators are extracted from that proof to OCaml and transcribed in the engine, and a differential audit holds each transcription byte-for-byte equal to its extraction on a random corpus, so the generator code that runs is the one proven correct in Rocq, not a separate symbolic execution. The engine discharges the loop-free integer fragment through the straight-line generator directly, trap-aware so a reachable division by zero leaves the condition undischarged. With that core verified, the random differential checks against CPython are a completeness regression that measures precision rather than the barrier against an unsound verdict.

Type inference

The same symbolic core infers the type of every function return, parameter, and variable in code, in two modes. infer_types is over-approximating and sound: the reported set of type names is guaranteed to contain the value's runtime type (trusting a parameter's annotation as a contract where one is given), or the location is left UNKNOWN when no such bound can be established, so a stated type is never narrower than the truth. emit_facts is best-effort exact in the TypeEvalPy schema and discovers its own targets: it walks the module and emits a fact at every return, parameter, and binding it finds, typing each by carrying an argument's type across the call boundary into the parameter it reaches, following a value through reassignment and the narrowing of is None / isinstance guards, and resolving container element types and dict keys through the call graph, attributes set in a constructor, decorators, and generators.

Scored by TypeEvalPy's exact matcher at full source position, including column offset and with the analysis discovering each location, name, and kind on its own, the emitted facts are matched against the runtime-observed ground truth. As an independent check, the inference is run over pure-Python standard-library modules and compared against the type CPython produces at runtime: the sound mode's reported set is confirmed to contain the runtime type at every observed location, and the exact mode is held to the same emit-and-match standard, discovering the locations itself.

Recall is the fraction of ground-truth facts an emitted fact matches; precision the fraction of emitted facts that match one. The CPython cross-check knows a runtime type only at the locations its corpus reached, so its precision is measured over the emitted facts at those locations. python -m touchstone.typeeval runs the CPython cross-check; python -m touchstone.typeeval PATH scores a TypeEvalPy directory.

The commit rate is the fraction of ground-truth locations where the heuristic emits any type rather than abstaining (the refusal rate is its complement), and the accuracy where committed is the fraction of those commitments that match. Recall is the product of the two, so it rises either by committing on more locations or by being right more often where it commits; tracking them apart shows which lever is short.

Evaluation Recall Precision Commit Accuracy (committed)
TypeEvalPy micro-benchmark (emit-and-match) 812 / 868 (93.5%) 36.6% 96.0% 97.5%
TypeEvalPy autogen suite (emit-and-match) 71,653 / 77,268 (92.7%) 48.7% 94.1% 98.5%
CPython standard library (emit-and-match) 1067 / 1218 (87.6%) 86.4% 93.1% 94.1%

The sound mode reports a guaranteed bound rather than a single prediction, so it is measured differently: the commit rate is the fraction of locations where it gives a bound at all, the soundness the fraction of those whose bound contains the runtime type, and the exact rate where the bound equals the observed type. python -m touchstone.typeeval --sound reproduces it.

Sound mode (infer_types) Commit rate Soundness Exact
TypeEvalPy micro-benchmark 101 / 868 (11.6%) 100% 99.0%
TypeEvalPy autogen suite 13,829 / 77,268 (17.9%) 100% 98.5%
CPython standard library 54 / 1218 (4.4%) 100% 83.3%

The sound mode commits only where the type is fixed independent of the inputs (a literal, a fixed-return builtin, an operation over already-bounded values), so its reach is narrower than the heuristic's; but a committed bound is a proven over-approximation, and the runtime type fell inside every one.

The TypeEvalPy figures are scored against commit 3719de1 of TypeEvalPy, and the CPython cross-check ran on CPython 3.13.2; both the counts and the rates move with the benchmark commit and the standard-library version, so the tables are snapshots python -m touchstone.typeeval reproduces rather than fixed scores. The autogen suite is generated from templates the heuristic was tuned against and the micro suite is hand-written, whereas the CPython standard library is independent of both: it is the held-out measurement, and its lower accuracy is the out-of-distribution result the benchmark numbers do not show on their own.

Run

pip install touchstone-prover  # z3-solver and cvc5, pinned in pyproject.toml
python -m touchstone.ci      # self-tests, soundness audits, completeness regressions -> "CI OK"
python -m touchstone.examples  # one runnable example per capability, each verdict asserted
python -m touchstone.typeeval  # type-inference recall + precision (CPython cross-check)
python -m touchstone         # a demonstration

The machine-checked proofs run under the Rocq 9.0 opam switch; verify_coq.sh also invokes the SMTCoq certificate check when its separate toolchain (see proofs/toolchain.lock) is present, and skips it cleanly otherwise:

eval "$(opam env --switch=rocq9)" && cd proofs && bash verify_coq.sh

Layout

touchstone/      package: core, domains, engines, theories, vcgen, audit, ci, examples (_impl is the engine)
proofs/          Rocq + SMTCoq proofs, the extracted VC generators + interval operators, verify_coq.sh
.github/         continuous integration: the audits and the proof gate on every change
pyproject.toml   package metadata and pinned Python dependencies
SUBSET.md        the modeled subset of Python, in three tiers of trust, and what returns UNKNOWN

License

MIT. See LICENSE.

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

touchstone_prover-0.3.13.tar.gz (315.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

touchstone_prover-0.3.13-py3-none-any.whl (323.3 kB view details)

Uploaded Python 3

File details

Details for the file touchstone_prover-0.3.13.tar.gz.

File metadata

  • Download URL: touchstone_prover-0.3.13.tar.gz
  • Upload date:
  • Size: 315.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for touchstone_prover-0.3.13.tar.gz
Algorithm Hash digest
SHA256 d4cbcdd4abc115fe1a25c35d504f961164432fdb01a8135bcbfa2a00db746305
MD5 33ec79c405f9002eda267e47badc238d
BLAKE2b-256 24af617ff667dfa02c90fabd91d9e7ede7005eab91c0c70f4a1941981edd6081

See more details on using hashes here.

Provenance

The following attestation bundles were made for touchstone_prover-0.3.13.tar.gz:

Publisher: publish.yml on CharlesCNorton/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file touchstone_prover-0.3.13-py3-none-any.whl.

File metadata

File hashes

Hashes for touchstone_prover-0.3.13-py3-none-any.whl
Algorithm Hash digest
SHA256 ae072a9e8ded5ca3701fdafd68d8f117546ecb70c5186219b89c281b26413210
MD5 a3bac1875242c12cad8c958c672a2b0b
BLAKE2b-256 accb493d54d4206654e88968d127888af9d46c394041d1c04f42e33c53a9e552

See more details on using hashes here.

Provenance

The following attestation bundles were made for touchstone_prover-0.3.13-py3-none-any.whl:

Publisher: publish.yml on CharlesCNorton/touchstone

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page