Verify that implementations satisfy intent specifications. A pure verifier for AI-generated Python code — it checks, it never generates.
Project description
pyintent
Write down what a function is supposed to do. Let an AI coding tool write the implementation. pyintent checks that the implementation actually does what you said.
from pyintent import spec, reads, throws
@spec(
intent = "Return the order with the given id from the database.",
where = ["order_id > 0"],
ensures = ["result.id == order_id"],
effects = [reads("db"), throws(NotFoundError)],
ex = [
"(42,) -> _",
"(999,) -> raises NotFoundError",
"(0,) -> raises ValueError",
],
)
def find_order(order_id: int) -> Order:
... # implemented by Claude Code / Copilot / Devin / you
pyintent verify myapp/orders.py # run all verifiers, human-readable report
pytest --pyintent # or run specs as pytest items
pyintent is a pure verifier. It never calls an LLM. Like mypy checks types without generating code, pyintent checks that an implementation matches its declared intent: examples, pre/post-conditions, effects, and types. Every check is deterministic: it passes, or it tells you exactly what's wrong and where.
Why I built this
I started with a different question: would LLMs write better code if we gave them a language designed for them? I prototyped one with Replit agents. The answer was no. A model that has never seen a language in its training data writes it badly, no matter how friendly the syntax is, and no amount of prompt-side documentation made up the gap. The training data wins.
But the research I read while designing that language kept pointing at a handful of things that reliably do help LLMs produce correct code: stating intent explicitly, giving concrete input/output examples, declaring side effects, and giving the model fast deterministic feedback it can iterate against. (The specific papers are in The research behind the design.)
So pyintent is that language idea inverted. Instead of new syntax the model has never seen, it's a contract you attach to ordinary Python, the language LLMs know best, plus a verifier that checks the contract mechanically. The spec states what the code should do; the AI writes the code; pyintent verify closes the loop.
The research behind the design
Nothing in @spec is invented here. Each piece comes from somewhere, either from the contracts-and-testing literature or from what the LLM code-generation papers found actually helps. These are the papers that shaped the design.
The contract idea is old. Preconditions and postconditions as the definition of program correctness go back to Dijkstra's guarded commands (1975), and attaching them to code as first-class, checkable clauses is Meyer's Design by Contract (1992). where, ensures, and invariants are that vocabulary almost verbatim. At the heavyweight end, Dafny (Leino, 2010) proves contracts statically. pyintent sits deliberately at the lightweight end: same contract shape, checked by execution and property testing rather than proof, on unmodified Python.
Checking contracts by generated inputs. The technique behind the properties verifier is QuickCheck (Claessen & Hughes, 2000): generate random inputs, assert declared properties. pyintent uses its Python descendant Hypothesis (MacIver et al.) directly. Vikram et al. found the painful parts of property-based testing are writing input generators and inventing properties; pyintent builds generators from your type hints, so the spec author only writes the properties.
Why not a new language. Code LLM quality tracks how often a language appears in training data. MultiPL-E (Cassano et al.) benchmarked code generation across 19 languages and found performance drops off sharply for low-resource ones, and MultiPL-T showed that closing the gap takes large-scale training data, not something you can fix from the prompt side. A brand-new language is the lowest-resource language possible, which is why pyintent attaches contracts to Python instead.
Why examples are the core of the spec. AlphaCode (Li et al.) credited much of its competition performance to filtering a huge pool of sampled programs by executing them against the problem's example tests; CodeT (Chen et al.) got large gains from the same move with generated tests. TiCoder (Lahiri et al.) showed that a handful of concrete test cases also works as a formalization of user intent: it pins down what you meant, not just what passes. pyintent's ex field is that idea as a one-line syntax.
Why postconditions. Endres et al. found that natural-language intent can be translated into formal postconditions that are usually correct and genuinely discriminating: their generated postconditions caught 64 real historical bugs in Defects4J. ensures is the hand-written version of the same contract.
Why a deterministic verifier in the loop. Iterating against feedback demonstrably helps: Self-Debugging (Chen et al.) with execution results, Reflexion (Shinn et al.) with environment signals like failing tests, Self-Refine (Madaan et al.) with the model's own critique. But Olausson et al. found self-repair is bottlenecked exactly there: models are bad at judging their own code. That's the argument for pyintent being a pure verifier: the feedback the model iterates against comes from running the contract, never from asking a model whether it thinks the code is right.
What I can and can't claim
I tried to show that pyintent improves LLM coding-benchmark scores and never found an experiment design that made a convincing case. Benchmark tasks like HumanEval and MBPP are small, self-contained, and already specified by their test suites, which is precisely the situation where a spec layer adds the least. If anything, the benchmark literature argues for richer checks than benchmarks themselves use: EvalPlus added 80× more tests to HumanEval and found a meaningful share of "passing" solutions were wrong.
The bet I'm actually making is on maintainability. A test suite tells you what code does today; a spec records what it was for. Six months from now, when an AI tool (or a person) rewrites find_order, the contract is still attached to the function: the intent, the examples, the declared effects, the exceptions it's allowed to raise. Any regeneration of the code gets re-verified against the original intent instead of against whatever the last version happened to do.
That's a hypothesis, not a measured result. If you have an idea for a fair experiment, I'd genuinely like to hear it. Open an issue.
Install
pip install pyintent # core (examples, properties, effects)
pip install pyintent[types] # + mypy integration
pip install pyintent[dev] # + mypy + pytest-asyncio
Or with conda:
conda install -c conda-forge pyintent
Quick start
from pyintent import spec, pure
@spec(
intent = "Return the absolute value of x.",
effects = [pure],
ensures = ["result >= 0", "result == x or result == -x"],
ex = ["(3,) -> 3", "(-4,) -> 4", "(0,) -> 0"],
)
def my_abs(x: int) -> int:
return x if x >= 0 else -x
$ pyintent verify mymodule.py
[PASS] examples my_abs (3,) -> 3
[PASS] examples my_abs (-4,) -> 4
[PASS] examples my_abs (0,) -> 0
[PASS] properties my_abs 2 ensures held over 50 examples
[PASS] effects my_abs pure no impure calls found
[PASS] types mymodule.py mypy clean
6 passed 0 failed 0 errored 0 skipped
The @spec decorator
@spec accepts these fields:
| Field | Type | Description |
|---|---|---|
intent |
str (required) |
One-line description of what the function does and why. |
where |
list[str] |
Preconditions: Python expressions that must hold over the inputs. |
ensures |
list[str] |
Postconditions: Python expressions over inputs and result. |
effects |
list[Effect] |
Declared side-effects (see below). |
ex |
list[str] |
Runnable examples in "(args) -> expected" format. |
perf |
Perf |
Advisory complexity, e.g. Perf(time="O(n)"). |
invariants |
list[str] |
Class/module-level invariants (plain strings or expressions). |
@spec must be the outermost decorator and returns the target unchanged. It only attaches metadata, so there is no runtime overhead.
Verifiers
examples — run concrete cases
Each ex string has the format "(args) -> expected":
ex = [
"(1, 2) -> 3", # must return 3
"(0,) -> raises ValueError", # must raise ValueError
"('hi',) -> _", # wildcard: any return without raising
]
- The left side is a tuple literal (single-arg tuples need a trailing comma:
(42,)). raises ExcTypematches if the call raises that type or a subclass._matches any non-raising return.- Values are evaluated in the module's global namespace, so domain objects and enums resolve correctly.
properties — hypothesis-based postcondition testing
For functions with ensures and no impure effects, pyintent generates inputs from type hints using Hypothesis, filters them through where, and asserts every ensures expression:
@spec(
intent = "Sort a list of integers in ascending order.",
effects = [pure],
where = ["len(xs) < 1000"],
ensures = [
"len(result) == len(xs)",
"all(result[i] <= result[i+1] for i in range(len(result)-1))",
],
)
def sort_ints(xs: list[int]) -> list[int]:
return sorted(xs)
ensures expressions may reference input parameters and result (the return value).
types — mypy integration
Runs mypy over the target file. Skipped (not failed) if mypy is not installed. Install it with pip install pyintent[types].
effects — AST-based effect checking
Three effects are actively verified in v0.1:
| Effect | What is checked |
|---|---|
pure |
No calls to impure builtins (print, open, …) or modules (os, sys, random, requests, …), no global/nonlocal writes. |
async_ |
The function must be defined with async def. |
throws(ExcA, ExcB) |
Every explicitly raised exception type is declared. Raises that don't statically name a type (raise err through a variable, raise make_err() through a factory) are skipped and counted in the summary. |
These effects are declaration-only (recorded but not verified):
reads("db"), writes("cache"), network("stripe"), io
A function may combine multiple effects:
effects = [reads("db"), throws(NotFoundError, ValueError)]
The purity check is deliberately shallow — it doesn't follow calls into helper functions. It catches the common cases and reports the offending file line for each violation.
CLI usage
# Write the spec-authoring guide into your AI tool's prompt files
# (AGENTS.md, CLAUDE.md, .github/copilot-instructions.md, etc.)
pyintent init
# Print the spec-authoring guide to stdout
pyintent prompt
# Validate spec structure by importing files (functions are not called)
pyintent check myapp/
# Require every public function to have a @spec (names starting
# with "_" are exempt)
pyintent check --require-specs myapp/
# Run all verifiers and report results
pyintent verify myapp/orders.py
pyintent verify myapp/
# Machine-readable JSON output
pyintent verify --json myapp/ > results.json
# Run only specific verifiers
pyintent verify --only examples --only properties myapp/
Exit codes: 0 all good, 1 verification failures, 2 usage or load error.
pytest plugin
The pytest plugin is opt-in; installing pyintent does not change how existing pytest runs behave.
Enable it on the command line:
pytest --pyintent
Or permanently in pyproject.toml:
[tool.pytest.ini_options]
pyintent = true
Each spec becomes one or more pytest items:
- One item per
excase - One item for property testing (if
ensuresis set) - One item for the type check per file
pyproject.toml configuration
[tool.pyintent]
require_specs = true # or "all" to also require class/module specs
exclude = ["migrations", "tests"]
exclude entries are glob patterns matched against path segments ("migrations"), relative-path prefixes ("src/migrations"), or filenames ("*_pb2.py"), anchored at the directory containing the pyproject.toml. Both the CLI and pytest --pyintent skip excluded files without importing them. A file you name explicitly on the command line is always checked, even if a pattern matches it.
Safety
pyintent imports the files it checks, and the examples and properties verifiers execute the code under test in the current Python process. That's fine for your own code. But pyintent's whole premise is checking code written by an AI tool, so treat that code as untrusted: review it, or run pyintent verify in a sandbox (container, VM, or restricted user) first.
Status
v0.1. Planned for v0.2: generator/async-generator specs, @overload, instance-method example execution, Liskov enforcement of abstract-method contracts, performance measurement.
Contributing
See CONTRIBUTING.md.
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyintent-0.1.1.tar.gz.
File metadata
- Download URL: pyintent-0.1.1.tar.gz
- Upload date:
- Size: 44.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
50cd391c1ff36b98de52cce0b040c38bf81aa939d5c7f31c10284c2c11f5d18a
|
|
| MD5 |
55f928a7d4309797035b3e13a16409bf
|
|
| BLAKE2b-256 |
799333d9a06c733930f7701387da69dc288459f75076735187341c0e9a24eff2
|
File details
Details for the file pyintent-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pyintent-0.1.1-py3-none-any.whl
- Upload date:
- Size: 40.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3ab715ea19001b473843651da2f06a9cf2d0f93ad42093b6f292e5141b588b8a
|
|
| MD5 |
7b8b7b922316ff5aa6b0ecc7bb612636
|
|
| BLAKE2b-256 |
5809b7159dc6ec4bdc0572c3673642dd3a61c1d8e7d72ee5f97adde4b762d2a4
|