AST-driven constrained decoding: generate random, always-valid Python programs
Project description
AST‑CD
AST-driven constrained decoding — generate random, syntactically valid Python programs.
astcd samples programs from a restricted Python grammar by deciding, at every
step, the exact set of tokens that are legal next — so every program it produces
is syntactically valid, scope-correct (no undefined variables), and bounded in
size by construction. There's no generate-then-reject loop and no Python
source parser: the token stream is a preorder serialization of the AST.
import random
from astcd.vocab import Vocabulary
from astcd.types import DecodeConfig, Start
from astcd.decoder import sample_tokens, weighted_policy, PolicyConfig
from astcd.codec import tokens_to_ast, ast_to_source
from astcd.exec import run
cfg = DecodeConfig()
tok = Vocabulary(cfg)
rng = random.Random(2)
toks = sample_tokens(Start.FUNCTION, rng, cfg, tok, max_len=30,
policy=weighted_policy(PolicyConfig(), tok, rng))
fn = tokens_to_ast(toks, tok)
print(ast_to_source(fn))
print("output:", run(fn, (3, 4)).value)
def func(a, b):
d = a
return ((b * (d + d)) * a)
output: 72
Why
- Constrained decoding for LLM Python AST code generation.
Decoderexposes the exact legal-token mask at each step, so a model can generate only programs inside the supported AST subset. - Random Python code generation for data augmentation. The same grammar, scope, and budget constraints can sample many small, well-formed programs for training data or evaluation corpora.
- No generate-then-reject loop. Every sampled token stream decodes to syntactically valid, scope-safe restricted Python by construction.
- Safe + deterministic by construction. A small direct interpreter (no
exec/compile) runs programs under integer/step caps and returns a structured outcome. Overflow and step-limit cases are reported as data; nothing can import or do I/O.
Install
pip install astcd # core (pure standard library, zero dependencies)
pip install "astcd[gif]" # + Pillow, to export the decode animation as a GIF
The constrained-decoding API
The core object is a Decoder: ask it what's legal, advance with any of those
tokens, repeat. That seam is where a random policy, a weighted prior, or a neural
model all plug in identically.
from astcd.decoder import Decoder
d = Decoder(Start.FUNCTION, cfg, tok, max_len=30)
while not d.is_complete():
legal = d.next_legal_tokens() # the mask: grammar ∩ scope ∩ depth/size guards
d.advance(my_policy(legal)) # pick any legal token (rng / weighted / model)
fn = d.function()
next_legal_tokens() is intersected from four sources: the grammar's FIRST set,
the in-scope variable names (definite assignment), the depth/length guards, and a
min-cost-to-complete check that guarantees the program closes within max_len.
It is never empty until the program is complete.
What it generates
A small, bounded subset of Python over integers:
def func(a, b): # arity 1..N
d = a * b + 1 # assignments to fresh / existing locals
if d < 0: # if / else
return 0 # return is a statement (early returns OK)
return d # ...or fall through to None
Optional features (off by default) via DecodeConfig: for v in range(e): loops
(allow_loops), and the size/shape knobs arities, max_block_stmts,
max_expr_depth, max_control_depth, and the integer value_range /
const_range.
Watch it decode
The most fun way to understand it — animate a program assembling itself one legal
token at a time, holes (<expr>, <stmt>, ...) filling in:
python -m astcd.viz # animate a random function
python -m astcd.viz --seed 7 --delay 0.3
python -m astcd.viz --uniform # flat policy (more complex programs)
python -m astcd.viz --loops --max-control-depth 3
python -m astcd.viz --play # interactive: YOU pick each token from the menu
python -m astcd.viz --gif decode.gif # export the animation
Status
Early and experimental, but the core is complete and tested (pytest):
grammar / vocabulary / scope / interpreter / codec round-trip / decoder soundness /
sampling policies / visualization.
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file astcd-0.1.0.tar.gz.
File metadata
- Download URL: astcd-0.1.0.tar.gz
- Upload date:
- Size: 566.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e25a461cb66ec628def784c3fd1c950c0fc4a436945069f93d1dc04b666eb016
|
|
| MD5 |
695be3f81ba08c77f23898ea711c5900
|
|
| BLAKE2b-256 |
b7b0a9287dd902cb8ee578c77ded4c66ce5e6d4f0606b194d576145621a7a725
|
File details
Details for the file astcd-0.1.0-py3-none-any.whl.
File metadata
- Download URL: astcd-0.1.0-py3-none-any.whl
- Upload date:
- Size: 33.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f43efb5cc4bcedd6ffc92fb8578cf22d86354528e4564482124b64e820d06f63
|
|
| MD5 |
e1a7beb0b96b224a38529b3530ee67e2
|
|
| BLAKE2b-256 |
8c614bece0ddb62e378bb72a83307792f3a14889e3415cdcab8bfbd1dfab6ee3
|