Skip to main content

AST-driven constrained decoding: generate random, always-valid Python programs

Project description

AST‑CD

AST-driven constrained decoding — generate random, syntactically valid Python programs.

A program with a for-loop assembling itself one legal token at a time

astcd samples programs from a restricted Python grammar by deciding, at every step, the exact set of tokens that are legal next — so every program it produces is syntactically valid, scope-correct (no undefined variables), and bounded in size by construction. There's no generate-then-reject loop and no Python source parser: the token stream is a preorder serialization of the AST.

import random
from astcd.vocab import Vocabulary
from astcd.types import DecodeConfig, Start
from astcd.decoder import sample_tokens, weighted_policy, PolicyConfig
from astcd.codec import tokens_to_ast, ast_to_source
from astcd.exec import run

cfg = DecodeConfig()
tok = Vocabulary(cfg)
rng = random.Random(2)

toks = sample_tokens(Start.FUNCTION, rng, cfg, tok, max_len=30,
                     policy=weighted_policy(PolicyConfig(), tok, rng))
fn = tokens_to_ast(toks, tok)
print(ast_to_source(fn))
print("output:", run(fn, (3, 4)).value)
def func(a, b):
    d = a
    return ((b * (d + d)) * a)
output: 72

Why

  • Constrained decoding for LLM Python AST code generation. Decoder exposes the exact legal-token mask at each step, so a model can generate only programs inside the supported AST subset.
  • Random Python code generation for data augmentation. The same grammar, scope, and budget constraints can sample many small, well-formed programs for training data or evaluation corpora.
  • No generate-then-reject loop. Every sampled token stream decodes to syntactically valid, scope-safe restricted Python by construction.
  • Safe + deterministic by construction. A small direct interpreter (no exec/compile) runs programs under integer/step caps and returns a structured outcome. Overflow and step-limit cases are reported as data; nothing can import or do I/O.

Install

pip install astcd            # core (pure standard library, zero dependencies)
pip install "astcd[gif]"     # + Pillow, to export the decode animation as a GIF

The constrained-decoding API

The core object is a Decoder: ask it what's legal, advance with any of those tokens, repeat. That seam is where a random policy, a weighted prior, or a neural model all plug in identically.

from astcd.decoder import Decoder

d = Decoder(Start.FUNCTION, cfg, tok, max_len=30)
while not d.is_complete():
    legal = d.next_legal_tokens()     # the mask: grammar ∩ scope ∩ depth/size guards
    d.advance(my_policy(legal))       # pick any legal token (rng / weighted / model)
fn = d.function()

next_legal_tokens() is intersected from four sources: the grammar's FIRST set, the in-scope variable names (definite assignment), the depth/length guards, and a min-cost-to-complete check that guarantees the program closes within max_len. It is never empty until the program is complete.

What it generates

A small, bounded subset of Python over integers:

def func(a, b):           # arity 1..N
    d = a * b + 1         # assignments to fresh / existing locals
    if d < 0:             # if / else
        return 0          # return is a statement (early returns OK)
    return d              # ...or fall through to None

Optional features (off by default) via DecodeConfig: for v in range(e): loops (allow_loops), and the size/shape knobs arities, max_block_stmts, max_expr_depth, max_control_depth, and the integer value_range / const_range.

Watch it decode

The most fun way to understand it — animate a program assembling itself one legal token at a time, holes (<expr>, <stmt>, ...) filling in:

python -m astcd.viz                     # animate a random function
python -m astcd.viz --seed 7 --delay 0.3
python -m astcd.viz --uniform           # flat policy (more complex programs)
python -m astcd.viz --loops --max-control-depth 3
python -m astcd.viz --play              # interactive: YOU pick each token from the menu
python -m astcd.viz --gif decode.gif    # export the animation

Status

Early and experimental, but the core is complete and tested (pytest): grammar / vocabulary / scope / interpreter / codec round-trip / decoder soundness / sampling policies / visualization.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

astcd-0.1.0.tar.gz (566.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

astcd-0.1.0-py3-none-any.whl (33.4 kB view details)

Uploaded Python 3

File details

Details for the file astcd-0.1.0.tar.gz.

File metadata

  • Download URL: astcd-0.1.0.tar.gz
  • Upload date:
  • Size: 566.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for astcd-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e25a461cb66ec628def784c3fd1c950c0fc4a436945069f93d1dc04b666eb016
MD5 695be3f81ba08c77f23898ea711c5900
BLAKE2b-256 b7b0a9287dd902cb8ee578c77ded4c66ce5e6d4f0606b194d576145621a7a725

See more details on using hashes here.

File details

Details for the file astcd-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: astcd-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.13

File hashes

Hashes for astcd-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f43efb5cc4bcedd6ffc92fb8578cf22d86354528e4564482124b64e820d06f63
MD5 e1a7beb0b96b224a38529b3530ee67e2
BLAKE2b-256 8c614bece0ddb62e378bb72a83307792f3a14889e3415cdcab8bfbd1dfab6ee3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page