
Search for compact executable verifier sets for LLM outputs.

Project description

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

This repository is the implementation for the paper "AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs".

AutoPyVerifier

AutoPyVerifier is a pipeline for searching over deterministic Python verifier bundles for labeled LLM outputs.

Given a development set of (query, model_output, objective) examples and a task description, the system uses an LLM to iteratively:

  1. propose initial verifier bundles,
  2. critique their failures,
  3. refine them into new candidates,
  4. execute each bundle in a restricted sandbox,
  5. search over the DAG, and
  6. select a compact verifier set that best balances score, feasibility, exploration, and bundle size.
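
The loop above can be sketched in a self-contained toy form. Everything here is an illustrative stand-in, not the autopyverifier API: real runs use an LLM for the propose/critique/refine steps and a restricted sandbox for execution (see `search.py` and `execution.py`).

```python
# Toy, self-contained sketch of the propose -> critique -> refine -> select loop.
# All helpers are illustrative stand-ins, not the actual autopyverifier API.

def execute_bundle(bundle, devset):
    """Step 4 stand-in: 'run' a bundle and score agreement with the labels."""
    hits = sum(bundle["predict"](ex["output"]) == ex["objective"] for ex in devset)
    return hits / len(devset)

def toy_search(devset, budget=3):
    # Step 1 stand-in: a trivial seed bundle that accepts every output.
    frontier = [{"name": "seed", "predict": lambda out: 1, "size": 1}]
    graph = []
    for _ in range(budget):
        bundle = frontier.pop(0)                # step 5: traverse candidates
        score = execute_bundle(bundle, devset)  # step 4: sandboxed execution
        graph.append((bundle, score))
        # Steps 2-3 stand-in: "refine" by requiring a keyword seen in positives.
        refined = {
            "name": bundle["name"] + "+kw",
            "predict": lambda out: int("x=2" in out.replace(" ", "")),
            "size": bundle["size"] + 1,
        }
        frontier.append(refined)
    # Step 6 stand-in: highest score, smallest bundle on ties.
    best, _ = max(graph, key=lambda item: (item[1], -item[0]["size"]))
    return best

devset = [
    {"output": "x^2-5x+6=(x-2)(x-3), so x=2 or x=3.", "objective": 1},
    {"output": "The roots are 1 and 6.", "objective": 0},
]
best = toy_search(devset)
```

The real search differs in every interesting way (LLM-generated candidates, a DAG rather than a queue, the feasibility/exploration/size terms below), but the control flow is the same.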

📂 Repository structure

.
├── src/
│   └── autopyverifier/
│       ├── cli.py              # CLI entry point
│       ├── config.py           # search and model configuration dataclasses
│       ├── data.py             # JSONL devset loading utilities
│       ├── execution.py        # verifier parsing, sandboxing, execution
│       ├── metrics.py          # scoring and feasibility metrics
│       ├── models.py           # shared dataclasses
│       ├── prompts.py          # seed / critic / refine / context prompts
│       ├── search.py           # main single-DAG search loop
│       └── llm/
│           ├── base.py
│           ├── mock.py
│           ├── openai_llms.py
│           ├── gemini_llms.py
│           └── claude_llms.py
└── data/
    └── toy/
        ├── devset.jsonl
        └── task_description.txt

โš™๏ธ Requirements

Use Python >= 3.10.18.

Install dependencies:

pip install -r requirements.txt

🔑 API keys

Set only the key for the backend you plan to use. For example:

export OPENAI_API_KEY="..."

📥 Input format

The development set is a JSONL file. Each line should have:

  • id: example identifier
  • query: task input
  • output: model output to verify
  • objective: 1 if the output satisfies the target objective, else 0
  • metadata (optional): extra per-example metadata

Example:

{"id": "m1", "query": "Solve x^2 - 5x + 6 = 0.", "output": "x^2 - 5x + 6 = (x-2)(x-3), so x=2 or x=3.", "objective": 1}
{"id": "m2", "query": "Solve x^2 - 5x + 6 = 0.", "output": "The roots are 1 and 6.", "objective": 0}

The task description is a plain-text file describing:

  • what query and output represent,
  • what objective labels 1 and 0 mean,
  • what kinds of verifier logic are allowed or desired, and
  • what the search should optimize for.
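
To make "verifier logic" concrete, here is the kind of deterministic, pure-Python check a bundle might contain for the toy quadratic task. The function signature and logic are illustrative assumptions; the actual bundle interface is defined by the pipeline (see `execution.py`).

```python
import re

def verify(query: str, output: str) -> int:
    """Toy deterministic verifier: accept the output only if it mentions
    both roots of x^2 - 5x + 6 = 0. Illustrative only; the real bundle
    interface is whatever execution.py parses and runs."""
    numbers = {int(n) for n in re.findall(r"\d+", output)}
    return int({2, 3} <= numbers)
```

A verifier like this is deterministic and cheap to execute, which is what lets the search run every candidate against the whole devset inside a sandbox.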

🚀 Quick Start

To run the toy example, from the project root:

pip install -e .

python -m autopyverifier.cli search \
  --devset data/toy/devset.jsonl \
  --task_description_file data/toy/task_description.txt \
  --llm_backend openai \
  --seed_model gpt-5.4 \
  --critic_model gpt-5.4 \
  --refine_model gpt-5.4 \
  --context_model gpt-5.4 \
  --budget 20 \
  --feasible_coef 0.1 \
  --explore_coef 0.1 \
  --size_coef 0.1 \
  --out_dir results/toy/gpt54

Useful optional flags:

  • --budget: number of search iterations
  • --temperature: sampling temperature passed to the backend
  • --max_output_tokens: output token budget for model calls
  • --beta_pp: feasibility threshold for lower-confidence acceptance precision
  • --beta_np: feasibility threshold for lower-confidence rejection precision
  • --timeout_seconds: per-bundle execution timeout
  • --out_dir: where to write search artifacts
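
One way to picture how `--feasible_coef`, `--explore_coef`, and `--size_coef` interact is as a weighted node utility. This is a hypothetical illustration only; the actual objective and feasibility thresholds are implemented in `metrics.py` and `search.py`.

```python
def node_utility(score, feasibility, novelty, size,
                 feasible_coef=0.1, explore_coef=0.1, size_coef=0.1):
    """Hypothetical combination of the search's competing terms:
    reward accuracy, feasibility, and exploration; penalize bundle size.
    Not the actual objective used by autopyverifier."""
    return (score
            + feasible_coef * feasibility
            + explore_coef * novelty
            - size_coef * size)
```

Under this reading, raising `--size_coef` pushes the search toward more compact verifier sets, while raising `--explore_coef` favors visiting less-explored parts of the DAG.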

๐Ÿ—‚๏ธ What gets written to out_dir

When --out_dir is provided, the search writes:

  • selected_verifier.py: source code for the chosen verifier bundle
  • selected_verifier.json: summary of the chosen verifier
  • graph.json: metadata for all explored nodes
  • summary.json: high-level search summary
  • nodes/*.py: source code for each explored verifier bundle
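
For downstream analysis, the JSON artifacts can be loaded back with a small helper. The file names come from the list above; the fields inside each file depend on the run, so this sketch decodes them without assuming any schema (the demo's placeholder contents are invented for illustration).

```python
import json
import tempfile
from pathlib import Path

def load_search_artifacts(out_dir):
    """Read the artifacts a search run writes to --out_dir.
    Decodes the JSON files as-is and returns the selected bundle's source."""
    out = Path(out_dir)
    artifacts = {
        name: json.loads((out / f"{name}.json").read_text(encoding="utf-8"))
        for name in ("selected_verifier", "graph", "summary")
    }
    artifacts["selected_source"] = (out / "selected_verifier.py").read_text(encoding="utf-8")
    return artifacts

# Demo against a dummy out_dir with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for name in ("selected_verifier", "graph", "summary"):
        (out / f"{name}.json").write_text('{"dummy": true}', encoding="utf-8")
    (out / "selected_verifier.py").write_text(
        "def verify(query, output): return 1\n", encoding="utf-8")
    artifacts = load_search_artifacts(out)
```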

โญ Citation

If you would like to cite our work, the BibTeX entry is:

@article{pezeshkpour2026autopyverifier,
  title={AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs},
  author={Pezeshkpour, Pouya and Hruschka, Estevam},
  year={2026}
}

📜 Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses, which may be CC license and Apache 2.0 license. In the event of conflicts between Megagon Labs, Inc., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software. You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of data sets from known sources by including links to original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. are governed by the respective third party's license conditions. All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur.
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
