# pyconveyor

Deterministic YAML pipeline engine for structured LLM extraction.
pyconveyor lets you build reliable LLM extraction pipelines by declaring them in YAML. It handles prompt rendering, schema validation, self-correcting retries, parallel steps, batch processing, and benchmarking — so your code handles the domain logic, not the plumbing.
## Install

```bash
pip install pyconveyor
```
## A simple pipeline
Start with a single LLM step that extracts structured data from a scientific paper. Declare what you want in YAML — no Python required.
```yaml
# pipeline.yaml
models:
  default:
    provider: openai_compat
    api_key: ${OPENAI_API_KEY}
    model: gpt-4o-mini
    timeout: 120

steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title: str
      authors: list[str]
      key_findings: list[str]
```

```jinja
{# prompts/extract.j2 #}
Extract structured metadata from the following scientific paper.

Paper:
{{ ctx.paper }}

Return a JSON object with:
- "title": the paper title exactly as written
- "authors": list of author names
- "key_findings": up to 5 key findings as short sentences
```

```bash
pyconveyor run pipeline.yaml --input '{"paper": "Deep learning has revolutionized..."}'
```
That's it. pyconveyor calls the model, validates the output matches your schema, and retries automatically if the model returns something that doesn't fit.
## Bootstrapping a project

Use `pyconveyor init` to scaffold a working project in one command:
```bash
pyconveyor init my_project/ --interactive
cd my_project/
export OPENAI_API_KEY=sk-...
pyconveyor run pipeline.yaml --input '{"paper": "..."}'
```
The interactive mode asks what you're extracting, which fields you need, and which provider to use. It generates pipeline.yaml, prompt templates, and editor autocomplete config — ready to run.
```bash
pyconveyor init my_project/                 # static layout with schemas.py
pyconveyor init my_project/ --interactive   # guided setup, inline schema
```
## Rich field descriptions

Add descriptions to your schema fields and they appear automatically in a `{{ schema_hint }}` variable you can place in any prompt. pyconveyor builds a plain-English field listing for you — no more copying field docs between schema and prompt.
```yaml
steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title:
        type: str
        description: "Paper title exactly as written, including subtitle if present."
      authors:
        type: list[str]
        description: "All author names in order. Include affiliation superscripts if present."
      doi:
        type: str | None
        description: "DOI if listed. Null if not found."
        pattern: "^10\\.[0-9]{4,}/.+$"
      publication_year:
        type: int
        description: "Four-digit year of publication."
```
```jinja
{# prompts/extract.j2 #}
Extract structured metadata from the following paper.

{{ schema_hint }}

Paper:
{{ ctx.paper }}
```
The `{{ schema_hint }}` renders as something like:

```
Return a JSON object with the following fields:
- **title** (str, required) — Paper title exactly as written, including subtitle if present.
- **authors** (list[str], required) — All author names in order. Include affiliation superscripts if present.
- **doi** (str | None) — DOI if listed. Null if not found.
- **publication_year** (int, required) — Four-digit year of publication.
```
You can also add `pattern`, `min_length`, `max_length`, `min_items`, and `max_items` constraints. Fields that fail constraints trigger a retry by default; set `on_fail: null` to silently coerce invalid values to `None`, or `on_fail: warn` to log and continue.
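For instance, a list field combining item-count constraints with a softer failure policy might look like this (field and values illustrative):

```yaml
schema:
  key_findings:
    type: list[str]
    description: "Up to 5 key findings as short sentences."
    min_items: 1
    max_items: 5
    on_fail: warn   # log constraint failures and continue instead of retrying
```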
## Multiple steps

Pipelines grow naturally. Each step's result is available to later steps as `{{ steps.name }}`.
```yaml
steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title: str
      abstract: str
      methods: list[str]

  - name: classify
    type: llm
    model: default
    prompt: prompts/classify.j2
    schema:
      field: str
      subfield: str | None
      confidence: float
```
```jinja
{# prompts/classify.j2 #}
Classify this paper into a research field based on its title and abstract.

Title: {{ steps.extract.title }}
Abstract: {{ steps.extract.abstract }}

Return:
- "field": the primary research field (e.g. "materials science", "molecular biology")
- "subfield": more specific subfield if identifiable
- "confidence": your confidence 0.0-1.0
```
Steps run in declaration order. A step can reference any prior step's output. The runner returns a `RunContext` with every step result, attempt logs, and timing.
## Controlled vocabularies

Constrain a field to a known set of terms. pyconveyor normalises fuzzy matches and captures novel values for review.

Define your vocabularies as YAML files in a `vocabularies/` directory:
```yaml
# vocabularies/organism.yaml
known:
  - Escherichia coli
  - Saccharomyces cerevisiae
  - Bacillus subtilis
  - Pseudomonas aeruginosa
  - Staphylococcus aureus
label: organism
growth_policy: auto   # auto-approve close matches
```
Reference them on schema fields by filename:
```yaml
steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      organism:
        type: str
        description: "Primary organism studied."
        vocab: organism   # loads vocabularies/organism.yaml
      strain:
        type: str | None
        description: "Strain designation if reported."
```
Or define a small vocabulary inline — useful for ad-hoc constraints:
```yaml
schema:
  study_type:
    type: str
    description: "Type of study conducted."
    vocab:
      terms:
        - in vitro
        - in vivo
        - in silico
        - clinical trial
        - field study
```
When the model returns "E. coli" instead of "Escherichia coli", pyconveyor normalises it automatically. When it returns a genuinely new organism, the value is captured as a suggestion. The `{{ vocab_hints }}` variable injects known terms into your prompt so the model knows the preferred vocabulary.
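pyconveyor's exact normalisation rules are internal, but the general technique can be sketched with the standard library's `difflib` (a rough illustration, not the library's code; the abbreviation handling and cutoff are assumptions):

```python
from difflib import get_close_matches

KNOWN_ORGANISMS = [
    "Escherichia coli",
    "Saccharomyces cerevisiae",
    "Bacillus subtilis",
]

def normalise(value, known=KNOWN_ORGANISMS, cutoff=0.6):
    """Map a raw model answer onto the closest known term, or None."""
    # Handle abbreviated genus names such as "E. coli" by matching on
    # the species epithet (the last word of each known term).
    for term in known:
        if value.lower().endswith(term.split()[-1].lower()):
            return term
    # Otherwise fall back to fuzzy string matching.
    matches = get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A value no known term comes close to would be the kind of input that pyconveyor captures as a suggestion instead of normalising.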
Review pending suggestions from the CLI:
```bash
pyconveyor vocab review
```
## Self-correcting retries
When a model returns output that doesn't match your schema, pyconveyor feeds the errors back to the model and lets it try again.
```yaml
steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema:
      title: str
      authors: list[str]
      doi:
        type: str | None
        pattern: "^10\\.[0-9]{4,}/.+$"
    max_attempts: 3   # give the model up to 3 tries
```
If the model returns a malformed DOI on the first attempt, the second attempt receives:
```
Your previous response failed schema validation. Here is what you returned:

{"title": "A Study of...", "authors": [...], "doi": "doi:10.1234/abc"}

Validation errors:
- doi: String must match pattern ^10\.[0-9]{4,}/.+$

Please fix these issues and return a corrected JSON object.
```
This works for both schema validation errors and JSON parse errors. You control which error types trigger retries with `retry_on`, cap the feedback size with `max_feedback_tokens`, and provide custom error templates with `error_template`.
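The feedback loop itself is a simple pattern. A minimal stdlib sketch of the idea (not pyconveyor's implementation; the function names are illustrative):

```python
import json

def run_with_retries(call_model, validate, prompt, max_attempts=3):
    """Generic self-correcting loop: feed validation errors back to the model."""
    current = prompt
    for _ in range(max_attempts):
        raw = call_model(current)
        try:
            data = json.loads(raw)
            errors = validate(data)   # list of error strings; empty means valid
        except json.JSONDecodeError as exc:
            data, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Rebuild the prompt with the failed output and its errors appended.
        current = (
            prompt
            + "\n\nYour previous response failed validation:\n" + raw
            + "\nErrors:\n" + "\n".join(f"- {e}" for e in errors)
            + "\nPlease fix these issues and return a corrected JSON object."
        )
    raise ValueError(f"validation failed after {max_attempts} attempts")
```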
## Batch processing

Process hundreds of papers through the same pipeline with configurable parallelism:

```bash
pyconveyor batch pipeline.yaml --input papers.jsonl --output results.jsonl --workers 8
```
Or from Python:

```python
from pyconveyor import BatchRunner

runner = BatchRunner("pipeline.yaml", max_workers=8)
for paper_id, result in runner.run(papers):
    if not result.failed:
        save(result.steps["extract"].value)
```
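The batch input is ordinary JSONL: one JSON object per line, with the same keys your prompts read from `ctx`. One way to prepare it (file name and contents illustrative):

```python
import json

papers = [
    {"paper": "Deep learning has revolutionized..."},
    {"paper": "CRISPR-Cas9 enables precise genome editing..."},
]

# Write one JSON object per line, the shape a JSONL batch input takes.
with open("papers.jsonl", "w", encoding="utf-8") as fh:
    for record in papers:
        fh.write(json.dumps(record) + "\n")
```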
## Benchmarking

Measure extraction accuracy against a set of known-correct cases:

```bash
# Create a benchmark case
mkdir -p benchmarks/paper_001
cat > benchmarks/paper_001/input.yaml << 'EOF'
paper: "Smith et al. (2024) demonstrate that CRISPR-Cas9..."
EOF
cat > benchmarks/paper_001/expected.yaml << 'EOF'
extract:
  title: "CRISPR-Cas9 Applications in Gene Therapy"
  authors: ["J. Smith", "A. Chen", "M. Patel"]
EOF

# Run the benchmark
pyconveyor benchmark benchmarks/ --pipeline pipeline.yaml --report report.html
```
Compare two pipeline versions side by side, get per-field accuracy scores, and generate HTML reports with charts and Mermaid graphs. Supports YAML and JSON benchmark files, large inputs via `$file` references, and PDF export.
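As a mental model, exact-match per-field accuracy boils down to something like this sketch (not pyconveyor's actual scoring code, which may compare values more leniently):

```python
def field_accuracy(expected_cases, actual_cases):
    """Per-field exact-match accuracy across paired benchmark cases."""
    hits, totals = {}, {}
    for expected, actual in zip(expected_cases, actual_cases):
        for field, want in expected.items():
            totals[field] = totals.get(field, 0) + 1
            if actual.get(field) == want:
                hits[field] = hits.get(field, 0) + 1
    # Only fields present in the expected cases are scored.
    return {field: hits.get(field, 0) / totals[field] for field in totals}
```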
## Ensemble — multi-model consensus

Run multiple models in parallel and auto-merge their outputs:
```yaml
steps:
  - name: extract
    type: ensemble
    schema: schemas:PaperMetadata
    prompt: prompts/extract.j2
    members:
      - model: gpt4o
      - model: claude
        required: false   # pipeline continues if this model fails
    judge:
      model: gpt4o        # reviews all outputs, returns merged result
      condition: all_succeeded
```
Member results are accessible individually as `steps.extract.gpt4o` and `steps.extract.claude`. If the judge is skipped or fails, the result of the first member that succeeded is returned.
## Schema files and code reuse

As pipelines grow, you can move your schemas to a `schemas.py` file:
```python
# schemas.py
from pydantic import BaseModel

class PaperMetadata(BaseModel):
    title: str
    authors: list[str]
    doi: str | None
    publication_year: int

class Classification(BaseModel):
    field: str
    subfield: str | None
    confidence: float
```
Reference them in your pipeline:
```yaml
steps:
  - name: extract
    type: llm
    model: default
    prompt: prompts/extract.j2
    schema: schemas:PaperMetadata

  - name: classify
    type: llm
    model: default
    prompt: prompts/classify.j2
    schema: schemas:Classification
```
You can mix inline schemas and Python model references in the same pipeline. Inline schemas are great for getting started; `schemas.py` gives you full Pydantic power when you need cross-field validators, computed properties, or shared model definitions.
## Providers

pyconveyor works with any OpenAI-compatible endpoint. Just change `base_url`:

| Provider | Configuration |
|---|---|
| OpenAI | `provider: openai_compat` |
| Anthropic | `provider: anthropic` + `pip install pyconveyor[anthropic]` |
| Ollama / vLLM / LM Studio | `provider: openai_compat` + `base_url: http://localhost:11434/v1` |
| Custom | `@register_provider("name")` decorator |
| Testing | `provider: mock` — no API calls |
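For instance, pointing the default model at a local Ollama server might look like this (the model name and placeholder key are illustrative):

```yaml
models:
  default:
    provider: openai_compat
    base_url: http://localhost:11434/v1
    api_key: unused        # local servers typically ignore the key
    model: llama3.1
```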
## CLI reference

```
pyconveyor init <dir>                  Bootstrap a new project
pyconveyor init <dir> --interactive    Guided setup — define fields interactively
pyconveyor run <pipeline.yaml>         Run a pipeline
pyconveyor validate <pipeline>         Validate without running
pyconveyor batch <pipeline>            Batch process a JSONL file
pyconveyor benchmark <dir>             Benchmark against golden-standard cases
pyconveyor vocab review                Review pending vocabulary suggestions
pyconveyor schema                      Emit JSON Schema for editor autocomplete
pyconveyor schema infer <pipeline>     Infer schemas.py from sample output
pyconveyor visualise <pipeline>        Print Mermaid DAG diagram
```
## Python API

```python
from pyconveyor import PipelineRunner, BatchRunner, BenchmarkRunner, generate_report

# Single run
runner = PipelineRunner("pipeline.yaml")
result = runner.run({"paper": "..."})
result.failed                          # bool
result.steps["extract"].value          # Pydantic model or dict
result.steps["extract"].last_attempt   # AttemptLog with timing and token counts
result.summary()                       # RunSummary with aggregates

# Batch
batch_runner = BatchRunner("pipeline.yaml", max_workers=8)
for item_id, result in batch_runner.run(records):
    save(result.steps["extract"].value)

# Benchmark
bench = BenchmarkRunner("benchmarks/", pipelines=["pipeline.yaml"])
summary = bench.run()
generate_report(summary, "report.html")
```
## Load-time validation

`PipelineRunner("pipeline.yaml")` validates everything before spending any tokens — model references, schema imports, expressions, step names. Errors include the YAML line number and "did you mean?" suggestions.

```bash
pyconveyor validate pipeline.yaml
# ✓ pipeline.yaml is valid
```
## Versioning policy
The YAML pipeline format is treated as a public API subject to the same semver rules as the Python API. A breaking change to the YAML schema will increment the major version.
## Documentation

Full documentation at pyconveyor.readthedocs.io:

- Quickstart
- Step Types
- Benchmarking
- Vocabulary Fields
- Batch Processing
- Response Caching
- YAML Schema Reference
- CLI Reference
## License

MIT