
Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.


Consist

CI | Python 3.11+ | License: BSD 3-Clause

Consist is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation and enabling post-hoc inspection of results via SQL.

Why Consist?

Multi-run simulation workflows typically accumulate friction:

  • Provenance ambiguity: "Which configuration produced those results in Figure 3?"
  • Redundant computation: Re-running a 4-hour pipeline because you changed one unrelated parameter.
  • Scattered outputs: Finding and comparing results across scenario variants manually.
  • Hidden wiring: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify when something breaks.

Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not framework magic. Your pipeline remains inspectable and testable.


Installation

pip install consist

Optional extras:

pip install "consist[ingest]"

[!NOTE] Consist is pre-1.0. The library is ready for real workflows, but minor releases may still include breaking changes while the API continues to settle.


Quick Example

import consist
from pathlib import Path
from consist import ExecutionOptions, Tracker
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")


def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}


# Executes function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},  # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)

input_binding="paths" keeps the file boundary explicit: the function receives the local Path values named in inputs, while those same inputs still define cache identity and lineage.

When a path-bound step needs inputs at specific local destinations, request staging through ExecutionOptions instead of manually copying files:

result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(
        input_binding="paths",
        input_materialization="requested",
        input_paths={"raw": Path("./workspace/raw.parquet")},
    ),
)

That keeps artifact identity canonical while ensuring the callable sees a real local file at the requested path, including on cache hits.

Summary: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change anything upstream, only affected downstream steps will re-execute.
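To make the idea concrete, here is a minimal, illustrative sketch of how such a fingerprint could be derived. This is not Consist's internal algorithm, just the general technique: hash the code version, a canonicalized form of the config, and the content of each input file, then fold those digests into one cache key.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(code_version: str, config: dict, input_files: dict[str, Path]) -> str:
    """Illustrative only: derive one deterministic digest from code,
    config, and input-file contents, the way a cache key could be built."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    # sort_keys makes the config hash independent of dict insertion order
    h.update(json.dumps(config, sort_keys=True).encode())
    for name in sorted(input_files):  # sorted so input order never matters
        h.update(name.encode())
        h.update(hashlib.sha256(input_files[name].read_bytes()).digest())
    return h.hexdigest()
```

Re-running with identical code, config, and file contents yields the same digest (a cache hit); changing any one of them yields a new digest, which is why only affected steps re-execute.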

Multi-Step Pipeline

Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name matching or injection.

def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean().reset_index()
    summary.to_parquet(out)  # reset_index: groupby returns a Series, which has no to_parquet
    return {"analysis": out}


preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

Use output_paths when a function returns None but writes files, or when you need explicit destination control.
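As a sketch of that pattern, a side-effecting step that writes a file and returns None might look like the following. The placement of output_paths here is an assumption for illustration; consult the Usage Guide for the exact signature.

```python
def export_report(analysis: Path) -> None:
    df = pd.read_parquet(analysis)
    # Writes directly to disk and returns nothing
    df.to_csv("./report.csv")


result = tracker.run(
    fn=export_report,
    inputs={"analysis": consist.ref(analyze, key="analysis")},
    outputs=["report"],
    output_paths={"report": Path("./report.csv")},  # hypothetical placement; see docs
    execution_options=ExecutionOptions(input_binding="paths"),
)
```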


Key Features

  • Deterministic Caching: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only affected downstream steps re-execute when any upstream piece changes.
  • Plain Python: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is additive and does not restructure your code.
  • Explicit File Workflows: Path-bound steps can request staged local inputs without giving up canonical artifact identity, which is useful for subprocesses, external tools, and workspace-local contracts.
  • Complete Lineage: Every result is tagged with the exact code and config that created it. Trace lineage from any output back to its sources.
  • SQL-Native Analysis: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using standard SQL.
  • HPC and Container Support: Run tasks in Docker and Singularity containers, with image digests and mounted volumes included in the cache signature. Ideal for long-running jobs on shared compute.
  • Queryable CLI: Inspect history, trace lineage, and compare results from the command line after a job completes. No code required.
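Because all metadata lives in an ordinary DuckDB file, any DuckDB client can query it directly. The table and column names below are illustrative only (the real schema is documented in the Architecture guide), but a provenance query might look like:

```sql
-- Hypothetical schema: adjust table and column names to the actual one.
SELECT r.run_id, r.fn_name, r.config, a.key, a.path
FROM runs AS r
JOIN artifacts AS a ON a.run_id = r.run_id
WHERE a.key = 'cleaned'
ORDER BY r.created_at DESC;
```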

Documentation Index

  • Getting Started: 5-minute guide to your first tracked run.
  • Usage Guide: Detailed patterns for scenarios and complex workflows.
  • Architecture: Deep dive into hashing, lineage, and the DuckDB core.
  • CLI Reference: Guide to the consist command-line tools.
  • DB Maintenance: Operational runbooks for inspect/doctor/purge/merge/rebuild.
  • Example Gallery: Interactive notebooks for Monte Carlo, Demand Modeling, etc.

Etymology

In railroad terminology, a consist (noun, pronounced CON-sist) is the specific lineup of locomotives and cars that make up a train. In this library, a consist is the immutable record of exactly which components—code, config, and inputs—were coupled together to produce a result.
