
Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.


Consist

CI | Python 3.11+ | License: BSD 3-Clause

Consist is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation and enabling post-hoc inspection of results via SQL.

Why Consist?

Multi-run simulation workflows typically accumulate friction:

  • Provenance ambiguity: "Which configuration produced those results in Figure 3?"
  • Redundant computation: Re-running a 4-hour pipeline because you changed one unrelated parameter.
  • Scattered outputs: Finding and comparing results across scenario variants manually.
  • Hidden wiring: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify when something breaks.

Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not framework magic. Your pipeline remains inspectable and testable.


Installation

pip install consist

Optional extras:

pip install "consist[ingest]"

[!NOTE] Consist is pre-1.0. The library is ready for real workflows, but minor releases may still include breaking changes while the API continues to settle.


Quick Example

import consist
from pathlib import Path
from consist import ExecutionOptions, Tracker
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")


def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}


# Executes function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},  # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)

input_binding="paths" keeps the file boundary explicit: the function receives the local Path values named in inputs, while those same inputs still define cache identity and lineage.

When a path-bound step needs inputs at specific local destinations, request staging through ExecutionOptions instead of manually copying files:

result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(
        input_binding="paths",
        input_materialization="requested",
        input_paths={"raw": Path("./workspace/raw.parquet")},
    ),
)

That keeps artifact identity canonical while ensuring the callable sees a real local file at the requested path, including on cache hits.

Summary: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change anything upstream, only affected downstream steps will re-execute.
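To make the idea concrete, here is a minimal, illustrative sketch of how such a fingerprint could be derived. This is not Consist's internal algorithm, just the general technique: hash the code version, a canonicalized form of the config, and the content of each input file, then fold those digests into one cache key.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(code_version: str, config: dict, input_files: dict[str, Path]) -> str:
    """Illustrative only: derive one deterministic digest from code,
    config, and input-file contents, the way a cache key could be built."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    # sort_keys makes the config hash independent of dict insertion order
    h.update(json.dumps(config, sort_keys=True).encode())
    for name in sorted(input_files):  # sorted so input order never matters
        h.update(name.encode())
        h.update(hashlib.sha256(input_files[name].read_bytes()).digest())
    return h.hexdigest()
```

Re-running with identical code, config, and file contents yields the same digest (a cache hit); changing any one of them yields a new digest, which is why only affected steps re-execute.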

Multi-Step Pipeline

Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name matching or injection.

def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean().reset_index()
    summary.to_parquet(out)  # reset_index: groupby returns a Series, which has no to_parquet
    return {"analysis": out}


preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

Use output_paths when a function returns None but writes files, or when you need explicit destination control.
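As a sketch of that pattern, a side-effecting step that writes a file and returns None might look like the following. The placement of output_paths here is an assumption for illustration; consult the Usage Guide for the exact signature.

```python
def export_report(analysis: Path) -> None:
    df = pd.read_parquet(analysis)
    # Writes directly to disk and returns nothing
    df.to_csv("./report.csv")


result = tracker.run(
    fn=export_report,
    inputs={"analysis": consist.ref(analyze, key="analysis")},
    outputs=["report"],
    output_paths={"report": Path("./report.csv")},  # hypothetical placement; see docs
    execution_options=ExecutionOptions(input_binding="paths"),
)
```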


Key Features

  • Deterministic Caching: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only affected downstream steps re-execute when any upstream piece changes.
  • Plain Python: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is additive and does not restructure your code.
  • Explicit File Workflows: Path-bound steps can request staged local inputs without giving up canonical artifact identity, which is useful for subprocesses, external tools, and workspace-local contracts.
  • Complete Lineage: Every result is tagged with the exact code and config that created it. Trace lineage from any output back to its sources.
  • SQL-Native Analysis: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using standard SQL.
  • HPC and Container Support: Run tasks in Docker and Singularity containers, with image digests and mounted volumes included in the cache signature. Ideal for long-running jobs on shared compute.
  • Queryable CLI: Inspect history, trace lineage, and compare results from the command line after a job completes. No code required.
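Because all metadata lives in an ordinary DuckDB file, any DuckDB client can query it directly. The table and column names below are illustrative only (the real schema is documented in the Architecture guide), but a provenance query might look like:

```sql
-- Hypothetical schema: adjust table and column names to the actual one.
SELECT r.run_id, r.fn_name, r.config, a.key, a.path
FROM runs AS r
JOIN artifacts AS a ON a.run_id = r.run_id
WHERE a.key = 'cleaned'
ORDER BY r.created_at DESC;
```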

Documentation Index

  • Getting Started: 5-minute guide to your first tracked run.
  • Usage Guide: Detailed patterns for scenarios and complex workflows.
  • Architecture: Deep dive into hashing, lineage, and the DuckDB core.
  • CLI Reference: Guide to the consist command-line tools.
  • DB Maintenance: Operational runbooks for inspect/doctor/purge/merge/rebuild.
  • Example Gallery: Interactive notebooks for Monte Carlo, Demand Modeling, etc.

Etymology

In railroad terminology, a consist (noun, pronounced CON-sist) is the specific lineup of locomotives and cars that make up a train. In this library, a consist is the immutable record of exactly which components—code, config, and inputs—were coupled together to produce a result.
