
Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.


Consist

CI | Python 3.11+ | License: BSD 3-Clause

Consist is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation and enabling post-hoc inspection of results via SQL.

Why Consist?

Multi-run simulation workflows typically accumulate friction:

  • Provenance ambiguity: "Which configuration produced those results in Figure 3?"
  • Redundant computation: Re-running a 4-hour pipeline because you changed one unrelated parameter.
  • Scattered outputs: Finding and comparing results across scenario variants manually.
  • Hidden wiring: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify when something breaks.

Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not framework magic. Your pipeline remains inspectable and testable.
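Because tasks are plain functions, they can be unit-tested with no tracker involved. A toy example (hypothetical, not part of Consist's API) makes the point:

```python
# A task is just a function returning named outputs; call it directly in tests.
def double(values: list[int]) -> dict[str, list[int]]:
    return {"doubled": [v * 2 for v in values]}


# No Tracker, no cache: ordinary Python, ordinary assertions.
assert double([1, 2, 3]) == {"doubled": [2, 4, 6]}
```

The same function can later be passed to the tracker unchanged.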


Installation

pip install consist

Optional extras:

pip install "consist[ingest]"

[!NOTE] Consist is pre-1.0. The library is ready for real workflows, but minor releases may still include breaking changes while the API continues to settle.


Quick Example

import consist
from pathlib import Path
from consist import ExecutionOptions, Tracker
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")


def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}


# Executes function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},  # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)

input_binding="paths" keeps the file boundary explicit: the function receives the local Path values named in inputs, while those same inputs still define cache identity and lineage.

Summary: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change anything upstream, only affected downstream steps will re-execute.
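The idea can be sketched in a few lines. This is an illustrative stand-in, not Consist's actual hashing scheme (the real fingerprint is described in the Architecture guide):

```python
import hashlib
import json


def fingerprint(code_version: str, config: dict, input_bytes: bytes) -> str:
    """Toy cache key combining code, config, and input content (illustrative only)."""
    h = hashlib.sha256()
    h.update(code_version.encode())                             # code identity
    h.update(json.dumps(config, sort_keys=True).encode())       # canonicalized config
    h.update(hashlib.sha256(input_bytes).hexdigest().encode())  # input content hash
    return h.hexdigest()


same = fingerprint("clean_data@v1", {"threshold": 0.5}, b"raw")
assert same == fingerprint("clean_data@v1", {"threshold": 0.5}, b"raw")  # cache hit
assert same != fingerprint("clean_data@v1", {"threshold": 0.7}, b"raw")  # config changed
assert same != fingerprint("clean_data@v2", {"threshold": 0.5}, b"raw")  # code changed
```

Any change to code, config, or input bytes yields a new digest, which is why only affected steps re-execute.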

Multi-Step Pipeline

Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name matching or injection.

def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean()
    summary.to_parquet(out)
    return {"analysis": out}


preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

Use output_paths when a function returns None but writes files, or when you need explicit destination control.


Key Features

  • Deterministic Caching: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only affected downstream steps re-execute when any upstream piece changes.
  • Plain Python: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is additive and does not restructure your code.
  • Complete Lineage: Every result is tagged with the exact code and config that created it. Trace lineage from any output back to its sources.
  • SQL-Native Analysis: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using standard SQL.
  • HPC and Container Support: Run tasks in Docker and Singularity containers, with image digests and mounted volumes included in the cache signature. Ideal for long-running jobs on shared compute.
  • Queryable CLI: Inspect history, trace lineage, and compare results from the command line after a job completes. No code required.
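To give a flavor of the SQL-native angle, the sketch below builds a toy provenance table and compares config variants across runs. It uses sqlite3 so it runs anywhere; Consist itself stores metadata in DuckDB, and the table and column names here (a runs table) are hypothetical, not the real schema:

```python
import json
import sqlite3

# Toy stand-in for a provenance store: one row per tracked run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id TEXT, fn_name TEXT, config TEXT, cached INTEGER)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("r1", "clean_data", json.dumps({"threshold": 0.5}), 0),
        ("r2", "clean_data", json.dumps({"threshold": 0.5}), 1),  # cache hit
        ("r3", "clean_data", json.dumps({"threshold": 0.7}), 0),
    ],
)

# Compare variants: how often did each configuration run, and how often from cache?
rows = conn.execute(
    """
    SELECT json_extract(config, '$.threshold') AS threshold,
           COUNT(*) AS n_runs,
           SUM(cached) AS n_cache_hits
    FROM runs
    GROUP BY threshold
    ORDER BY threshold
    """
).fetchall()
print(rows)
```

With DuckDB the same style of query works across every run the tracker has recorded, with no export step in between.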

Documentation Index

  • Getting Started: 5-minute guide to your first tracked run.
  • Usage Guide: detailed patterns for scenarios and complex workflows.
  • Architecture: deep dive into hashing, lineage, and the DuckDB core.
  • CLI Reference: guide to the consist command-line tools.
  • DB Maintenance: operational runbooks for inspect/doctor/purge/merge/rebuild.
  • Example Gallery: interactive notebooks for Monte Carlo, Demand Modeling, etc.

Etymology

In railroad terminology, a consist (noun, pronounced CON-sist) is the specific lineup of locomotives and cars that make up a train. In this library, a consist is the immutable record of exactly which components—code, config, and inputs—were coupled together to produce a result.

