Consist

Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.

Python 3.11+ · BSD 3-Clause License

Consist is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation and enabling post-hoc inspection of results via SQL.

Why Consist?

Multi-run simulation workflows typically accumulate friction:

  • Provenance ambiguity: "Which configuration produced those results in Figure 3?"
  • Redundant computation: Re-running a 4-hour pipeline because you changed one unrelated parameter.
  • Scattered outputs: Finding and comparing results across scenario variants manually.
  • Hidden wiring: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify when something breaks.

Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not framework magic. Your pipeline remains inspectable and testable.


Installation

pip install consist

Optional extras:

pip install "consist[parquet]"
pip install "consist[ingest]"

Note: Consist is pre-1.0. The library is ready for real workflows, but minor releases may still include breaking changes while the API continues to settle.


Quick Example

import consist
from pathlib import Path
from consist import ExecutionOptions, Tracker
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")


def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}


# Executes function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},  # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)

input_binding="paths" keeps the file boundary explicit: the function receives the local Path values named in inputs, while those same inputs still define cache identity and lineage.

Summary: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change anything upstream, only affected downstream steps will re-execute.
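Conceptually, that fingerprint behaves like a content hash. The following stdlib-only sketch illustrates the general idea (it is not Consist's actual implementation): combine a code version, a canonical rendering of the config, and the bytes of each input file into one digest.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(code_version: str, config: dict, inputs: dict[str, Path]) -> str:
    """Illustrative content hash over code version, config, and input file bytes."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    # Canonical JSON so key ordering in the config dict cannot change the hash.
    h.update(json.dumps(config, sort_keys=True).encode())
    for name in sorted(inputs):  # stable iteration order over input names
        h.update(name.encode())
        h.update(inputs[name].read_bytes())
    return h.hexdigest()
```

Identical code, config, and input bytes always yield the same key, so a lookup can stand in for execution; changing any one component yields a new key and forces a re-run.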

Multi-Step Pipeline

Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name matching or injection.

def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean()
    summary.to_frame().to_parquet(out)  # a Series has no to_parquet; convert to a DataFrame first
    return {"analysis": out}


preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

Use output_paths when a function returns None but writes files, or when you need explicit destination control.
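The skip-on-cache-hit behavior from the examples above can be sketched as a toy runner (a hypothetical, stdlib-only illustration; Consist persists this state in DuckDB rather than an in-memory dict):

```python
import hashlib
import json
from typing import Any, Callable

_cache: dict[str, Any] = {}


def run_once(fn: Callable[..., Any], config: dict) -> Any:
    """Execute fn(**config) only if this (fn, config) pair has not run before."""
    key = hashlib.sha256(
        (fn.__name__ + json.dumps(config, sort_keys=True)).encode()
    ).hexdigest()
    if key in _cache:          # cache hit: return the recorded result, skip execution
        return _cache[key]
    result = fn(**config)      # cache miss: execute and record
    _cache[key] = result
    return result
```

A second call with an identical config returns the stored result without invoking the function again; any config change produces a new key and triggers re-execution.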


Key Features

  • Deterministic Caching: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only affected downstream steps re-execute when any upstream piece changes.
  • Plain Python: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is additive and does not restructure your code.
  • Complete Lineage: Every result is tagged with the exact code and config that created it. Trace lineage from any output back to its sources.
  • SQL-Native Analysis: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using standard SQL.
  • HPC and Container Support: Track Docker and Singularity containers as pure functions—image digests and mounted volumes become part of the cache signature. Ideal for long-running jobs on shared compute.
  • Queryable CLI: Inspect history, trace lineage, and compare results from the command line after a job completes. No code required.
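To make the SQL-native idea concrete, here is a hedged sketch of the kind of cross-run comparison this enables. The table name and columns below are hypothetical, and stdlib sqlite3 stands in for DuckDB:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical run-metadata table: one row per tracked execution.
con.execute("CREATE TABLE runs (run_id TEXT, task TEXT, threshold REAL, mean_value REAL)")
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("r1", "clean_data", 0.5, 12.3),
        ("r2", "clean_data", 0.7, 15.1),
    ],
)
# Compare scenario variants of the same task in a single query.
rows = con.execute(
    "SELECT threshold, mean_value FROM runs WHERE task = ? ORDER BY threshold",
    ("clean_data",),
).fetchall()
```

Because metadata lives in a real database rather than log files, this kind of comparison is a query, not a script that walks output directories.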

Documentation Index

  • Getting Started: 5-minute guide to your first tracked run.
  • Usage Guide: detailed patterns for scenarios and complex workflows.
  • Architecture: deep dive into hashing, lineage, and the DuckDB core.
  • CLI Reference: guide to the consist command-line tools.
  • DB Maintenance: operational runbooks for inspect/doctor/purge/merge/rebuild.
  • Example Gallery: interactive notebooks for Monte Carlo, Demand Modeling, and more.

Etymology

In railroad terminology, a consist (noun, pronounced CON-sist) is the specific lineup of locomotives and cars that make up a train. In this library, a consist is the immutable record of exactly which components—code, config, and inputs—were coupled together to produce a result.


Download files

Source Distribution
consist-0.1.0.tar.gz (332.5 kB)

Built Distribution
consist-0.1.0-py3-none-any.whl (374.5 kB)

File details: consist-0.1.0.tar.gz

  • Size: 332.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing: No
  • Uploaded via: twine/6.2.0 CPython/3.12.6
  • SHA256: c97dd42c5fc291d78299299a570e12ff010382c34e8f4688943cfbe89eddab7d
  • MD5: 5d5aea5b1e208250e21aa7744907a2e6
  • BLAKE2b-256: b374b3abcb81f87812e71a3250fee131c6a3a9f5e8ee15e932073ae75d098da2

File details: consist-0.1.0-py3-none-any.whl

  • Size: 374.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing: No
  • Uploaded via: twine/6.2.0 CPython/3.12.6
  • SHA256: 7c7fecf8685d0f554f33c546b20327f76d83c62e1390f5fe8093e9770af6283e
  • MD5: 15be79f5bcedabdd30dd6294e87fefb5
  • BLAKE2b-256: b294c328f761dea1c709f2d032133aa6b6e43a82cab46c716d24e23bb67c9194
