Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.
Consist is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation and enabling post-hoc inspection of results via SQL.
## Why Consist?
Multi-run simulation workflows typically accumulate friction:
- Provenance ambiguity: "Which configuration produced those results in Figure 3?"
- Redundant computation: Re-running a 4-hour pipeline because you changed one unrelated parameter.
- Scattered outputs: Finding and comparing results across scenario variants manually.
- Hidden wiring: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify when something breaks.
Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not framework magic. Your pipeline remains inspectable and testable.
## Installation

```shell
pip install consist
```

Optional extras:

```shell
pip install "consist[parquet]"
pip install "consist[ingest]"
```
> [!NOTE]
> Consist is pre-1.0. The library is ready for real workflows, but minor releases may still include breaking changes while the API continues to settle.
## Quick Example
```python
import consist
from consist import ExecutionOptions, Tracker
from pathlib import Path
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")

def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}

# Executes the function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},            # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)
```
`input_binding="paths"` keeps the file boundary explicit: the function receives the local `Path` values named in `inputs`, while those same inputs still define cache identity and lineage.
Summary: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change anything upstream, only affected downstream steps will re-execute.
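The fingerprint idea can be sketched with the standard library. This is an illustrative model of content-addressed cache keys, not Consist's internal implementation; the `fingerprint` function and its fields are invented for the example:

```python
import hashlib
import json
from pathlib import Path

def fingerprint(code_version: str, config: dict, input_files: list[Path]) -> str:
    """Illustrative cache key: hash the code version, a canonicalized
    config, and the content of every input file."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    # Canonical JSON so key order in the config dict cannot change the hash
    h.update(json.dumps(config, sort_keys=True).encode())
    for f in sorted(input_files):
        h.update(f.read_bytes())
    return h.hexdigest()
```

Identical code, config, and inputs always produce the same key (a cache hit); changing any one of them yields a new key, which is why only the steps downstream of a change need to re-execute.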
## Multi-Step Pipeline
Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name matching or injection.
```python
def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean()
    summary.to_frame().to_parquet(out)  # to_parquet is a DataFrame method, not a Series method
    return {"analysis": out}

preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
```
Use `output_paths` when a function returns `None` but writes files, or when you need explicit destination control.
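The pattern `output_paths` covers can be sketched without the library: the task writes files as a side effect and returns `None`, so the caller must be told which paths to register as outputs. `run_with_declared_outputs` below is a hypothetical stand-in for what a tracker does, not Consist's API:

```python
from pathlib import Path

def export_report(cleaned: Path, out_dir: Path) -> None:
    # Writes a file as a side effect instead of returning a mapping
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.txt").write_text(f"source: {cleaned.name}\n")

def run_with_declared_outputs(fn, kwargs: dict, declared: list[Path]) -> dict[str, Path]:
    """Hypothetical wrapper: run fn, then verify and collect the
    declared output paths so they can be recorded as artifacts."""
    fn(**kwargs)
    missing = [p for p in declared if not p.exists()]
    if missing:
        raise FileNotFoundError(f"declared outputs not written: {missing}")
    return {p.stem: p for p in declared}
```

The declared paths play the same role as returned paths: they become the artifacts that downstream steps can reference.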
## Key Features
- Deterministic Caching: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only affected downstream steps re-execute when any upstream piece changes.
- Plain Python: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is additive and does not restructure your code.
- Complete Lineage: Every result is tagged with the exact code and config that created it. Trace lineage from any output back to its sources.
- SQL-Native Analysis: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using standard SQL.
- HPC and Container Support: Track Docker and Singularity containers as pure functions—image digests and mounted volumes become part of the cache signature. Ideal for long-running jobs on shared compute.
- Queryable CLI: Inspect history, trace lineage, and compare results from the command line after a job completes. No code required.
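The SQL-native angle can be illustrated with a toy run-metadata table. `sqlite3` from the standard library stands in for DuckDB here, and the schema is invented for the example; Consist's actual tables will differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE runs (
        run_id TEXT, task TEXT, config_json TEXT,
        cache_hit INTEGER, duration_s REAL
    )
""")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "clean_data", '{"threshold": 0.5}', 0, 240.0),
        ("r2", "clean_data", '{"threshold": 0.5}', 1, 0.1),
        ("r3", "clean_data", '{"threshold": 0.7}', 0, 251.3),
    ],
)
# Compare scenario variants across runs with plain SQL
rows = conn.execute(
    "SELECT config_json, COUNT(*), SUM(cache_hit) FROM runs "
    "GROUP BY config_json ORDER BY config_json"
).fetchall()
```

Because the metadata lives in an ordinary database, questions like "which configuration produced those results?" become a `SELECT` rather than a spelunking session through output directories.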
## Documentation Index
| Section | Description |
|---|---|
| Getting Started | 5-minute guide to your first tracked run. |
| Usage Guide | Detailed patterns for scenarios and complex workflows. |
| Architecture | Deep dive into hashing, lineage, and the DuckDB core. |
| CLI Reference | Guide to the consist command-line tools. |
| DB Maintenance | Operational runbooks for inspect/doctor/purge/merge/rebuild. |
| Example Gallery | Interactive notebooks for Monte Carlo, Demand Modeling, etc. |
## Etymology
In railroad terminology, a consist (noun, pronounced CON-sist) is the specific lineup of locomotives and cars that make up a train. In this library, a consist is the immutable record of exactly which components—code, config, and inputs—were coupled together to produce a result.