Open-source provenance SDK and specification for verifiable EO and climate data workflows

trazaeo

trazaeo is a Python-first package for adding verifiable provenance to Earth observation and climate data workflows. It gives you fast content hashing, signed provenance envelopes, and workflow helpers you can drop into an existing pipeline without replacing your current scheduler, storage layer, or transform code.

Use it when you want to:

  • hash outputs from an existing batch or streaming job
  • attach provenance to a dataset publish step
  • verify that a delivered artifact still matches the published record
  • add provenance checks around netCDF, Zarr, or Icechunk workflows

Install

For most users:

pip install trazaeo

If you also want the optional netCDF, xarray, and Zarr helpers used by the example workflows:

pip install 'trazaeo[python-examples]'

Published wheels are built as CPython abi3 artifacts against Python 3.12, so one wheel per platform covers CPython 3.12 and later on the supported platforms below. If no prebuilt wheel is available for your platform, pip falls back to building from source.

Published wheel contract

The package metadata supports Python 3.12+.

The verified published wheel matrix is:

  • CPython abi3 built from Python 3.12
  • Linux manylinux_2_28_x86_64
  • macOS x86_64
  • macOS arm64
  • Windows x86_64

Each wheel is verified with these smoke checks:

  • import trazaeo
  • import trazaeo_workflows.dataset_provenance
  • from trazaeo import PublicRpcSolanaProofLogAdaptor
  • trazaeo-icechunk --help
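
To predict whether pip is likely to find a prebuilt wheel, you can compare the running interpreter against the matrix above. The following is a minimal sketch using only the standard library. The helper name likely_has_prebuilt_wheel is hypothetical and not part of trazaeo, and the check ignores the glibc 2.28+ requirement on Linux and whether the interpreter is CPython.

```python
import platform
import sys

# (system, machine) pairs from the wheel matrix above, spelled the way
# platform.system() and platform.machine() usually report them.
WHEEL_PLATFORMS = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),
    ("Darwin", "arm64"),
    ("Windows", "AMD64"),
}


def likely_has_prebuilt_wheel(
    system: str, machine: str, version: tuple[int, int]
) -> bool:
    """Return True when platform and Python version fall inside the matrix.

    Hypothetical helper: it does not check glibc or the interpreter
    implementation, so treat the answer as a hint only.
    """
    return version >= (3, 12) and (system, machine) in WHEEL_PLATFORMS


if __name__ == "__main__":
    ok = likely_has_prebuilt_wheel(
        platform.system(), platform.machine(), sys.version_info[:2]
    )
    print("prebuilt wheel expected" if ok else "expect a source build")
```

If the check reports a source build, the fallback section below lists the toolchain you need first.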

Source-build fallback

If you install on a platform outside that wheel matrix, pip will build from source. In Debian/Ubuntu-style environments, install a C/Rust build toolchain and Python development headers for the interpreter you are using:

apt-get update
apt-get install -y build-essential curl pkg-config python3-dev
curl https://sh.rustup.rs -sSf | sh -s -- -y

Then restart the shell and rerun:

pip install trazaeo
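
Before retrying a source build, it can help to confirm that the toolchain is actually on PATH. The sketch below assumes the executables provided by the packages above are named cc, pkg-config, cargo, and rustc, and that the Python headers live in the interpreter's include directory. The helper missing_build_prerequisites is hypothetical and not part of trazaeo.

```python
import shutil
import sysconfig
from pathlib import Path
from typing import Callable, Optional

# Executables the apt-get and rustup steps are expected to provide (assumed names).
REQUIRED_TOOLS = ("cc", "pkg-config", "cargo", "rustc")


def missing_build_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names of build prerequisites that could not be found."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]

    # python3-dev installs Python.h into the interpreter's include directory.
    include_dir = sysconfig.get_paths().get("include")
    if not include_dir or not (Path(include_dir) / "Python.h").is_file():
        missing.append("Python.h")
    return missing


if __name__ == "__main__":
    missing = missing_build_prerequisites()
    print("ready to build" if not missing else f"missing: {', '.join(missing)}")
```

The which parameter exists only so the check can be exercised without touching the real PATH.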

Use It Inside Your Existing Pipeline

trazaeo is designed to sit at the boundaries of work you already do.

Typical places to add it:

  • after a transform job writes a file, hash the artifact and store the content root with your job metadata
  • before publishing a dataset, build and sign provenance for the output and its source inputs
  • during delivery or audit, verify that the local artifact still matches the published checkpoint

You do not need to adopt a new pipeline framework. The package works well as:

  • a Python helper inside an Airflow, Prefect, Dagster, or Argo task
  • a provenance step called from an existing batch job or notebook
  • a verification step in a release or data publication workflow

Quick Start

The normal integration point is the Python API. A common first step is to hash an artifact right after your pipeline writes it:

from trazaeo import blake3_content_root


def register_pipeline_output(path: str) -> dict[str, str]:
    content_root = blake3_content_root(path, 4096, 4).hex()
    return {
        "artifact_path": path,
        "content_root_hash": content_root,
    }

That works well in an Airflow task, a Prefect flow, a Dagster asset, or a plain Python batch job. You keep your existing transform code and add one provenance step after the file is produced.
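
One way to add that step without editing the transform itself is a small decorator. This is a sketch, not a trazaeo feature: with_content_root and stand_in_hasher are hypothetical names, and the stand-in uses the standard library's BLAKE2b only so the example runs without trazaeo installed.

```python
import functools
import hashlib
from typing import Callable


def with_content_root(hash_artifact: Callable[[str], str]):
    """Wrap a transform that returns an output path so it also returns a hash."""

    def decorator(transform: Callable[..., str]) -> Callable[..., dict[str, str]]:
        @functools.wraps(transform)
        def wrapper(*args, **kwargs) -> dict[str, str]:
            path = transform(*args, **kwargs)  # existing transform, unchanged
            return {
                "artifact_path": path,
                "content_root_hash": hash_artifact(path),
            }

        return wrapper

    return decorator


def stand_in_hasher(path: str) -> str:
    # Stand-in only (stdlib BLAKE2b), NOT trazaeo's content root.
    with open(path, "rb") as handle:
        return hashlib.blake2b(handle.read(), digest_size=32).hexdigest()
```

In a real pipeline you would pass a callable built on the call shown above, for example lambda p: blake3_content_root(p, 4096, 4).hex(), in place of stand_in_hasher.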

For in-memory content:

from trazaeo import blake3_hash, blake3_hash_mt

single = blake3_hash(b"hello world").hex()
parallel = blake3_hash_mt(b"hello world", 4).hex()

Artifact Verification In Process

If your pipeline publishes an artifact and later needs to verify what was delivered, you can build a proof package for the local file:

from trazaeo_workflows import build_local_artifact_full_root_proof_package


def build_local_artifact_proof(path: str) -> dict:
    return build_local_artifact_full_root_proof_package(
        path,
        chunk_size=1 << 20,
        threads=4,
    )

And when you already have a delivery proof package from an upstream publish step, verify it against the artifact path:

from trazaeo_workflows import verify_dataset_delivery_proof_report


def verify_delivery(path: str, delivery_proof_package: dict) -> dict:
    return verify_dataset_delivery_proof_report(
        delivery_proof_package,
        artifact_path=path,
    )

This fits naturally in a downstream validation, QA, or publication check step.
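
Conceptually, verification means recomputing a digest from the delivered bytes and comparing it with the published value. The sketch below illustrates only that idea with the standard library. It uses BLAKE2b as a stand-in, and it is not trazaeo's content-root algorithm or proof format. The function names are hypothetical.

```python
import hashlib
import hmac


def file_digest_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts are never fully in memory."""
    digest = hashlib.blake2b(digest_size=32)  # stand-in hash, not BLAKE3
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, published_hex: str) -> bool:
    """Compare the recomputed digest with the published one.

    hmac.compare_digest avoids leaking where the two strings differ.
    """
    return hmac.compare_digest(file_digest_hex(path), published_hex.lower())
```

A downstream QA step can fail the job when matches_published_digest returns False. With trazaeo, verify_dataset_delivery_proof_report plays that role and checks the signed proof package as well as the bytes.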

Dataset Publish Workflows

The higher-level trazaeo_workflows helpers are for pipelines that already track their source files, transform job ids, output artifact refs, signer ids, and verification policy. In that case, you pass your existing metadata into trazaeo and let it build the provenance bundle around work your pipeline already performed.

The main Python workflow entrypoints are:

  • trazaeo_workflows.build_dataset_bootstrap_bundle
  • trazaeo_workflows.build_dataset_incremental_bundle
  • trazaeo_workflows.build_dataset_delivery_proof_package
  • trazaeo_workflows.verify_dataset_delivery_proof_report

Those helpers are used by the example netCDF and Icechunk flows in examples/python_netcdf/.

A typical pattern is:

  1. Your pipeline reads or transforms source files.
  2. Your pipeline writes the dataset artifact.
  3. You hash the artifact with trazaeo.
  4. You pass the source metadata, output metadata, signer, and trust policy into a dataset workflow helper.
  5. You store or publish the returned provenance bundle beside the dataset.
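
The steps above can be sketched as a single publish function. Because the argument lists of the dataset workflow helpers are not shown here, the sketch takes them as injected callables: hash_artifact and build_bundle are hypothetical adapters you would write around blake3_content_root and a helper such as build_dataset_bootstrap_bundle. The request dictionary shape and the .provenance.json suffix are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Any, Callable


def publish_with_provenance(
    artifact_path: str,
    source_metadata: list[dict[str, Any]],
    hash_artifact: Callable[[str], str],
    build_bundle: Callable[[dict[str, Any]], dict[str, Any]],
) -> Path:
    """Hash an existing artifact, build a bundle, and store it beside the data."""
    # Steps 1-2 happened upstream: the artifact already exists on disk.

    # Step 3: hash the artifact.
    content_root = hash_artifact(artifact_path)

    # Step 4: hand source and output metadata to the workflow helper.
    # The adapter is expected to supply the signer and trust policy itself.
    request = {
        "sources": source_metadata,
        "output": {
            "artifact_path": artifact_path,
            "content_root_hash": content_root,
        },
    }
    bundle = build_bundle(request)

    # Step 5: store the returned bundle beside the dataset.
    bundle_path = Path(artifact_path + ".provenance.json")
    bundle_path.write_text(
        json.dumps(bundle, indent=2, sort_keys=True), encoding="utf-8"
    )
    return bundle_path
```

Keeping the trazaeo calls behind two small adapters means the publish step can be unit-tested with stubs, and the adapters are the only place that changes if a helper signature changes.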

Documentation

  • Project docs: https://endcorp-hq.github.io/provenance
  • Python workflow examples: examples/python_netcdf/README.md
  • Protocol spec: TRAZAEO_V1_SPEC.md
  • Architecture boundary: docs/contracts/architecture.md
  • Quality gates: docs/contracts/quality-gates.md
  • Rust crate overview: trazaeo/README.md

Development

Most users only need pip install trazaeo. If you are contributing to this repository, see CONTRIBUTING.md for local build, test, extension, and docs workflows.
