cf_datahive

Canonical result and measurement data storage APIs for Cogniflow.

cf_datahive is the Data Hive package boundary: Python-facing APIs and tooling around the canonical data hive root (workspace/data_hive).

Boundary (Current Phase)

  • Python package role (sandcastle/cf_datahive): read-oriented API/tooling/validation for pipeline-facing workflows.
  • Native role (sandcastle/cf_datahive/src/cf_datahive/cpp): write gatekeeper and only allowed writer under workspace/data_hive.
  • Step packages must stay thin wrappers and call the native gatekeeper instead of implementing filesystem/parquet helpers.
  • Downstream first-party native consumers must discover the packaged gatekeeper consumer surface through the owner package API instead of repo-relative path reach-in.

Development workflow

  • The current repo bootstrap path is PyPI-first via scripts/fresh_install_v2.ps1; pass -EditableModules there when you need local, editable package work.
  • The package can now be built and published independently without changing the read/write ownership boundary above.

Canonical layout

workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt
  • latest.txt stores the committed run_id and is updated atomically.
  • manifest.json is the SOT for run metadata, table metadata, file hashes, and artifact hashes.
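The atomic latest.txt update mentioned above follows the standard write-temp-then-rename pattern. The actual write path is owned by the native gatekeeper; the sketch below is a Python illustration of the same idea, and commit_latest is a hypothetical helper, not part of the package API.

```python
import os
import tempfile
from pathlib import Path

def commit_latest(pipeline_dir: Path, run_id: str) -> None:
    """Atomically update latest.txt: write to a temp file in the same
    directory, then os.replace() it over the target. os.replace is an
    atomic rename on POSIX and on Windows for same-volume paths, so
    readers never observe a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=pipeline_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(run_id + "\n")
        os.replace(tmp_name, pipeline_dir / "latest.txt")
    except BaseException:
        os.unlink(tmp_name)
        raise

# Usage against a throwaway pipeline directory
pipeline_dir = Path(tempfile.mkdtemp()) / "opcua_fifo_avg"
pipeline_dir.mkdir(parents=True)
commit_latest(pipeline_dir, "run-0001")
print((pipeline_dir / "latest.txt").read_text().strip())  # run-0001
```

The temp file is created in the same directory as latest.txt so the final rename never crosses filesystems.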

Usage

from pathlib import Path

from cf_datahive import (
    DataHiveClient,
    cf_datahive_cpp_consumer_cmake_path,
    cf_datahive_cpp_import_library_path,
    cf_datahive_cpp_include_path,
)

workspace_root = Path("workspace")
client = DataHiveClient(str(workspace_root))

runs = client.list_runs("opcua_fifo_avg")
if runs:
    latest = runs[0].run_id
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)
    print(cf_datahive_cpp_include_path())
    print(cf_datahive_cpp_import_library_path())
    print(cf_datahive_cpp_consumer_cmake_path())

Native owner API:

  • cf_datahive_cpp_include_path() returns the packaged include root for the native gatekeeper.
  • cf_datahive_cpp_library_path() returns the packaged runtime library path.
  • cf_datahive_cpp_import_library_path() returns the packaged link artifact path that first-party native consumers link against.
  • cf_datahive_cpp_runtime_dir() returns the packaged runtime directory to stage alongside native consumers.
  • cf_datahive_cpp_consumer_cmake_path() returns the owner-provided CMake helper for downstream native consumers that need target import plus runtime staging without re-encoding backend policy.
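In plain terms, "runtime staging" means copying the packaged runtime files next to each consumer binary. The sketch below shows that idea in Python with throwaway directories; stage_runtime is a hypothetical helper (the real staging is done by the owner-provided CMake helper), and in a real consumer the source directory would come from cf_datahive_cpp_runtime_dir().

```python
import shutil
import tempfile
from pathlib import Path

def stage_runtime(runtime_dir: Path, destinations: list) -> None:
    """Copy every file from the packaged runtime directory into each
    consumer binary directory, creating destinations as needed."""
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        for item in runtime_dir.iterdir():
            if item.is_file():
                shutil.copy2(item, dest / item.name)

# Usage with stand-in files (a real run would stage the packaged DLLs)
runtime_dir = Path(tempfile.mkdtemp())
(runtime_dir / "cf_datahive_cpp.dll").write_bytes(b"\x00")
dest = Path(tempfile.mkdtemp()) / "bin"
stage_runtime(runtime_dir, [dest])
print(sorted(p.name for p in dest.iterdir()))  # ['cf_datahive_cpp.dll']
```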

Native consumer ownership

cf_datahive owns the backend-specific native build, packaging, and runtime policy for cf_datahive_cpp. First-party native consumers should link against that packaged owner surface instead of embedding cf_datahive_cpp sources or carrying their own DuckDB rules.

Typical consumer pattern:

execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_include_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_INCLUDE_DIR
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_import_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

include("${Python3_SITEARCH}/cf_datahive/native/cmake/cf_datahive_consumer.cmake")

cf_datahive_import_cpp_target(
  TARGET cf_datahive_cpp
  INCLUDE_DIR "${CF_DATAHIVE_CPP_INCLUDE_DIR}"
  LIBRARY_PATH "${CF_DATAHIVE_CPP_LIBRARY_PATH}"
  IMPORT_LIBRARY_PATH "${CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH}"
)

cf_datahive_stage_consumer_runtime(
  TARGET my_step_plugin
  RUNTIME_DIR "${Python3_SITEARCH}/cf_datahive/native/bin"
  DESTINATIONS "${CMAKE_CURRENT_SOURCE_DIR}/../bin" "${SKBUILD_PLATLIB_DIR}/my_step_package/bin"
)

DuckDB configuration remains owner-controlled under cf_datahive and moves out of consumer workflows:

  • default mode is static
  • shared mode can be selected with CF_DATAHIVE_CPP_DUCKDB_LINKAGE=shared
  • owner-supported override vars are CF_DATAHIVE_CPP_DUCKDB_INCLUDE, CF_DATAHIVE_CPP_DUCKDB_LIB, CF_DATAHIVE_CPP_DUCKDB_SOURCE, and on Windows CF_DATAHIVE_CPP_DUCKDB_DLL
  • the cf_datahive build/publish workflow is responsible for staging those owner dependencies before packaging the native consumer surface
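The linkage selection above amounts to reading one environment variable with a static default. The helper below is a hypothetical illustration of that policy (the real resolution lives in the owner build scripts, not in this package's Python API):

```python
def resolve_duckdb_linkage(env: dict) -> str:
    """Resolve the owner-controlled DuckDB linkage mode: 'static' by
    default, 'shared' only via the owner-supported override variable."""
    mode = env.get("CF_DATAHIVE_CPP_DUCKDB_LINKAGE", "static").lower()
    if mode not in ("static", "shared"):
        raise ValueError(f"unsupported CF_DATAHIVE_CPP_DUCKDB_LINKAGE: {mode!r}")
    return mode

print(resolve_duckdb_linkage({}))                                            # static
print(resolve_duckdb_linkage({"CF_DATAHIVE_CPP_DUCKDB_LINKAGE": "shared"}))  # shared
```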

Manifest details

Each run stores a RunManifest (schema_version="1.0") with:

  • run lifecycle fields (status: staged|committed|aborted)
  • table entries (parquet, schema fingerprint, row/file counts, optional file hashes)
  • artifact entries (sha256, media type, size)
  • optional semantic_refs placeholder map for future ontology links

The schema fingerprint is the SHA-256 of the Arrow schema's serialized bytes.
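The sketch below shows a manifest shaped like the field list above together with the fingerprint rule. The field names here are illustrative guesses, not the exact RunManifest schema, and the real fingerprint hashes the bytes produced by Arrow schema serialization (e.g. pyarrow's schema.serialize()); stand-in bytes keep the example dependency-free.

```python
import hashlib

# Illustrative manifest shape (field names are assumptions, not the
# authoritative RunManifest schema).
manifest = {
    "schema_version": "1.0",
    "status": "committed",  # staged | committed | aborted
    "tables": {
        "measurements": {
            "format": "parquet",
            "schema_fingerprint": None,  # filled in below
            "row_count": 2,
            "file_count": 1,
        }
    },
    "artifacts": {},
    "semantic_refs": {},  # placeholder for future ontology links
}

# Fingerprint rule: sha256 over the Arrow schema's serialized bytes.
schema_bytes = b"stand-in-for-arrow-schema-serialization"
fingerprint = hashlib.sha256(schema_bytes).hexdigest()
manifest["tables"]["measurements"]["schema_fingerprint"] = fingerprint

assert manifest["status"] in ("staged", "committed", "aborted")
print(len(fingerprint))  # 64
```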

Guardrails

Run the repository guardrail check:

python tools/check_datahive_guardrails.py

The script scans C++ sources/headers and step packages and hard-fails when code:

  • uses canonical workspace/data_hive literals outside the native gatekeeper location
  • violates the thin-steps rule in sandcastle/cf_basic_steps/*/src/*/cpp
  • reintroduces backend-specific ownership in cf_basic_sinks package surfaces
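The first of those checks boils down to a literal scan with an allow-listed location. This is a simplified sketch of that one check, not the actual tools/check_datahive_guardrails.py, which also enforces the other two rules:

```python
import tempfile
from pathlib import Path

# Only the native gatekeeper may reference the canonical path literal.
ALLOWED = "sandcastle/cf_datahive/src/cf_datahive/cpp"

def find_literal_violations(root: Path, literal: str = "workspace/data_hive") -> list:
    """Return C++ sources/headers outside the gatekeeper location that
    embed the canonical data-hive path literal."""
    hits = []
    for path in root.rglob("*"):
        if path.suffix not in (".cpp", ".h", ".hpp"):
            continue
        if ALLOWED in path.as_posix():
            continue
        if literal in path.read_text(encoding="utf-8", errors="ignore"):
            hits.append(path)
    return hits

# Usage against a throwaway tree containing one violation
root = Path(tempfile.mkdtemp())
bad = root / "sandcastle/cf_basic_steps/foo/src/foo/cpp/step.cpp"
bad.parent.mkdir(parents=True)
bad.write_text('auto p = "workspace/data_hive/x";')
print([p.name for p in find_literal_violations(root)])  # ['step.cpp']
```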

Testing

Install test dependencies and run:

pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests

Published distribution name:

pip install cf-datahive

Publishing

cf_datahive is published with the dedicated Windows workflow and now owns the packaged native consumer boundary that cf-pipeline-engine links against:

  • Workflow: .github/workflows/cf_datahive_windows_publish.yml
  • Package directory: sandcastle/cf_datahive
  • PyPI tag: cf-datahive-v<version>
  • TestPyPI tag: cf-datahive-v<version>-test

Local preflight:

powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13

Queue a dry-run dispatch:

powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun

Do / Don't

  • Do: use DataHiveClient read APIs (list_runs, load_manifest, read_table, open_artifact) for inspection and validation.
  • Do: route pipeline write ownership through cf_datahive_cpp in the sink path.
  • Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
  • Don't: bypass manifest updates.
