cf_datahive

Canonical result and measurement data storage APIs for Cogniflow.

cf_datahive is the Data Hive package boundary: Python-facing APIs and tooling around the canonical data hive root (workspace/data_hive).

Boundary (Current Phase)

  • Python package role (sandcastle/cf_datahive): read-oriented API/tooling/validation for pipeline-facing workflows.
  • Native role (sandcastle/cf_datahive/src/cf_datahive/cpp): write gatekeeper and the only allowed writer under workspace/data_hive.
  • Step packages must stay thin wrappers and call the native gatekeeper instead of implementing their own filesystem/parquet helpers.
  • Downstream first-party native consumers must discover the gatekeeper source surface through the owner package API instead of reaching into repo-relative paths.

Development workflow

  • Current development mode is source-first via scripts/fresh_install.ps1.
  • The package can now be built and published independently without changing the read/write ownership boundary above.

Canonical layout

workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt

  • latest.txt stores the committed run_id and is updated atomically.
  • manifest.json is the source of truth for run metadata, table metadata, file hashes, and artifact hashes.
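
This layout makes the committed run of a pipeline cheap to resolve. A minimal sketch, assuming the tree above (real code should go through DataHiveClient rather than touching the tree directly):

from pathlib import Path

# Illustrative only: resolve the committed run directory for one pipeline.
def resolve_committed_run(workspace_root: Path, pipeline_id: str) -> Path:
    pipeline_dir = workspace_root / "data_hive" / pipeline_id
    run_id = (pipeline_dir / "latest.txt").read_text().strip()  # committed run_id
    return pipeline_dir / "runs" / run_id

run_dir = resolve_committed_run(Path("workspace"), "opcua_fifo_avg")
print(run_dir / "manifest.json")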

Usage

from pathlib import Path

from cf_datahive import (
    DataHiveClient,
    cf_datahive_cpp_consumer_cmake_path,
    cf_datahive_cpp_source_path,
)

workspace_root = Path("workspace")  # directory that contains data_hive/
client = DataHiveClient(str(workspace_root))

runs = client.list_runs("opcua_fifo_avg")
if runs:
    latest = runs[0].run_id  # first entry is treated as the latest run
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)
    print(cf_datahive_cpp_source_path())
    print(cf_datahive_cpp_consumer_cmake_path())

Native owner API:

  • cf_datahive_cpp_source_path() returns the installed/package-owned native source root used by first-party build consumers such as cf_basic_sinks.
  • cf_datahive_cpp_include_path() returns the include root inside that native source tree.
  • cf_datahive_cpp_consumer_cmake_path() returns the owner-provided CMake helper for downstream native consumers that need runtime staging without re-encoding backend policy.
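
Of these, cf_datahive_cpp_include_path() is not exercised in the Usage example above. A minimal sketch of consuming it, assuming it is exported at the package top level like the other two helpers (the flag handling is illustrative, not an owner-provided helper):

from cf_datahive import cf_datahive_cpp_include_path

# Illustrative: hand the owner-packaged include root to an external compile step.
include_dir = cf_datahive_cpp_include_path()
cxx_flags = [f"-I{include_dir}"]
print(" ".join(cxx_flags))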

Native consumer ownership

cf_datahive owns the backend-specific native build and runtime policy for cf_datahive_cpp. First-party step packages should consume that owner surface instead of carrying their own DuckDB rules.

Typical consumer pattern:

# Ask the installed cf_datahive package where the owner-packaged native source lives.
execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_source_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_SOURCE_DIR
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

# Build the native gatekeeper from the owner-provided source tree.
add_subdirectory(${CF_DATAHIVE_CPP_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR}/cf_datahive_cpp_build)

# Stage the owner-defined runtime files next to the consuming plugin.
cf_datahive_stage_consumer_runtime(
  TARGET my_step_plugin
  DESTINATIONS
    "${CMAKE_CURRENT_SOURCE_DIR}/../bin"
    "${SKBUILD_PLATLIB_DIR}/my_step_package/bin"
)

DuckDB configuration remains owner-controlled under cf_datahive:

  • default mode is static
  • shared mode can be selected with CF_DATAHIVE_CPP_DUCKDB_LINKAGE=shared
  • owner-supported override vars are CF_DATAHIVE_CPP_DUCKDB_INCLUDE, CF_DATAHIVE_CPP_DUCKDB_LIB, CF_DATAHIVE_CPP_DUCKDB_SOURCE, and on Windows CF_DATAHIVE_CPP_DUCKDB_DLL
  • when no override vars are set, cf_datahive searches for a repo-local .native_deps/duckdb by walking upward from the consuming CMake source tree before falling back to the owner package tree
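
The upward search in the last bullet behaves roughly like the following sketch (illustrative only; the actual lookup lives in the owner-provided CMake logic, not in Python):

from pathlib import Path

# Illustrative: walk upward from the consuming source tree looking for a
# repo-local .native_deps/duckdb, falling back to the owner package tree.
def find_duckdb_root(consumer_source_dir: Path, owner_package_tree: Path) -> Path:
    for parent in [consumer_source_dir, *consumer_source_dir.parents]:
        candidate = parent / ".native_deps" / "duckdb"
        if candidate.is_dir():
            return candidate
    return owner_package_tree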

Manifest details

Each run stores a RunManifest (schema_version="1.0") with:

  • run lifecycle fields (status: staged|committed|aborted)
  • table entries (parquet, schema fingerprint, row/file counts, optional file hashes)
  • artifact entries (sha256, media type, size)
  • optional semantic_refs placeholder map for future ontology links

The schema fingerprint is the sha256 digest of the Arrow schema serialization bytes.
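
A minimal sketch of that computation, assuming pyarrow and that the serialized schema bytes match what the package actually hashes:

import hashlib

import pyarrow as pa

# Illustrative: fingerprint = sha256 over the serialized Arrow schema bytes.
def schema_fingerprint(schema: pa.Schema) -> str:
    return hashlib.sha256(schema.serialize().to_pybytes()).hexdigest()

example = pa.schema([("timestamp", pa.timestamp("us")), ("value", pa.float64())])
print(schema_fingerprint(example))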

Guardrails

Run the repository guardrail check:

python tools/check_datahive_guardrails.py

The script scans C++ sources/headers and step packages and fails hard when they:

  • use canonical workspace/data_hive literals outside the native gatekeeper location
  • violate the thin-steps rule in sandcastle/cf_basic_steps/*/src/*/cpp
  • reintroduce backend-specific ownership in cf_basic_sinks package surfaces

Testing

Install test dependencies and run:

pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests

Or install the published distribution from PyPI:

pip install cf-datahive

Publishing

cf_datahive is published with the dedicated Windows workflow:

  • Workflow: .github/workflows/cf_datahive_windows_publish.yml
  • Package directory: sandcastle/cf_datahive
  • PyPI tag: cf-datahive-v<version>
  • TestPyPI tag: cf-datahive-v<version>-test

Local preflight:

powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13

Queue a dry-run dispatch:

powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun

Do / Don't

  • Do: use DataHiveClient read APIs (list_runs, load_manifest, read_table, open_artifact) for inspection and validation.
  • Do: route pipeline write ownership through cf_datahive_cpp in the sink path.
  • Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
  • Don't: bypass manifest updates.
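
A read-side sketch that also exercises open_artifact, assuming it mirrors read_table's (pipeline_id, run_id, name) argument order and returns a file-like object (the artifact name is hypothetical):

from cf_datahive import DataHiveClient

client = DataHiveClient("workspace")
runs = client.list_runs("opcua_fifo_avg")
if runs:
    run_id = runs[0].run_id
    # Assumed signature and return type; check the DataHiveClient API for specifics.
    artifact = client.open_artifact("opcua_fifo_avg", run_id, "report.json")
    print(artifact.read())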

