Canonical result and measurement data storage APIs for Cogniflow
cf_datahive
cf_datahive is the Data Hive package boundary for Python-facing APIs/tooling around the canonical data hive root (`workspace/data_hive`).
Boundary (Current Phase)
- Python package role (`sandcastle/cf_datahive`): read-oriented API/tooling/validation for pipeline-facing workflows.
- Native role (`sandcastle/cf_datahive/src/cf_datahive/cpp`): write gatekeeper and the only allowed writer under `workspace/data_hive`.
- Step packages must stay thin wrappers and call the native gatekeeper instead of implementing filesystem/parquet helpers.
- Downstream first-party native consumers must discover the packaged gatekeeper consumer surface through the owner package API instead of repo-relative path reach-in.
Development workflow
- Current development mode is source-first via `scripts/fresh_install.ps1`.
- The package can now be built and published independently without changing the read/write ownership boundary above.
Canonical layout
```
workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt
```
`latest.txt` stores the committed `run_id` and is updated atomically. `manifest.json` is the source of truth for run metadata, table metadata, file hashes, and artifact hashes.
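The atomic update of `latest.txt` follows the usual write-to-temp-then-rename pattern. The sketch below is illustrative only: the real writer is the native gatekeeper, and `commit_latest` is a hypothetical helper name, assuming `latest.txt` sits alongside a pipeline's `runs/` directory as in the layout above.

```python
import os
import tempfile
from pathlib import Path


def commit_latest(pipeline_dir: Path, run_id: str) -> None:
    """Atomically point latest.txt at a committed run_id (illustrative sketch)."""
    pipeline_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the final os.replace
    # is a same-filesystem rename, which is atomic.
    fd, tmp = tempfile.mkstemp(dir=pipeline_dir, prefix="latest.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(run_id)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, pipeline_dir / "latest.txt")
    except BaseException:
        os.unlink(tmp)
        raise
```

Readers therefore never observe a half-written `latest.txt`: they see either the previous committed `run_id` or the new one.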
Usage
```python
from pathlib import Path

from cf_datahive import (
    DataHiveClient,
    cf_datahive_cpp_consumer_cmake_path,
    cf_datahive_cpp_import_library_path,
    cf_datahive_cpp_include_path,
)

workspace_root = Path("workspace")
client = DataHiveClient(str(workspace_root))

runs = client.list_runs("opcua_fifo_avg")
if runs:
    latest = runs[0].run_id
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)

print(cf_datahive_cpp_include_path())
print(cf_datahive_cpp_import_library_path())
print(cf_datahive_cpp_consumer_cmake_path())
```
Native owner API:
- `cf_datahive_cpp_include_path()` returns the packaged include root for the native gatekeeper.
- `cf_datahive_cpp_library_path()` returns the packaged runtime library path.
- `cf_datahive_cpp_import_library_path()` returns the packaged link artifact path that first-party native consumers link against.
- `cf_datahive_cpp_runtime_dir()` returns the packaged runtime directory to stage alongside native consumers.
- `cf_datahive_cpp_consumer_cmake_path()` returns the owner-provided CMake helper for downstream native consumers that need target import plus runtime staging without re-encoding backend policy.
Native consumer ownership
cf_datahive owns the backend-specific native build, packaging, and runtime policy for cf_datahive_cpp.
First-party native consumers should link against that packaged owner surface instead of embedding cf_datahive_cpp sources or carrying their own DuckDB rules.
Typical consumer pattern:
```cmake
execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_include_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_INCLUDE_DIR
  OUTPUT_STRIP_TRAILING_WHITESPACE
)
execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)
execute_process(
  COMMAND ${Python3_EXECUTABLE} -c "import cf_datahive as d; print(d.cf_datahive_cpp_import_library_path())"
  OUTPUT_VARIABLE CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH
  OUTPUT_STRIP_TRAILING_WHITESPACE
)

include("${Python3_SITEARCH}/cf_datahive/native/cmake/cf_datahive_consumer.cmake")

cf_datahive_import_cpp_target(
  TARGET cf_datahive_cpp
  INCLUDE_DIR "${CF_DATAHIVE_CPP_INCLUDE_DIR}"
  LIBRARY_PATH "${CF_DATAHIVE_CPP_LIBRARY_PATH}"
  IMPORT_LIBRARY_PATH "${CF_DATAHIVE_CPP_IMPORT_LIBRARY_PATH}"
)
cf_datahive_stage_consumer_runtime(
  TARGET my_step_plugin
  RUNTIME_DIR "${Python3_SITEARCH}/cf_datahive/native/bin"
  DESTINATIONS "${CMAKE_CURRENT_SOURCE_DIR}/../bin" "${SKBUILD_PLATLIB_DIR}/my_step_package/bin"
)
```
DuckDB configuration remains owner-controlled under cf_datahive and moves out of consumer workflows:
- default mode is `static`
- shared mode can be selected with `CF_DATAHIVE_CPP_DUCKDB_LINKAGE=shared`
- owner-supported override vars are `CF_DATAHIVE_CPP_DUCKDB_INCLUDE`, `CF_DATAHIVE_CPP_DUCKDB_LIB`, `CF_DATAHIVE_CPP_DUCKDB_SOURCE`, and on Windows `CF_DATAHIVE_CPP_DUCKDB_DLL`
- the `cf_datahive` build/publish workflow is responsible for staging those owner dependencies before packaging the native consumer surface
Manifest details
Each run stores a RunManifest (`schema_version="1.0"`) with:
- run lifecycle fields (`status`: `staged|committed|aborted`)
- table entries (parquet, schema fingerprint, row/file counts, optional file hashes)
- artifact entries (sha256, media type, size)
- optional `semantic_refs` placeholder map for future ontology links

The schema fingerprint is the sha256 of the Arrow schema serialization bytes.
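A minimal sketch of that fingerprint rule, hashing the serialized schema bytes. With pyarrow the input would come from something like `pa.schema([...]).serialize().to_pybytes()`; stand-in bytes are used here so the sketch carries no Arrow dependency, and `schema_drifted` is a hypothetical helper name.

```python
import hashlib


def schema_fingerprint(schema_bytes: bytes) -> str:
    # sha256 hex digest over the serialized schema bytes, matching the
    # manifest's fingerprint rule.
    return hashlib.sha256(schema_bytes).hexdigest()


def schema_drifted(manifest_fingerprint: str, schema_bytes: bytes) -> bool:
    # A reader can recompute the fingerprint and compare it against the
    # manifest entry to detect schema drift before trusting cached schemas.
    return schema_fingerprint(schema_bytes) != manifest_fingerprint
```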
Guardrails
Run the repository guardrail check:
```shell
python tools/check_datahive_guardrails.py
```

The script performs C++/header scans and step-package checks that flag code which:
- uses canonical `workspace/data_hive` literals outside the native gatekeeper location (hard fail)
- violates the thin-steps rule in `sandcastle/cf_basic_steps/*/src/*/cpp` (hard fail)
- reintroduces backend-specific ownership in `cf_basic_sinks` package surfaces (hard fail)
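The first of those checks can be pictured as a path-literal scan. This is a simplified sketch, not the real `tools/check_datahive_guardrails.py`; the gatekeeper prefix below is taken from the boundary description above.

```python
from pathlib import Path

# Sources under the native gatekeeper are the one place allowed to
# mention the canonical root literal.
GATEKEEPER_PREFIX = "sandcastle/cf_datahive/src/cf_datahive/cpp"
CANONICAL_LITERAL = "workspace/data_hive"
CPP_SUFFIXES = {".cpp", ".cc", ".h", ".hpp"}


def scan_for_literals(repo_root: Path) -> list[Path]:
    """Return C++ sources/headers outside the gatekeeper that hard-code
    the canonical data hive root (each one would be a hard fail)."""
    offenders = []
    for path in sorted(repo_root.rglob("*")):
        if path.suffix not in CPP_SUFFIXES:
            continue
        rel = path.relative_to(repo_root).as_posix()
        if rel.startswith(GATEKEEPER_PREFIX):
            continue  # the gatekeeper is the one allowed writer
        if CANONICAL_LITERAL in path.read_text(encoding="utf-8", errors="ignore"):
            offenders.append(path)
    return offenders
```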
Testing
Install test dependencies and run:
```shell
pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests
```

Published distribution name:

```shell
pip install cf-datahive
```
Publishing
cf_datahive is published with the dedicated Windows workflow and now owns the packaged native consumer boundary that cf-pipeline-engine links against:
- Workflow: `.github/workflows/cf_datahive_windows_publish.yml`
- Package directory: `sandcastle/cf_datahive`
- PyPI tag: `cf-datahive-v<version>`
- TestPyPI tag: `cf-datahive-v<version>-test`
Local preflight:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13
```
Queue a dry-run dispatch:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun
```
Do / Don't
- Do: use `DataHiveClient` read APIs (`list_runs`, `load_manifest`, `read_table`, `open_artifact`) for inspection and validation.
- Do: route pipeline write ownership through `cf_datahive_cpp` in the sink path.
- Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
- Don't: bypass manifest updates.
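One reason manifest updates must not be bypassed: the manifest's recorded hashes let any reader verify what is on disk. A sketch of that check using only the stdlib; the `"artifacts"` name-to-`{"sha256": ...}` map used here is a hypothetical JSON shape for illustration, while the authoritative schema is the RunManifest described above.

```python
import hashlib
import json
from pathlib import Path


def verify_artifact(run_dir: Path, name: str) -> bool:
    """Check an artifact's bytes against the sha256 recorded in manifest.json."""
    manifest = json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))
    expected = manifest["artifacts"][name]["sha256"]
    # Recompute the digest from the bytes actually on disk.
    actual = hashlib.sha256((run_dir / "artifacts" / name).read_bytes()).hexdigest()
    return actual == expected
```

An artifact written around the manifest (or tampered with afterwards) fails this check, which is what makes the manifest-as-source-of-truth rule enforceable.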
File details
Details for the file cf_datahive-0.1.6.tar.gz.
File metadata
- Size: 25.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `473d7cb58fa09f5597569e13652de6023c44d5ba98fdb88125ae3e9b4754c651` |
| MD5 | `7f76c9ae1ea2833807df93c0546d1d79` |
| BLAKE2b-256 | `fc34913d047929d92481767d1497ad9768def2da1d5e8bfad985f8b7929a1671` |
File details
Details for the file cf_datahive-0.1.6-cp313-cp313-win_amd64.whl.
File metadata
- Size: 11.9 MB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7ecbe77f052d819f90a322705f93b8df6749f4aab3316aa42c6060d3626b083f` |
| MD5 | `06bdc0aec9a933fe287138bba92cefe9` |
| BLAKE2b-256 | `f8a8d82abbc40baf172d481bfc0ced9ab0579717b20302efb8d70bc361e9f02d` |