
Canonical result and measurement data storage APIs for Cogniflow


cf_datahive

cf_datahive is the Data Hive package boundary: the Python-facing APIs and tooling around the canonical data hive root (workspace/data_hive).

Boundary (Current Phase)

  • Python package role (sandcastle/cf_datahive): read-oriented APIs, tooling, and validation for pipeline-facing workflows.
  • Native role (sandcastle/cf_datahive/src/cf_datahive/cpp): the write gatekeeper and the only allowed writer under workspace/data_hive.
  • Step packages must remain thin wrappers that call the native gatekeeper rather than implementing their own filesystem or Parquet helpers.
  • Downstream first-party native consumers must discover the gatekeeper source surface through the owner package API (see "Native owner API" below) instead of reaching into the repository with relative paths.

Development workflow

  • Current development mode is source-first via scripts/fresh_install.ps1.
  • The package can now be built and published independently without changing the read/write ownership boundary above.

Canonical layout

workspace/
  data_hive/
    <pipeline_id>/
      runs/
        <run_id>/
          manifest.json
          tables/
            <table_name>/
              part-0000.parquet
              part-0001.parquet
          artifacts/
            <artifact_name>
      latest.txt
  • latest.txt stores the committed run_id and is updated atomically.
  • manifest.json is the source of truth (SOT) for run metadata, table metadata, file hashes, and artifact hashes.
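Because latest.txt must never be observed half-written, the commit step relies on an atomic replace. The native gatekeeper owns the real implementation; the sketch below only illustrates the write-temp-then-rename pattern with the standard library (the function name commit_latest is hypothetical).

```python
import os
import tempfile
from pathlib import Path

def commit_latest(pipeline_dir: Path, run_id: str) -> None:
    """Atomically point latest.txt at a committed run_id.

    Write to a temporary file in the same directory, then os.replace()
    it over latest.txt so readers never see a partial write.
    """
    fd, tmp = tempfile.mkstemp(dir=pipeline_dir, prefix="latest.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(run_id + "\n")
        os.replace(tmp, pipeline_dir / "latest.txt")  # atomic on POSIX and NTFS
    except BaseException:
        os.unlink(tmp)
        raise
```

The temporary file is created in the same directory as latest.txt so the rename stays on one filesystem, which is what makes os.replace atomic.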

Usage

from pathlib import Path

from cf_datahive import DataHiveClient, cf_datahive_cpp_source_path

workspace_root = Path("workspace")
client = DataHiveClient(str(workspace_root))

runs = client.list_runs("opcua_fifo_avg")
if runs:
    latest = runs[0].run_id
    manifest = client.load_manifest("opcua_fifo_avg", latest)
    table = client.read_table("opcua_fifo_avg", latest, "measurements")
    print(manifest.status, table.num_rows)
    print(cf_datahive_cpp_source_path())

Native owner API:

  • cf_datahive_cpp_source_path() returns the installed/package-owned native source root used by first-party build consumers such as cf_basic_sinks.
  • cf_datahive_cpp_include_path() returns the include root inside that native source tree.

Manifest details

Each run stores a RunManifest (schema_version="1.0") with:

  • run lifecycle fields (status: staged|committed|aborted)
  • table entries (parquet, schema fingerprint, row/file counts, optional file hashes)
  • artifact entries (sha256, media type, size)
  • optional semantic_refs placeholder map for future ontology links

Schema fingerprint is sha256 of Arrow schema serialization bytes.
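A minimal sketch of that fingerprint, using only the standard library. The schema bytes here are a stand-in (with pyarrow they would come from pa.schema([...]).serialize().to_pybytes()), and the manifest field names below are illustrative, not the exact RunManifest schema.

```python
import hashlib

# Stand-in for the Arrow schema serialization bytes; in practice these
# would come from pa.schema([...]).serialize().to_pybytes().
schema_bytes = b"illustrative-schema-bytes"

# schema fingerprint as stored in the manifest: sha256 over those bytes
schema_fingerprint = hashlib.sha256(schema_bytes).hexdigest()

# Illustrative manifest shape (field names are a sketch):
manifest = {
    "schema_version": "1.0",
    "status": "committed",  # staged | committed | aborted
    "tables": {
        "measurements": {
            "schema_fingerprint": schema_fingerprint,
            "row_count": 0,
            "file_count": 1,
        }
    },
    "artifacts": {},
    "semantic_refs": {},
}
```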

Guardrails

Run the repository guardrail check:

python tools/check_datahive_guardrails.py

The script scans C++ sources/headers and step packages and fails hard when code:

  • uses canonical workspace/data_hive literals outside the native gatekeeper location
  • violates the thin-steps rule in sandcastle/cf_basic_steps/*/src/*/cpp
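The real check lives in tools/check_datahive_guardrails.py; the sketch below is a simplified, hypothetical version of the literal scan only, to show the shape of the rule (the exclusion by the substring "cf_datahive/cpp" is an assumption, not the actual allow-list logic).

```python
import re
from pathlib import Path

# Flag any C++ source/header that hardcodes the canonical hive path.
HIVE_LITERAL = re.compile(r"workspace[/\\]data_hive")

def scan(root: Path) -> list[Path]:
    offenders = []
    for path in sorted(root.rglob("*")):
        if path.suffix not in {".cpp", ".h", ".hpp"}:
            continue
        # The native gatekeeper is the one place allowed to use the literal.
        if "cf_datahive/cpp" in path.as_posix():
            continue
        if HIVE_LITERAL.search(path.read_text(encoding="utf-8", errors="ignore")):
            offenders.append(path)
    return offenders
```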

Testing

Install test dependencies and run:

pip install -e "sandcastle/cf_datahive[test]"
pytest -q sandcastle/cf_datahive/tests

Or install the published distribution from PyPI:

pip install cf-datahive

Publishing

cf_datahive is published with the dedicated Windows workflow:

  • Workflow: .github/workflows/cf_datahive_windows_publish.yml
  • Package directory: sandcastle/cf_datahive
  • PyPI tag: cf-datahive-v<version>
  • TestPyPI tag: cf-datahive-v<version>-test

Local preflight:

powershell -ExecutionPolicy Bypass -File scripts/mimic_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PythonExe py `
  -PythonVersion 3.13

Queue a dry-run dispatch:

powershell -ExecutionPolicy Bypass -File scripts/queue_windows_python_publish_workflow.ps1 `
  -WorkflowFile .github/workflows/cf_datahive_windows_publish.yml `
  -PackageDir sandcastle/cf_datahive `
  -PublishTarget testpypi `
  -Ref main `
  -RequireLocalPass `
  -DryRun

Do / Don't

  • Do: use DataHiveClient read APIs (list_runs, load_manifest, read_table, open_artifact) for inspection and validation.
  • Do: route pipeline write ownership through cf_datahive_cpp in the sink path.
  • Don't: write parquet files or artifacts directly into the canonical data hive root from pipeline steps.
  • Don't: bypass manifest updates.
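Readers can detect a bypassed manifest update by recomputing hashes. The sketch below assumes the manifest stores artifacts as {name: {"sha256": ...}} and that artifacts live under artifacts/ as shown in the canonical layout; the real RunManifest field names may differ.

```python
import hashlib
import json
from pathlib import Path

def verify_artifacts(run_dir: Path) -> dict[str, bool]:
    """Compare each artifact's on-disk sha256 against manifest.json.

    Returns {artifact_name: True if the recorded hash matches the file}.
    """
    manifest = json.loads((run_dir / "manifest.json").read_text(encoding="utf-8"))
    results = {}
    for name, entry in manifest.get("artifacts", {}).items():
        data = (run_dir / "artifacts" / name).read_bytes()
        results[name] = hashlib.sha256(data).hexdigest() == entry["sha256"]
    return results
```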

Download files

  • Source distribution: cf_datahive-0.1.1.tar.gz (22.1 kB)
  • Built distribution: cf_datahive-0.1.1-py3-none-any.whl (19.2 kB, Python 3)

File details

Details for the file cf_datahive-0.1.1.tar.gz.

File metadata

  • Download URL: cf_datahive-0.1.1.tar.gz
  • Upload date:
  • Size: 22.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for cf_datahive-0.1.1.tar.gz
Algorithm Hash digest
SHA256 cfdbb46800cc3c30774a70a8fdf26ae10a8daa8127b3c9b32f6aae47c6385a57
MD5 a8497ca37d34e567df14935615754132
BLAKE2b-256 0bd9df64746f098fe250be5e368ffbaf708c2344f4d55bffe1a555d97ac2b8ba

See more details on using hashes here.

File details: cf_datahive-0.1.1-py3-none-any.whl

  • Size: 19.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing: yes
  • Uploaded via: twine/6.2.0 CPython/3.13.12

Hashes:

  • SHA256: 9ead65e75fb69d736dc0b1868cbff8d06924e585c820b5d3ad4fe4df55e3dec4
  • MD5: bbc30cadd88e62362020d79de5c4ab7e
  • BLAKE2b-256: 561488d06083cfca0e55585f3a3c42452e05f3bd16266678d33964f82d5353cc
