CLI toolkit for validating agent-ready data products from local manifests.

dataproduct-kit

dataproduct-kit is an open-source CI gate for agent-safe data products. It validates contracts, quality checks, freshness, semantic metrics, policy constraints, and evidence outputs before AI agents consume data-product context.

Why this exists

Enterprise data products are increasingly consumed by BI users, platform teams, and AI agents. The hard part is not just finding data; it is knowing whether the data has an owner, a contract, a stable metric definition, quality checks, freshness context, lineage, and policy constraints.

dataproduct-kit makes that trust context explicit and testable in a local repo.

Quickstart after release

To test a local branch before a public release, use the editable install described under "Develop locally" instead.

pipx install dataproduct-kit
dataproduct-kit init demo demo --template saas-churn
dataproduct-kit ci demo --profile starter
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json

Bring your own CSV

dataproduct-kit init from-csv data/customers.csv --out data-products/customers
dataproduct-kit doctor data-products/customers
dataproduct-kit ci data-products/customers --profile starter

The CSV scaffold creates starter manifests with inferred columns and TODO governance fields. See docs/from-csv.md for the graduation path from a local starter to a production gate.

Develop locally

python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"

Try the SaaS churn demo

dataproduct-kit init demo demo --template saas-churn
dataproduct-kit ci demo --profile starter
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json
dataproduct-kit export odcs demo
dataproduct-kit export osi demo
dataproduct-kit emit openlineage demo

You can also request machine-readable output or tighten the failure threshold:

dataproduct-kit validate demo --format json
dataproduct-kit validate demo --fail-on warn

Expected validation output:

status: pass

The Markdown report starts like this:

# Trust Report: SaaS Churn Data Product

| Field | Value |
| --- | --- |
| Product ID | saas_churn |
| Overall status | pass |

Manifest model

A data product directory contains four source-of-truth files:

  • dataproduct.yaml: product identity, owner, datasets, freshness SLA.
  • contract.yaml: schema fields, classifications, and built-in quality checks.
  • semantic.yaml: approved metrics, dimensions, entities, and expressions.
  • policy.yaml: allowed purposes, sensitive fields, and AI/BI usage constraints.
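
As a rough illustration of the layout above, the following sketch checks that a directory holds all four manifests. It is not part of dataproduct-kit; the helper name `missing_manifests` is hypothetical, and only the four file names come from this README.

```python
from pathlib import Path

# The four source-of-truth file names listed in this README.
REQUIRED_MANIFESTS = (
    "dataproduct.yaml",
    "contract.yaml",
    "semantic.yaml",
    "policy.yaml",
)


def missing_manifests(product_dir):
    """Return the required manifest names that are absent from product_dir.

    Hypothetical helper for illustration; the real CLI may check more than
    file presence.
    """
    root = Path(product_dir)
    return [name for name in REQUIRED_MANIFESTS if not (root / name).is_file()]


if __name__ == "__main__":
    # An empty list means the directory has the expected layout.
    print(missing_manifests("demo"))
```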

The bundled demo uses local CSV data and DuckDB, so it needs no cloud account or running database.

Generate JSON Schema for editor integration or manifest authoring:

dataproduct-kit schema dataproduct
dataproduct-kit schema all --out schemas

Validate

dataproduct-kit validate demo

The command exits with 0 for pass or warn, and 1 for fail.
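
The exit-code rule can be sketched as a small mapping. This is an illustration, not the tool's implementation: the default branch mirrors the sentence above, while the `fail_on="warn"` branch assumes that `--fail-on warn` promotes warnings to a non-zero exit, which this README implies but does not spell out.

```python
# Severity ordering assumed for illustration only.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def exit_code(status, fail_on="fail"):
    """Return 1 when status is at least as severe as fail_on, otherwise 0.

    fail_on="fail" reproduces the documented default (0 for pass or warn,
    1 for fail). fail_on="warn" is an assumed reading of `--fail-on warn`.
    """
    if status not in SEVERITY:
        raise ValueError(f"unknown status: {status!r}")
    if fail_on not in ("warn", "fail"):
        raise ValueError(f"unknown threshold: {fail_on!r}")
    return 1 if SEVERITY[status] >= SEVERITY[fail_on] else 0


if __name__ == "__main__":
    for status in ("pass", "warn", "fail"):
        print(status, exit_code(status), exit_code(status, fail_on="warn"))
```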

For repository-wide pull request checks, use the CI command:

dataproduct-kit ci . --profile production --format text
dataproduct-kit ci . --profile production --format github --fail-on warn --sarif dataproduct-kit.sarif.json

The CI command discovers every directory containing dataproduct.yaml below the path, validates each data product, emits a suite summary, and can write SARIF for audit evidence or code-scanning upload. Use starter for local onboarding and production for pull request gates; see docs/readiness-profiles.md for the full profile behavior.
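
The discovery step described above can be approximated in a few lines. This is a minimal sketch under the assumption that discovery is a plain recursive search for dataproduct.yaml; the function name `discover_products` is hypothetical and the real command may apply extra filters.

```python
from pathlib import Path


def discover_products(root):
    """Return sorted directories at or below root that contain dataproduct.yaml.

    Hypothetical helper: a plain recursive search, with no include/exclude
    handling.
    """
    return sorted(
        manifest.parent
        for manifest in Path(root).rglob("dataproduct.yaml")
        if manifest.is_file()
    )


if __name__ == "__main__":
    for product_dir in discover_products("."):
        print(product_dir)
```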

Repository defaults can live in dataproduct-kit.toml:

[ci]
include = ["data-products/**"]
exclude = ["data-products/sandbox/**"]
profile = "production"
fail_on = "warn"

You can also use the bundled GitHub Action:

- uses: johnmikel/dataproduct-kit@v0.4.0
  with:
    path: "."
    profile: "production"
    fail-on: "warn"
    format: "github"
    sarif: "dataproduct-kit.sarif.json"

Reports and agent context

dataproduct-kit report demo --format json
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json

The context command returns metric definition, freshness, policy, and lineage metadata. It deliberately does not answer business questions or generate SQL.

Example context fields:

{
  "metric": {
    "name": "churn_rate",
    "dataset": "subscriptions",
    "grain": "month"
  },
  "quality_status": "pass"
}
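
An agent or wrapper script could gate on these fields before using a metric. The sketch below relies only on the keys shown in the example above (`metric.name` and `quality_status`); the function name and the "usable only when pass" rule are assumptions, and the real payload contains more fields than shown here.

```python
import json


def metric_is_usable(context_json, expected_metric):
    """Return True when the context describes expected_metric with a passing status.

    Hypothetical gate for illustration; it reads only the keys shown in the
    README example.
    """
    context = json.loads(context_json)
    metric = context.get("metric", {})
    return (
        metric.get("name") == expected_metric
        and context.get("quality_status") == "pass"
    )


if __name__ == "__main__":
    sample = json.dumps(
        {
            "metric": {"name": "churn_rate", "dataset": "subscriptions", "grain": "month"},
            "quality_status": "pass",
        }
    )
    print(metric_is_usable(sample, "churn_rate"))
```

In practice the JSON string would come from the output of `dataproduct-kit context demo --metric churn_rate --format json`.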

Standards outputs

dataproduct-kit export odcs demo
dataproduct-kit export osi demo
dataproduct-kit emit openlineage demo

Exports are generated from the local manifest profile and aligned with the following standards:

  • ODCS-compatible data contract JSON.
  • OSI-inspired semantic model JSON.
  • OpenLineage-compatible validation event JSONL.

Use --out to write standards exports to files:

dataproduct-kit export odcs demo --out contract.json
dataproduct-kit export osi demo --out semantic.json

What this catches

The validator returns fail for issues such as:

  • Missing required columns.
  • Values that cannot cast to the declared contract type.
  • Null or blank values in non-nullable fields.
  • Failed quality checks such as uniqueness, accepted values, and row count.
  • Stale data based on the dataset freshness SLA.
  • Metrics that reference unknown dimensions or invalid expressions.
  • Policy fields that reference columns not declared in the contract.
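
Two of the checks above are simple enough to sketch directly. The code below is illustrative only and does not reflect how dataproduct-kit implements them; the function names, argument shapes, and the strict "older than the SLA" comparison are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def undeclared_policy_fields(policy_fields, contract_columns):
    """Return policy fields that are not declared as contract columns, sorted."""
    declared = set(contract_columns)
    return sorted(field for field in policy_fields if field not in declared)


def is_stale(last_updated, sla, now):
    """Return True when the data is older than the freshness SLA.

    All three arguments are supplied by the caller; datetimes should be
    timezone-aware so the subtraction is unambiguous.
    """
    return now - last_updated > sla


if __name__ == "__main__":
    print(undeclared_policy_fields(["email", "ssn"], ["id", "email"]))
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(hours=25), timedelta(hours=24), now))
```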

Runnable examples live under examples/:

  • examples/pass/saas-churn
  • examples/fail/schema-drift
  • examples/fail/stale-data
  • examples/fail/broken-metric
  • examples/fail/policy-gap

Further documentation:

  • docs/usage-scenarios.md: concrete usage scenarios.
  • docs/ci-adoption.md: pull request gate setup.
  • docs/readiness-profiles.md: profile behavior.
  • docs/from-csv.md: CSV onboarding.
  • docs/json-output.md: the stable automation contract.
  • docs/finding-codes.md: stable finding codes.
  • docs/suppressions.md: expiring exceptions.
  • docs/compatibility.md: supported automation surfaces.
  • docs/ci-rollout.md: a staged production rollout.
  • docs/publishing.md and docs/release-checklist.md: maintainer release notes.

Project status

This is v0.4-alpha. The local CLI and SaaS churn demo are usable, but the manifest profile and standards exports may change before a stable release.

See ROADMAP.md for planned standards depth, ecosystem adapters, and agent/platform integrations.

Verify

.venv/bin/python -m pytest
.venv/bin/python -m ruff check .
.venv/bin/python -m pip check
./scripts/verify.sh
