CLI toolkit for validating agent-ready data products from local manifests.
dataproduct-kit
dataproduct-kit is the open source CI gate for agent-safe data products.
It validates contracts, quality checks, freshness, semantic metrics, policy
constraints, and evidence outputs before AI agents consume data-product context.
Why this exists
Enterprise data products are increasingly consumed by BI users, platform teams, and AI agents. The hard part is not just finding data; it is knowing whether the data has an owner, a contract, a stable metric definition, quality checks, freshness context, lineage, and policy constraints.
dataproduct-kit makes that trust context explicit and testable in a local repo.
Quickstart after release
For local branch testing before a public release, use the editable install described under "Develop locally".
pipx install dataproduct-kit
dataproduct-kit init demo demo --template saas-churn
dataproduct-kit ci demo --profile starter
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json
Bring your own CSV
dataproduct-kit init from-csv data/customers.csv --out data-products/customers
dataproduct-kit doctor data-products/customers
dataproduct-kit ci data-products/customers --profile starter
The CSV scaffold creates starter manifests with inferred columns and TODO governance fields. See docs/from-csv.md for the graduation path from a local starter to a production gate.
Develop locally
python3 -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
Try the SaaS churn demo
dataproduct-kit init demo demo --template saas-churn
dataproduct-kit ci demo --profile starter
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json
dataproduct-kit export odcs demo
dataproduct-kit export osi demo
dataproduct-kit emit openlineage demo
You can also validate with machine-readable output:
dataproduct-kit validate demo --format json
dataproduct-kit validate demo --fail-on warn
Expected validation output:
status: pass
The Markdown report starts like this:
# Trust Report: SaaS Churn Data Product
| Field | Value |
| --- | --- |
| Product ID | saas_churn |
| Overall status | pass |
Manifest model
A data product directory contains four source-of-truth files:
- dataproduct.yaml: product identity, owner, datasets, freshness SLA.
- contract.yaml: schema fields, classifications, and built-in quality checks.
- semantic.yaml: approved metrics, dimensions, entities, and expressions.
- policy.yaml: allowed purposes, sensitive fields, and AI/BI usage constraints.
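As a quick pre-flight check before running the validator, the four-file layout can be verified with a few lines of Python. This is a minimal sketch, not part of dataproduct-kit: the helper name `missing_manifests` is hypothetical, and only the four file names come from the list above.

```python
from pathlib import Path

# File names taken from the manifest model described above.
REQUIRED_MANIFESTS = (
    "dataproduct.yaml",
    "contract.yaml",
    "semantic.yaml",
    "policy.yaml",
)


def missing_manifests(product_dir):
    """Return the required manifest file names absent from product_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are valid.
    """
    root = Path(product_dir)
    return [name for name in REQUIRED_MANIFESTS if not (root / name).is_file()]
```

An empty list means all four files are present; content validation is still the job of dataproduct-kit validate.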
The bundled demo uses local CSV data and DuckDB, so it needs no cloud account or running database.
Generate JSON Schema for editor integration or manifest authoring:
dataproduct-kit schema dataproduct
dataproduct-kit schema all --out schemas
Validate
dataproduct-kit validate demo
The command exits with 0 for pass or warn, and 1 for fail. Pass --fail-on warn, as in the example above, to fail the gate on warnings as well.
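Because the result is reported through the exit code, the command is easy to wrap in a script. The sketch below assumes dataproduct-kit is installed and on PATH; the function names are hypothetical, and only the validate subcommand and the --format and --fail-on flags shown in this README are used.

```python
import subprocess


def build_validate_command(path, output_format=None, fail_on=None):
    """Build the argv for `dataproduct-kit validate` using flags from this README."""
    argv = ["dataproduct-kit", "validate", str(path)]
    if output_format is not None:
        argv += ["--format", output_format]
    if fail_on is not None:
        argv += ["--fail-on", fail_on]
    return argv


def gate_passed(path, fail_on=None):
    """Run the validator and report whether it exited with 0.

    Assumes the exit-code behavior described above: 0 means the gate
    passed, any other value is treated as a failure.
    """
    result = subprocess.run(build_validate_command(path, fail_on=fail_on), check=False)
    return result.returncode == 0
```

For example, `gate_passed("demo", fail_on="warn")` runs the same command as the stricter validate example above.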
For repository-wide pull request checks, use the CI command:
dataproduct-kit ci . --profile production --format text
dataproduct-kit ci . --profile production --format github --fail-on warn --sarif dataproduct-kit.sarif.json
The CI command discovers every directory containing dataproduct.yaml below the
path, validates each data product, emits a suite summary, and can write SARIF for
audit evidence or code-scanning upload. Use starter for local onboarding and
production for pull request gates; see
docs/readiness-profiles.md for the full profile
behavior.
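The discovery rule described above (every directory below the path that contains dataproduct.yaml) can be approximated in a few lines, which is handy for previewing what a CI run would pick up. This is an illustrative sketch, not the tool's implementation; it ignores the include and exclude settings and any other filtering the real command may apply.

```python
from pathlib import Path


def discover_data_products(root):
    """Return sorted directories at or below root that contain dataproduct.yaml.

    Illustrative approximation of the discovery rule described above.
    """
    return sorted(
        manifest.parent
        for manifest in Path(root).rglob("dataproduct.yaml")
        if manifest.is_file()
    )
```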
Repository defaults can live in dataproduct-kit.toml:
[ci]
include = ["data-products/**"]
exclude = ["data-products/sandbox/**"]
profile = "production"
fail_on = "warn"
You can also use the bundled GitHub Action:
- uses: johnmikel/dataproduct-kit@v0.4.0
  with:
    path: "."
    profile: "production"
    fail-on: "warn"
    format: "github"
    sarif: "dataproduct-kit.sarif.json"
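The SARIF file written by the CI command or the action can be summarized in a later job step. The sketch below relies only on the generic SARIF 2.1.0 layout (a top-level runs array, each with a results array); which levels and rule IDs dataproduct-kit actually emits is not documented here, so treat the field handling as an assumption.

```python
import json
from collections import Counter


def load_sarif(path):
    """Load a SARIF document from disk as a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_sarif_levels(sarif):
    """Count results per level across all runs in a SARIF document.

    A result without an explicit level is counted as "warning", a
    simplification of SARIF's defaulting rules.
    """
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("level", "warning")] += 1
    return dict(counts)
```

For instance, `count_sarif_levels(load_sarif("dataproduct-kit.sarif.json"))` gives a per-level tally that a pipeline could log or compare against a threshold.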
Reports and agent context
dataproduct-kit report demo --format json
dataproduct-kit report demo --format markdown
dataproduct-kit context demo --metric churn_rate --format json
The context command returns metric definition, freshness, policy, and lineage metadata. It deliberately does not answer business questions or generate SQL.
Example context fields:
{
  "metric": {
    "name": "churn_rate",
    "dataset": "subscriptions",
    "grain": "month"
  },
  "quality_status": "pass"
}
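An agent or orchestration layer might check this payload before using a metric. The following is a minimal sketch that reads only the fields shown in the example above; the function name is hypothetical, and requiring quality_status to equal "pass" is a policy choice made for illustration, not something the tool mandates.

```python
import json


def metric_is_usable(context_json, expected_metric):
    """Return True if the context describes expected_metric and quality passed.

    Reads only the fields shown in the example above: metric.name and
    quality_status. The full payload may contain more.
    """
    context = json.loads(context_json)
    metric = context.get("metric", {})
    return (
        metric.get("name") == expected_metric
        and context.get("quality_status") == "pass"
    )
```

In practice the JSON string would come from the standard output of dataproduct-kit context demo --metric churn_rate --format json.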
Standards outputs
dataproduct-kit export odcs demo
dataproduct-kit export osi demo
dataproduct-kit emit openlineage demo
Exports are generated from the local manifest profile and aligned with existing standards:
- ODCS-compatible data contract JSON.
- OSI-inspired semantic model JSON.
- OpenLineage-compatible validation event JSONL.
Use --out to write standards exports to files:
dataproduct-kit export odcs demo --out contract.json
dataproduct-kit export osi demo --out semantic.json
What this catches
The validator returns fail for issues such as:
- Missing required columns.
- Values that cannot cast to the declared contract type.
- Null or blank values in non-nullable fields.
- Failed quality checks such as uniqueness, accepted values, and row count.
- Stale data based on the dataset freshness SLA.
- Metrics that reference unknown dimensions or invalid expressions.
- Policy fields that reference columns not declared in the contract.
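To make the last item concrete, a policy-gap check boils down to comparing the columns named in the policy with the columns declared in the contract. The sketch below illustrates that idea with plain lists; the function name and input shapes are hypothetical and do not reflect how dataproduct-kit parses contract.yaml or policy.yaml.

```python
def undeclared_policy_fields(contract_fields, policy_fields):
    """Return policy field names that the contract does not declare.

    Hypothetical illustration: both arguments are plain iterables of
    column names, not the tool's real manifest objects.
    """
    declared = set(contract_fields)
    return sorted(name for name in policy_fields if name not in declared)


if __name__ == "__main__":
    gaps = undeclared_policy_fields(
        contract_fields=["customer_id", "plan", "churned_at"],
        policy_fields=["email", "plan"],
    )
    print(gaps)  # "email" is named in the policy but missing from the contract
```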
Runnable examples live under examples/:
- examples/pass/saas-churn
- examples/fail/schema-drift
- examples/fail/stale-data
- examples/fail/broken-metric
- examples/fail/policy-gap
See docs/usage-scenarios.md for concrete usage scenarios. See docs/ci-adoption.md for pull request gate setup, docs/readiness-profiles.md for profile behavior, docs/from-csv.md for CSV onboarding, docs/json-output.md for the stable automation contract, docs/finding-codes.md for stable finding codes, and docs/suppressions.md for expiring exceptions. See docs/compatibility.md for supported automation surfaces and docs/ci-rollout.md for a staged production rollout. Maintainer release notes live in docs/publishing.md and docs/release-checklist.md.
Project status
This is v0.4-alpha. The local CLI and SaaS churn demo are usable, but the
manifest profile and standards exports may change before a stable release.
See ROADMAP.md for planned standards depth, ecosystem adapters, and agent/platform integrations.
Verify
.venv/bin/python -m pytest
.venv/bin/python -m ruff check .
.venv/bin/python -m pip check
./scripts/verify.sh