Data Contract eXtended — AI-native, platform-extensible data contracts: LLM enrichment (descriptions, tags, data quality), live import, and apply. Built on datacontract-cli.
Project description
Data Contract eXtended — AI-native, platform-extensible data contracts
Author data contracts with an LLM, sync them with your live platforms.
A lean, no-fork extension of datacontract-cli, built on the Open Data Contract Standard (ODCS).
What is dcx?
dcx (Data Contract eXtended) adds three things to the Open Data Contract Standard workflow that plain datacontract-cli doesn't do:
- AI authoring — use an LLM to enrich a contract with column descriptions, validation constraints, governance tags from your own catalog, and an executable data-quality suite.
- Live import — build a contract from a running system (its real columns, keys, comments, tags).
- Apply — push the contract's governance back to the platform (comments, tags, data-quality, and the table itself).
It's platform-extensible by design: each platform is a small importer / exporter / apply module that plugs into datacontract-cli's factories. Snowflake is the first end-to-end platform (import → enrich → apply), with Kafka import today and more platforms built to slot in the same way.
The pipeline is: import a live schema into an ODCS contract → enrich it (columns · tags · quality) → apply it back to the platform, or export it to SQL / docs / schemas. Everything is available both as a CLI and as a REST API (dcx api).
Why dcx?
- 🧠 AI authoring that's safe to ship. Forced tool-calling,
temperature=0, and strict server-side validation against the ODCS schema — the model can only produce spec-valid output, never free-form guesses. - 🏷️ A tag manager, not a tag guesser. You define a controlled tag catalog (names, allowed values, examples); the LLM classifies columns into your vocabulary, with optional defaults.
- ✅ Executable, portable data quality. Quality rules prefer ODCS
librarymetrics (portable, mappable to platform-native checks) and fall back to portablesqlchecks — across all seven ODCS dimensions. - 🔌 Any LLM provider. Powered by litellm — Anthropic, OpenAI, Azure, Bedrock, Gemini, Ollama, … behind one
--modelflag. - 🧩 Pluggable platforms, no fork. You keep all 30+ upstream importers/exporters and
lint/test/changelog, and gain the AI + platform layer on top. - 🔐 Auth that makes sense per surface. Live platform operations over the API use caller-supplied OAuth; secrets are never CLI flags.
Install
pip install datacontract-x
The import package and CLI are both dcx:
dcx --help
dcx info
From source (for development):
git clone https://github.com/MickaelBZH/data-contract-x.git
cd data-contract-x
pip install -e ".[dev]"
Requires Python 3.10–3.12. Installing pulls in
datacontract-cli,litellm, FastAPI, and the platform connectors automatically.
Quickstart
The full loop — import a live schema, enrich it with an LLM, sync it back. Snowflake here is the example platform.
# 1. Import an existing schema into a contract (real columns, PKs, comments, tags)
dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
# 2. Enrich with an LLM: descriptions + constraints + tags + data-quality tests
export ANTHROPIC_API_KEY=... # or OPENAI_API_KEY / AZURE_API_KEY / ...
dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.enriched.yaml
# 3. Preview exactly what will run — no connection needed
dcx apply snowflake contract.enriched.yaml --include-quality --dry-run
# 4. Apply it: creates the table if missing, governs it (comments + tags + DQ) if it exists
dcx apply snowflake contract.enriched.yaml --include-quality
Commands
Every command is dcx <command>, and most are mirrored to a REST endpoint when you run dcx api. Each section below lists the sub-commands, a CLI example, and the matching API call. Run dcx <command> --help for the full option list.
import — build a contract from a source
| Sub-command | Source |
|---|---|
dcx import snowflake |
A live Snowflake schema (columns, primary keys, comments, tags) |
dcx import kafka |
A Kafka topic's value schema (Confluent Schema Registry) |
dcx import <format> |
A file/document — sql, avro, dbml, glue, bigquery, unity, jsonschema, json, odcs, parquet, csv, protobuf, spark, iceberg, excel, dbt |
dcx import snowflake --database MY_DB --schema LOAD --authenticator externalbrowser --output contract.yaml
dcx import kafka --schema-registry https://sr:8081 --topic orders --output contract.yaml
dcx import sql --source schema.sql --dialect snowflake --output contract.yaml
API
POST /import/snowflake— live import, authenticated by the caller's Snowflake OAuth token (Authorization: Bearer <token>).POST /import/{format}— file-based importers; send the document inline assource_content.- (Kafka import is CLI-only.)
enrich — AI authoring with an LLM
| Sub-command | Adds |
|---|---|
dcx enrich columns |
Business descriptions, logicalTypeOptions constraints, required / unique flags |
dcx enrich tags |
Governance tags, classified against your tag catalog |
dcx enrich quality |
An executable data-quality suite across all ODCS dimensions |
dcx enrich all |
columns → tags → quality, in that order so each stage grounds the next |
Each sub-command is independent and idempotent (existing values are preserved unless you pass --overwrite). The provider key is read from the environment — there is no --api-key flag. Use --model for any litellm model and --base-url for a proxy / Azure / Ollama endpoint.
dcx enrich columns contract.yaml --output contract.enriched.yaml
dcx enrich tags contract.yaml --catalog tags_catalog.yaml --output contract.tagged.yaml
dcx enrich quality contract.yaml --model gpt-4o --output contract.dq.yaml
dcx enrich all contract.yaml --catalog tags_catalog.yaml --output contract.full.yaml
API (the LLM key comes from the server's environment)
POST /enrich/columns·POST /enrich/qualityPOST /enrich/tags·POST /enrich/all— take the tag catalog inline in the request body.
export — convert a contract to a target format
| Sub-command | Output |
|---|---|
dcx export snowflake-full |
A Snowflake setup script: DDL + tags + Data Metric Functions, in one file |
dcx export <format> |
Any upstream format — sql, jsonschema, html, markdown, mermaid, dbt-*, avro, protobuf, bigquery, spark, sqlalchemy, iceberg, sodacl, great-expectations, dbml, pydantic-model, odcs, rdf, go, excel, … |
snowflake-full shares apply's SQL-generation knobs, so it emits the exact same script apply --dry-run would: --ddl-mode auto\|always\|never (default auto → CREATE TABLE IF NOT EXISTS + govern), --structured-types, --comments, --include-tags, --include-quality, --create-tags, --tag-namespace DB.SCHEMA. (apply's --strict drift check has no export equivalent — it needs a live connection.)
dcx export snowflake-full contract.yaml --include-quality --create-tags --output setup.sql
dcx export snowflake-full contract.yaml --ddl-mode never --output govern.sql # alter-only
dcx export html contract.yaml --output contract.html
API
POST /export/{format}— includingPOST /export/snowflake-full. The response media type depends on the format (JSON / YAML / text / binary).
apply — push governance to a live platform
| Sub-command | Target |
|---|---|
dcx apply snowflake |
A live Snowflake account |
With the default --ddl-mode auto you don't need to know whether the table exists: missing tables are created (CREATE TABLE IF NOT EXISTS) and existing ones are governed — column/table comments, tags, and (with --include-quality) data-quality metrics. For existing tables, dcx also compares the live schema to the contract and reports drift as warnings — or, with --strict, an error that aborts before any change (the check uses DESCRIBE TABLE, so it needs no active warehouse).
| Option | Effect |
|---|---|
--ddl-mode auto|always|never |
create-if-missing-then-govern (default) · always CREATE TABLE · govern existing only |
--strict |
fail instead of warn on schema drift |
--structured-types |
typed nested OBJECT(...) / ARRAY(...) |
--include-quality · --create-tags · --tag-namespace |
data-metric functions · CREATE TAG IF NOT EXISTS · qualify tag refs |
--dry-run |
print the SQL without connecting |
dcx apply snowflake contract.yaml --dry-run # preview
dcx apply snowflake contract.yaml --include-quality # create-or-govern
API
POST /apply/snowflake— authenticated by the caller's Snowflake OAuth token. Supportsdry_run,ddl_mode,strict,structured_types, … and returns the executed SQL plus any driftwarnings.
target — bind a contract to a platform
dcx target <type> sets the contract's server block and resolves each column's physicalType for that platform. ~30 types: snowflake, bigquery, databricks, postgres, redshift, mysql, sqlserver, oracle, s3, kafka, trino, athena, glue, duckdb, local, …
dcx target snowflake contract.yaml --output contract.snowflake.yaml
API
POST /target/{type}— one route per supported platform type.
From datacontract-cli
These commands work unchanged — dcx <command> behaves exactly like datacontract <command>.
| Command | Sub-commands | Purpose | API |
|---|---|---|---|
dcx init |
— | Create an empty data contract | — |
dcx lint |
— | Validate a contract against the ODCS schema | POST /lint |
dcx test |
— | Run schema + data-quality tests against a configured server | POST /test |
dcx ci |
— | test for CI/CD — emits GitHub Actions annotations |
— |
dcx changelog |
— | Semantic changelog between two contract versions | POST /changelog |
dcx catalog |
— | Render an HTML catalog of many contracts | — |
dcx publish |
— | Publish a contract to Entropy Data | — |
dcx dbt |
sync |
Sync contracts into a dbt project | — |
api / info
dcx api --port 4242 # start the REST server (Swagger UI at /docs)
dcx info # show dcx + datacontract-cli versions (API: GET /info)
The tag catalog
dcx enrich tags does controlled-vocabulary tagging: instead of letting the model invent tags, you give it a catalog of allowed names and values, and it classifies each column into that vocabulary. The catalog is a small YAML (or JSON) file — the only extra input auto-tagging needs.
# tags_catalog.yaml
tags:
- name: DATA_CLASSIFICATION # the tag name (becomes the platform TAG name)
description: > # tells the model what this tag is for
Data sensitivity level. Assign exactly one — the highest level that applies.
multiple: false # false = at most one value per column; true = many
values:
- value: PUBLIC # the model may only pick from these values
description: Non-sensitive data that can be shared freely.
examples: [country_code, currency, language, product_category] # guide classification
- value: INTERNAL
description: Internal business data, not for public release. The default.
default: true # assigned when the model picks nothing else
examples: [order_id, status, created_at, loyalty_points]
- value: CONFIDENTIAL
description: Personal data or sensitive business data; need-to-know access.
examples: [full_name, email, phone, home_address, date_of_birth]
- value: RESTRICTED
description: Highly sensitive data under legal/regulatory controls (financial, health, credentials, IDs).
examples: [national_id, passport_number, iban, credit_card_number, health_status]
- name: DATA_DOMAIN # you can define several tags
description: The business domain that owns the column.
multiple: false
values:
- value: CUSTOMER
examples: [customer_id, email, loyalty_points]
- value: FINANCE
examples: [amount, currency, invoice_id, iban]
| Field | Meaning |
|---|---|
name |
Tag name. Required. Becomes the tag key everywhere downstream. |
description |
What the tag means — given to the model as classification guidance. |
multiple |
false (default): at most one value per column. true: a column may carry several. |
values[].value |
An allowed value. The model may only assign values listed here — anything else is dropped. |
values[].description |
What the value means — strongly improves accuracy. |
values[].examples |
Example column names that fit this value — the model's strongest signal. |
values[].default |
If true, assigned to columns the model leaves unclassified for this tag. At most one per tag. |
Assigned tags are written on each column as NAME=VALUE (e.g. DATA_CLASSIFICATION=CONFIDENTIAL) — the convention export snowflake-full and apply snowflake consume. A worked catalog and example contracts live in examples/.
REST API
dcx api --port 4242 # Swagger UI at http://127.0.0.1:4242/docs
Every command above is mirrored to an endpoint, with request and response schemas in the OpenAPI spec. Auth model:
- Live platform operations (
/import/snowflake,/apply/snowflake) act as the caller — the OAuth bearer token comes from theAuthorizationheader, so the server never uses ambient credentials for someone else's data. - Enrichment (
/enrich/*) uses the server's LLM key (from the environment). Put service-level auth/quota in front of it before exposing it publicly. - The CLI never takes secrets as flags — platform secrets come from env vars or the platform's own config; LLM keys from the provider's standard env var.
How it fits with datacontract-cli
dcx is a separate package that depends on datacontract-cli as a library — no fork. It registers new importers (snowflake, kafka) and the snowflake-full exporter into the upstream factories, adds target / enrich / apply sub-apps and live-import commands to the upstream Typer app, and mirrors every command to FastAPI for dcx api. So you keep all of upstream's importers, exporters, lint, test, and changelog, and gain the AI + platform layer on top.
Development
pip install -e ".[dev]"
pytest # 211 tests
ruff check dcx # lint
Tests never hit live services or real LLMs — platform connections, the Schema Registry, and every LLM call are mocked, so the suite stays fast and offline. See RELEASING.md for the PyPI release process.
Contributing
Issues and PRs welcome. Please run pytest and ruff check dcx before opening a PR, and add tests for new behavior.
License
MIT © MickaelBZH.
Built on datacontract-cli · Open Data Contract Standard · litellm
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file datacontract_x-0.1.2.tar.gz.
File metadata
- Download URL: datacontract_x-0.1.2.tar.gz
- Upload date:
- Size: 96.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bab5780343b94565420f311140f80079072d3502b1ebc707707f52ba18d47f60
|
|
| MD5 |
6086d2f293756559067856ea7ad3f947
|
|
| BLAKE2b-256 |
50379afb80ba49a527e2ab76c7fd99a60889dc494b2f16e5a52d7e14ee9bddc2
|
Provenance
The following attestation bundles were made for datacontract_x-0.1.2.tar.gz:
Publisher:
release.yml on MickaelBZH/data-contract-x
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
datacontract_x-0.1.2.tar.gz -
Subject digest:
bab5780343b94565420f311140f80079072d3502b1ebc707707f52ba18d47f60 - Sigstore transparency entry: 1784228584
- Sigstore integration time:
-
Permalink:
MickaelBZH/data-contract-x@d3ddd8809e0c92708288161d19c3e35adcd9a395 -
Branch / Tag:
refs/tags/v0.1.2 - Owner: https://github.com/MickaelBZH
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d3ddd8809e0c92708288161d19c3e35adcd9a395 -
Trigger Event:
release
-
Statement type:
File details
Details for the file datacontract_x-0.1.2-py3-none-any.whl.
File metadata
- Download URL: datacontract_x-0.1.2-py3-none-any.whl
- Upload date:
- Size: 84.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e4c66a7df8eeea0bf2812bf9dc7599fe119064e6ed9c7781a29b45436a49e147
|
|
| MD5 |
828e0363176829b164ce615902d7049b
|
|
| BLAKE2b-256 |
2b2442e77dac0931f48c71366c027c68aa2b71e6a133438fcd8d292c3ecae190
|
Provenance
The following attestation bundles were made for datacontract_x-0.1.2-py3-none-any.whl:
Publisher:
release.yml on MickaelBZH/data-contract-x
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
datacontract_x-0.1.2-py3-none-any.whl -
Subject digest:
e4c66a7df8eeea0bf2812bf9dc7599fe119064e6ed9c7781a29b45436a49e147 - Sigstore transparency entry: 1784228773
- Sigstore integration time:
-
Permalink:
MickaelBZH/data-contract-x@d3ddd8809e0c92708288161d19c3e35adcd9a395 -
Branch / Tag:
refs/tags/v0.1.2 - Owner: https://github.com/MickaelBZH
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d3ddd8809e0c92708288161d19c3e35adcd9a395 -
Trigger Event:
release
-
Statement type: