Add a verbatim regulatory citation layer (GDPR / HIPAA / EHDS / CRA) to open-data-contract-standard (ODCS) contracts.
data-contract-cite
Add a verbatim regulatory citation layer to your open-data-contract-standard (ODCS) contracts.
data-contract-cite reads an ODCS
contract and, for every field, emits a manifest that binds the field to the
specific article(s) of GDPR, HIPAA, EHDS and the EU Cyber Resilience Act (CRA)
that govern it — quoted verbatim, with article ID, official source URL, and a
SHA-256 chain so an auditor can prove the manifest has not drifted.
pip install data-contract-cite
dc-cite annotate \
  --contract contract.yaml \
  --regimes gdpr,hipaa \
  --output annotated/
That is the whole product. If you have an ODCS YAML/JSON contract and you ship
data in a regulated industry, the output of dc-cite is the artifact a DPO,
compliance officer or external auditor wants to see.
Why this exists
Data engineers in healthcare, fintech and EU companies are converging on data contracts (ODCS, dbt contracts, OpenMetadata schemas, Soda checks) as the source of truth for what a dataset is, who owns it, and what quality it guarantees. The Bitol Foundation's open-data-contract-standard is rapidly becoming the de facto YAML/JSON schema for this.
What ODCS does not do — and the broader data-quality tooling (Soda, Great Expectations, Monte Carlo, OpenMetadata) also does not do — is tell you which regulation governs each field. In a regulated industry that gap is the entire conversation with your auditor:
"You have patient.medical_record_number in this contract: under what regulation is it personal data, and where does that regulation say so?"
Today the answer lives in a separate Confluence page maintained by compliance, drifts away from the contract within weeks, and is the first thing a regulator picks apart.
data-contract-cite closes that gap by treating regulatory citations as a
first-class artifact generated from the contract itself, with the same
review/CI discipline as the contract.
What it does
Given an ODCS contract like:
# contract.yaml (ODCS v3)
apiVersion: v3.0.0
kind: DataContract
id: patient-records-v1
schema:
  - name: patient
    properties:
      - name: email
        logicalType: string
        physicalType: varchar(255)
        tags: [pii, contact]
      - name: medical_record_number
        logicalType: string
        physicalType: varchar(64)
        tags: [phi, identifier]
      - name: diagnosis_code
        logicalType: string
        physicalType: varchar(16)
        tags: [phi, health-data]
dc-cite annotate --contract contract.yaml --regimes gdpr,hipaa writes:
- annotated/patient-records-v1.manifest.yaml: per-field citation manifest
- annotated/patient-records-v1.manifest.sha256: content hash chain
- annotated/patient-records-v1.contract.yaml: original contract, untouched
A snippet of the manifest:
contract_id: patient-records-v1
generator: data-contract-cite/0.1.0
fields:
  - path: patient.email
    citations:
      - regime: GDPR
        article: "Art. 4(1)"
        url: https://eur-lex.europa.eu/eli/reg/2016/679/oj
        verbatim: |
          'personal data' means any information relating to an identified
          or identifiable natural person ('data subject') ...
  - path: patient.medical_record_number
    citations:
      - regime: GDPR
        article: "Art. 4(1)"
        ...
      - regime: HIPAA
        article: "45 CFR §164.514(b)(2)(i)(H)"
        url: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164
        verbatim: |
          Medical record numbers ... must be removed for de-identification
          under the Safe Harbor method.
  - path: patient.diagnosis_code
    citations:
      - regime: GDPR
        article: "Art. 9(1)"
        ...
        verbatim: |
          Processing of personal data revealing ... data concerning health
          ... shall be prohibited.
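To make the field-to-citation binding concrete, here is a minimal Python sketch of how dotted field paths and matching citations could be derived from the example contract. It is an illustration under stated assumptions, not the tool's implementation: the `RULES` table, its tag-subset matching, and the helper names `iter_fields` and `annotate` are hypothetical, and the real rule schema in citation_rules.yaml may differ. The contract is supplied as an already-parsed dict so the example needs no third-party packages.

```python
from typing import Iterator

# Hypothetical rules, chosen only so the output mirrors the manifest snippet.
# The real rules live in citation_rules.yaml, whose schema may differ.
RULES = [
    {"tags": {"pii"}, "regime": "GDPR", "article": "Art. 4(1)"},
    {"tags": {"phi", "identifier"}, "regime": "GDPR", "article": "Art. 4(1)"},
    {"tags": {"phi", "identifier"}, "regime": "HIPAA",
     "article": "45 CFR §164.514(b)(2)(i)(H)"},
    {"tags": {"health-data"}, "regime": "GDPR", "article": "Art. 9(1)"},
]

# The example contract, already parsed from YAML into plain Python objects.
CONTRACT = {
    "id": "patient-records-v1",
    "schema": [{
        "name": "patient",
        "properties": [
            {"name": "email", "tags": ["pii", "contact"]},
            {"name": "medical_record_number", "tags": ["phi", "identifier"]},
            {"name": "diagnosis_code", "tags": ["phi", "health-data"]},
        ],
    }],
}


def iter_fields(contract: dict) -> Iterator[tuple[str, set[str]]]:
    """Yield (dotted path, tag set) for every property in the contract."""
    for obj in contract.get("schema", []):
        for prop in obj.get("properties", []):
            yield f"{obj['name']}.{prop['name']}", set(prop.get("tags", []))


def annotate(contract: dict, regimes: set[str]) -> dict:
    """Build a manifest-like dict with the citations that match each field."""
    fields = []
    for path, tags in iter_fields(contract):
        citations = [
            {"regime": rule["regime"], "article": rule["article"]}
            for rule in RULES
            # A rule applies when all of its tags are present on the field
            # and its regime was requested.
            if rule["tags"] <= tags and rule["regime"].lower() in regimes
        ]
        fields.append({"path": path, "citations": citations})
    return {"contract_id": contract["id"], "fields": fields}


manifest = annotate(CONTRACT, {"gdpr", "hipaa"})
for field in manifest["fields"]:
    print(field["path"], [c["article"] for c in field["citations"]])
```

Restricting the requested regimes, for example `annotate(CONTRACT, {"gdpr"})`, drops the HIPAA citation from patient.medical_record_number in this sketch, which is the behaviour one would expect from the `--regimes` flag.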
Three CLI subcommands:
| Command | Purpose |
|---|---|
| dc-cite annotate | Generate per-contract citation manifest. |
| dc-cite validate | Verify a manifest's SHA chain still matches its source contract. |
| dc-cite manifest | Dump the bundled citation-rules database (provenance audit). |
Differentiator
| Tool | What it covers | Citation per field? |
|---|---|---|
| bitol-io/open-data-contract-standard | Schema, quality, SLAs, ownership | No |
| OpenMetadata | Catalog, lineage, classification tags | No (free-text PII tag) |
| Soda / Great Expectations | Runtime quality checks | No |
| Monte Carlo / Bigeye | Observability, anomaly detection | No |
| OneTrust / Collibra | Enterprise GRC, policy mapping | Yes, but per-table and €€€ |
| data-contract-cite | Per-field verbatim citation against GDPR / HIPAA / EHDS / CRA | Yes |
The OSS data-contract / catalog ecosystem is schema- and quality-focused. The commercial GRC ecosystem cites regulation but at policy-document granularity and at enterprise pricing. Nothing in the middle gives a data engineer a free, MIT-licensed, CI-runnable tool that binds each contract field to the verbatim regulatory clause that makes that field a regulated entity.
Install
pip install data-contract-cite
dc-cite --version
Requires Python ≥ 3.10. Pure Python — pyyaml + pydantic only.
Pricing
- CLI / library: MIT-licensed, free, forever.
- Hosted CI (planned): dc-cite-ci GitHub App that runs on every PR touching a contract, comments the diff of regulatory citations and blocks the merge if a field acquires a new regime without sign-off. €19/mo per repo, €49/mo for organizations. Stripe billing.
The free CLI is the whole product. Hosted CI exists for teams that don't want to wire it into their own pipeline.
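For teams wiring the same gate into their own pipeline, the check described for the hosted app (flag any field that acquires a new regime) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the planned app's logic: the function names, the `approved` set of signed-off (path, regime) pairs, and the `patient.dob` field in the usage example are all hypothetical. Manifests are passed as parsed dicts shaped like the snippet shown earlier.

```python
def regimes_by_field(manifest: dict) -> dict[str, set[str]]:
    """Map each field path to the set of regimes cited for it."""
    return {
        field["path"]: {c["regime"] for c in field.get("citations", [])}
        for field in manifest.get("fields", [])
    }


def new_regimes(base: dict, head: dict) -> dict[str, set[str]]:
    """Return, per field, the regimes present in head but not in base."""
    before = regimes_by_field(base)
    gained = {}
    for path, regimes in regimes_by_field(head).items():
        added = regimes - before.get(path, set())
        if added:
            gained[path] = added
    return gained


def unapproved(gained: dict[str, set[str]],
               approved: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """List (path, regime) pairs that lack sign-off, sorted for stable output."""
    pairs = {(path, regime) for path, regimes in gained.items() for regime in regimes}
    return sorted(pairs - approved)


# Hypothetical manifests from the base branch and the PR head.
base = {"fields": [
    {"path": "patient.email", "citations": [{"regime": "GDPR"}]},
]}
head = {"fields": [
    {"path": "patient.email", "citations": [{"regime": "GDPR"}, {"regime": "HIPAA"}]},
    {"path": "patient.dob", "citations": [{"regime": "GDPR"}]},
]}

gained = new_regimes(base, head)
blocking = unapproved(gained, approved={("patient.dob", "GDPR")})
print(blocking)  # [('patient.email', 'HIPAA')]
```

A CI step could fail the build whenever `blocking` is non-empty; how sign-off is recorded is left open here.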
Regimes covered (v0.1)
| Regime | Source | Coverage |
|---|---|---|
| GDPR | Regulation (EU) 2016/679 | Art. 4(1), 4(13–15), 9(1), 32, 33, 35 |
| HIPAA | 45 CFR §164 | §164.514(b)(2) Safe Harbor identifiers, §164.312 technical safeguards |
| EHDS | Regulation (EU) 2025/327 | Arts. on electronic health data (placeholder, expands as final text stabilises) |
| CRA | Regulation (EU) 2024/2847 | Annex I essential cybersecurity requirements for products with digital elements |
The rule database is src/data_contract_cite/data/citation_rules.yaml — open
for PRs that add jurisdictions (UK GDPR, LGPD, CCPA, PIPEDA, APPI).
Audit chain
Every manifest carries:
- generator: tool name and version that produced the manifest
- source_sha256: SHA-256 of the source ODCS contract
- rules_sha256: SHA-256 of citation_rules.yaml at generation time
- manifest_sha256: SHA-256 of the manifest body (excluding this field)
dc-cite validate manifest.yaml --contract contract.yaml recomputes the
three hashes (source_sha256, rules_sha256, manifest_sha256) and exits 0 only
if every hash matches. That single command is what your CI runs to prove the
manifest is current.
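The hash chain can be illustrated with a short Python sketch using only the standard library. It shows one way the three hashes could be computed and re-checked; it is not the tool's code. In particular, the canonical serialisation (sorted-key JSON) and the function names `seal` and `validate` are assumptions, since the source does not say how the manifest body is serialised before hashing.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def canonical(body: dict) -> bytes:
    """Assumed canonical form: compact, sorted-key JSON encoded as UTF-8."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(manifest: dict, contract_bytes: bytes, rules_bytes: bytes) -> dict:
    """Return a copy of the manifest with the three hash fields filled in."""
    sealed = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    sealed["source_sha256"] = sha256_hex(contract_bytes)
    sealed["rules_sha256"] = sha256_hex(rules_bytes)
    # The manifest hash covers every field except itself.
    sealed["manifest_sha256"] = sha256_hex(canonical(sealed))
    return sealed


def validate(manifest: dict, contract_bytes: bytes, rules_bytes: bytes) -> bool:
    """Recompute all three hashes and report whether every one matches."""
    body = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    return (
        manifest.get("source_sha256") == sha256_hex(contract_bytes)
        and manifest.get("rules_sha256") == sha256_hex(rules_bytes)
        and manifest.get("manifest_sha256") == sha256_hex(canonical(body))
    )


# Stand-in file contents; real inputs would be read from disk.
contract = b"id: patient-records-v1\n"
rules = b"rules: []\n"
manifest = seal(
    {"contract_id": "patient-records-v1",
     "generator": "data-contract-cite/0.1.0",
     "fields": []},
    contract,
    rules,
)

print(validate(manifest, contract, rules))                 # True
print(validate(manifest, contract + b"# edited\n", rules))  # False
```

Hashing the raw contract and rules bytes, rather than their parsed form, means that even a whitespace-only edit invalidates the manifest in this sketch; whether the tool is that strict is not stated in the source.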
Status
Alpha. v0.1 ships with ~15 rules covering the most common
PII/PHI/health-data field patterns. The bottleneck on coverage is regulatory
review of new rules, not engineering — PRs against citation_rules.yaml
that include the verbatim clause + source URL are the fastest path to
broader regime coverage.
This project is independent of and not endorsed by the Bitol Foundation.
License
MIT. See LICENSE.
File details
Details for the file data_contract_cite-0.1.0.tar.gz.
File metadata
- Download URL: data_contract_cite-0.1.0.tar.gz
- Upload date:
- Size: 25.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df1c87cfcb43cb97f7f69e912aceba4ecfcb58889dbc2dd289b08c3b1fa42bb2 |
| MD5 | 78bfb0364027ea657d33a7716c0f3d25 |
| BLAKE2b-256 | 531fce4dad0162de5a6aa791b8eb9905a3829dbbdf7ece407438773683573fa0 |
File details
Details for the file data_contract_cite-0.1.0-py3-none-any.whl.
File metadata
- Download URL: data_contract_cite-0.1.0-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc427370606f78c08c178078a1ba4506ef264156d59b48f4b811f4766cd59bac |
| MD5 | 3b1c81c3d0414feee69666058ff5f8f1 |
| BLAKE2b-256 | f8e7b8d9b8c86c319189167b6052af3463432eedd5aee387dd29ef1ad67e7293 |