Add a verbatim regulatory citation layer (GDPR / HIPAA / EHDS / CRA) to open-data-contract-standard (ODCS) contracts.

data-contract-cite

Add a verbatim regulatory citation layer to your open-data-contract-standard (ODCS) contracts.

data-contract-cite reads an ODCS contract and, for every field, emits a manifest that binds the field to the specific article(s) of GDPR, HIPAA, EHDS and the EU Cyber Resilience Act (CRA) that govern it — quoted verbatim, with article ID, official source URL, and a SHA-256 chain so an auditor can prove the manifest has not drifted.

pip install data-contract-cite

dc-cite annotate \
  --contract contract.yaml \
  --regimes gdpr,hipaa \
  --output annotated/

That is the whole product. If you have an ODCS YAML/JSON contract and you ship data in a regulated industry, the output of dc-cite is the artifact a DPO, compliance officer or external auditor wants to see.


Why this exists

Data engineers in healthcare, fintech and EU companies are converging on data contracts (ODCS, dbt contracts, OpenMetadata schemas, Soda checks) as the source of truth for what a dataset is, who owns it, and what quality it guarantees. The Bitol Foundation's open-data-contract-standard is rapidly becoming the de facto YAML/JSON schema for this.

What ODCS does not do — and the broader data-quality tooling (Soda, Great Expectations, Monte Carlo, OpenMetadata) also does not do — is tell you which regulation governs each field. In a regulated industry that gap is the entire conversation with your auditor:

"You have patient.medical_record_number in this contract — under what regulation is it personal data, and where does that regulation say so?"

Today the answer lives in a separate Confluence page maintained by compliance, drifts away from the contract within weeks, and is the first thing a regulator picks apart.

data-contract-cite closes that gap by treating regulatory citations as a first-class artifact generated from the contract itself, with the same review/CI discipline as the contract.


What it does

Given an ODCS contract like:

# contract.yaml (ODCS v3)
apiVersion: v3.0.0
kind: DataContract
id: patient-records-v1
schema:
  - name: patient
    properties:
      - name: email
        logicalType: string
        physicalType: varchar(255)
        tags: [pii, contact]
      - name: medical_record_number
        logicalType: string
        physicalType: varchar(64)
        tags: [phi, identifier]
      - name: diagnosis_code
        logicalType: string
        physicalType: varchar(16)
        tags: [phi, health-data]
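
As a rough illustration of what "for every field" means for a contract shaped like the one above, the sketch below walks the schema and yields a dotted path plus tags for each property. It is not the tool's actual code: `iter_field_paths` is a hypothetical helper, and the contract is written as JSON (ODCS contracts may be YAML or JSON) so that only the standard library is needed.

```python
import json

# Same structure as contract.yaml above, trimmed to names and tags.
CONTRACT_JSON = """
{
  "apiVersion": "v3.0.0",
  "kind": "DataContract",
  "id": "patient-records-v1",
  "schema": [
    {
      "name": "patient",
      "properties": [
        {"name": "email", "tags": ["pii", "contact"]},
        {"name": "medical_record_number", "tags": ["phi", "identifier"]},
        {"name": "diagnosis_code", "tags": ["phi", "health-data"]}
      ]
    }
  ]
}
"""


def iter_field_paths(contract: dict):
    """Yield (dotted_path, tags) for every property of every schema object.

    Hypothetical helper for illustration; nested properties are not handled.
    """
    for obj in contract.get("schema", []):
        for prop in obj.get("properties", []):
            yield f"{obj['name']}.{prop['name']}", list(prop.get("tags", []))


contract = json.loads(CONTRACT_JSON)
fields = list(iter_field_paths(contract))
for path, tags in fields:
    print(path, tags)
```

The dotted paths produced here (`patient.email` and so on) have the same form as the `path` values in the manifest.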

Running dc-cite annotate --contract contract.yaml --regimes gdpr,hipaa --output annotated/ writes:

  • annotated/patient-records-v1.manifest.yaml — per-field citation manifest
  • annotated/patient-records-v1.manifest.sha256 — content hash chain
  • annotated/patient-records-v1.contract.yaml — original contract, untouched

A snippet of the manifest:

contract_id: patient-records-v1
generator: data-contract-cite/0.1.0
fields:
  - path: patient.email
    citations:
      - regime: GDPR
        article: "Art. 4(1)"
        url: https://eur-lex.europa.eu/eli/reg/2016/679/oj
        verbatim: |
          'personal data' means any information relating to an identified
          or identifiable natural person ('data subject') ...
  - path: patient.medical_record_number
    citations:
      - regime: GDPR
        article: "Art. 4(1)"
        ...
      - regime: HIPAA
        article: "45 CFR §164.514(b)(2)(i)(H)"
        url: https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164
        verbatim: |
          Medical record numbers ... must be removed for de-identification
          under the Safe Harbor method.
  - path: patient.diagnosis_code
    citations:
      - regime: GDPR
        article: "Art. 9(1)"
        ...
        verbatim: |
          Processing of personal data revealing ... data concerning health
          ... shall be prohibited.
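
One way to picture how tags could map to citations like those in the snippet is a small rule table keyed by required tags. This is a minimal sketch under stated assumptions: the `Rule` class, the `RULES` entries and the `cite` function are hypothetical and do not reproduce the schema of the bundled citation_rules.yaml. The verbatim text is omitted for brevity.

```python
from dataclasses import dataclass

GDPR_URL = "https://eur-lex.europa.eu/eli/reg/2016/679/oj"
HIPAA_URL = "https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164"


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: applies when a field carries all of `tags`."""
    tags: frozenset
    regime: str
    article: str
    url: str


# Illustrative rules only, chosen to mirror the manifest snippet above.
RULES = [
    Rule(frozenset({"pii"}), "GDPR", "Art. 4(1)", GDPR_URL),
    Rule(frozenset({"identifier"}), "GDPR", "Art. 4(1)", GDPR_URL),
    Rule(frozenset({"phi", "identifier"}), "HIPAA",
         "45 CFR §164.514(b)(2)(i)(H)", HIPAA_URL),
    Rule(frozenset({"health-data"}), "GDPR", "Art. 9(1)", GDPR_URL),
]


def cite(field_tags, regimes):
    """Return (regime, article) pairs for every rule whose tags the field has."""
    wanted = {r.upper() for r in regimes}  # accept "gdpr,hipaa"-style names
    have = set(field_tags)
    return [
        (rule.regime, rule.article)
        for rule in RULES
        if rule.regime in wanted and rule.tags <= have
    ]


print(cite(["pii", "contact"], ["gdpr", "hipaa"]))
print(cite(["phi", "identifier"], ["gdpr", "hipaa"]))
print(cite(["phi", "health-data"], ["gdpr", "hipaa"]))
```

Requiring a subset of tags rather than a single tag lets a rule stay narrow: the HIPAA entry here only fires for fields tagged both phi and identifier.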

Three CLI subcommands:

| Command | Purpose |
| --- | --- |
| dc-cite annotate | Generate per-contract citation manifest. |
| dc-cite validate | Verify a manifest's SHA chain still matches its source contract. |
| dc-cite manifest | Dump the bundled citation-rules database (provenance audit). |

Differentiator

| Tool | What it covers | Citation per field? |
| --- | --- | --- |
| bitol-io/open-data-contract-standard | Schema, quality, SLAs, ownership | No |
| OpenMetadata | Catalog, lineage, classification tags | No (free-text PII tag) |
| Soda / Great Expectations | Runtime quality checks | No |
| Monte Carlo / Bigeye | Observability, anomaly detection | No |
| OneTrust / Collibra | Enterprise GRC, policy mapping | Yes, but per-table and €€€ |
| data-contract-cite | Per-field verbatim citation against GDPR / HIPAA / EHDS / CRA | Yes |

The OSS data-contract / catalog ecosystem is schema- and quality-focused. The commercial GRC ecosystem cites regulation but at policy-document granularity and at enterprise pricing. Nothing in the middle gives a data engineer a free, MIT-licensed, CI-runnable tool that binds each contract field to the verbatim regulatory clause that makes that field a regulated entity.


Install

pip install data-contract-cite
dc-cite --version

Requires Python ≥ 3.10. Pure Python — pyyaml + pydantic only.


Pricing

  • CLI / library: MIT-licensed, free, forever.
  • Hosted CI (planned): dc-cite-ci GitHub App that runs on every PR touching a contract, comments the diff of regulatory citations and blocks the merge if a field acquires a new regime without sign-off. €19/mo per repo, €49/mo for organizations. Stripe billing.

The free CLI is the whole product. Hosted CI exists for teams that don't want to wire it into their own pipeline.
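
For teams wiring the check into their own pipeline, the "field acquires a new regime" test can be approximated by comparing two manifests. The sketch below assumes only the `fields` / `path` / `citations` / `regime` keys visible in the manifest snippet; `regimes_by_field` and `new_regimes` are hypothetical helpers, not part of the package or the planned GitHub App.

```python
def regimes_by_field(manifest: dict) -> dict[str, set[str]]:
    """Map each field path to the set of regimes cited for it."""
    return {
        field["path"]: {c["regime"] for c in field.get("citations", [])}
        for field in manifest.get("fields", [])
    }


def new_regimes(old: dict, new: dict) -> dict[str, set[str]]:
    """Return, per field, regimes present in `new` but absent from `old`."""
    before = regimes_by_field(old)
    after = regimes_by_field(new)
    added = {}
    for path, regimes in after.items():
        extra = regimes - before.get(path, set())
        if extra:
            added[path] = extra
    return added


old_manifest = {"fields": [
    {"path": "patient.email", "citations": [{"regime": "GDPR"}]},
]}
new_manifest = {"fields": [
    {"path": "patient.email",
     "citations": [{"regime": "GDPR"}, {"regime": "HIPAA"}]},
    {"path": "patient.diagnosis_code", "citations": [{"regime": "GDPR"}]},
]}

added = new_regimes(old_manifest, new_manifest)
print(added)  # a non-empty result is what a CI job would treat as "needs sign-off"
```

A newly added field counts as acquiring every regime it cites, which is usually the behaviour a reviewer wants.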


Regimes covered (v0.1)

| Regime | Source | Coverage |
| --- | --- | --- |
| GDPR | Regulation (EU) 2016/679 | Art. 4(1), 4(13–15), 9(1), 32, 33, 35 |
| HIPAA | 45 CFR Part 164 | §164.514(b)(2) Safe Harbor identifiers, §164.312 technical safeguards |
| EHDS | Regulation (EU) 2025/327 | Arts. on electronic health data (placeholder, expands as final text stabilises) |
| CRA | Regulation (EU) 2024/2847 | Annex I essential cybersecurity requirements for products with digital elements |

The rule database is src/data_contract_cite/data/citation_rules.yaml — open for PRs that add jurisdictions (UK GDPR, LGPD, CCPA, PIPEDA, APPI).


Audit chain

Every manifest carries:

  • generator — tool name and version that produced the manifest
  • source_sha256 — SHA-256 of the source ODCS contract
  • rules_sha256 — SHA-256 of citation_rules.yaml at generation time
  • manifest_sha256 — SHA-256 of the manifest body (excluding this field)

dc-cite validate manifest.yaml --contract contract.yaml recomputes all three hashes (source_sha256, rules_sha256 and manifest_sha256) and exits 0 only if every one matches. That single command is what your CI runs to prove the manifest is current.
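
The three-hash check can be sketched with the standard library alone. This is an illustration of the idea, not the tool's implementation: the canonical serialization (sorted-key JSON) and the `seal` / `validate` function names are assumptions, and the real manifest is YAML.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def canonical(body: dict) -> bytes:
    # Assumption: hash a deterministic JSON rendering of the manifest body.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(body: dict, contract_bytes: bytes, rules_bytes: bytes) -> dict:
    """Attach the three hashes; manifest_sha256 excludes itself."""
    sealed = {k: v for k, v in body.items() if k != "manifest_sha256"}
    sealed["source_sha256"] = sha256_hex(contract_bytes)
    sealed["rules_sha256"] = sha256_hex(rules_bytes)
    sealed["manifest_sha256"] = sha256_hex(canonical(sealed))
    return sealed


def validate(manifest: dict, contract_bytes: bytes, rules_bytes: bytes) -> bool:
    """Recompute all three hashes and compare them with the stored values."""
    body = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    return (
        manifest.get("source_sha256") == sha256_hex(contract_bytes)
        and manifest.get("rules_sha256") == sha256_hex(rules_bytes)
        and manifest.get("manifest_sha256") == sha256_hex(canonical(body))
    )


contract_bytes = b"id: patient-records-v1\n"
rules_bytes = b"rules: []\n"
manifest = seal(
    {"contract_id": "patient-records-v1", "generator": "data-contract-cite/0.1.0"},
    contract_bytes,
    rules_bytes,
)

print(validate(manifest, contract_bytes, rules_bytes))                 # unchanged inputs
print(validate(manifest, contract_bytes + b"# edited\n", rules_bytes))  # contract drifted
```

Hashing the contract, the rules file and the manifest body separately means a failed check also tells you which of the three changed.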


Status

Alpha. v0.1 ships with ~15 rules covering the most common PII/PHI/health-data field patterns. The bottleneck on coverage is regulatory review of new rules, not engineering — PRs against citation_rules.yaml that include the verbatim clause + source URL are the fastest path to broader regime coverage.

This project is independent of and not endorsed by the Bitol Foundation.


License

MIT. See LICENSE.
