
OpenDQV Core — open-source, contract-driven data quality validation engine for data pipelines and API boundaries

Project description

OpenDQV — Open Data Quality Validation


Quickstart · Rules · Contracts · MCP · API · Security · FAQ

"Trust is easier to build than to repair." That is why OpenDQV exists. A 422 at the point of write is cheaper than a data incident three weeks later.

Beta (v2.x). The public API surface (REST, contract YAML, MCP tools, Python SDK) is stable. Breaking changes follow a one-release deprecation cycle. Security fixes are backported to the latest 2.x line. See API Stability for commitments.

OpenDQV is a write-time data validation service. Source systems call it before writing data. Bad records return a 422 with per-field errors. Good records pass through. No payload is stored.

OpenDQV demo — define a contract, send a bad record (get a 422), fix it (get a 200)

flowchart LR
    subgraph Callers
        direction TB
        SF[Salesforce]
        SAP[SAP]
        DYN[Dynamics]
        ORA[Oracle]
        WEB[Web forms]
        ETL1[ETL pipelines]

        DJ[Django clean]
        PY[Python scripts]
        PD[Pandas / ETL]

        CD[Claude Desktop]
        CUR[Cursor]
        LLM[LLM agents]
    end

    subgraph OpenDQV
        direction TB
        API[Validation API\nREST / batch]
        SDK[LocalValidator\nin-process SDK]
        MCP[MCP Server\nAI-native]
        API & SDK & MCP --> CON[Contracts · YAML\nGovernance · RBAC\nAudit trail]
        API & SDK & MCP --> GEN[Code Generator\nApex · JS · SQL]
    end

    subgraph Results
        direction TB
        R1[valid: true / false]
        R2[per-field errors]
        R3[severity levels]
        R4[webhooks on events]
    end

    SF & SAP & DYN & ORA & WEB & ETL1 --> API
    DJ & PY & PD --> SDK
    CD & CUR & LLM --> MCP

    API & SDK & MCP --> R1

    subgraph Importers
        IMP[dbt schema · GX suites\nSoda checks · ODCS · CSV]
    end
    IMP --> CON

    style API fill:#0d3b5e,stroke:#092a44,color:#fff
    style SDK fill:#0d3b5e,stroke:#092a44,color:#fff
    style MCP fill:#0d3b5e,stroke:#092a44,color:#fff
    style CON fill:#1a8aad,stroke:#14708d,color:#fff
    style GEN fill:#1a8aad,stroke:#14708d,color:#fff
    style R1 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R2 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R3 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style R4 fill:#2ec4e6,stroke:#1a8aad,color:#0d3b5e
    style IMP fill:#1a8aad,stroke:#14708d,color:#fff

A 422 at the point of write closes the feedback loop — producers see failures immediately and fix them at source. Rejection rates drop over time because the tool changes the incentive, not just the outcome.

For post-landing monitoring use Great Expectations, Soda, or dbt tests — they're complementary, not competing. OpenDQV owns layer one (write-time enforcement); those tools own layer three (post-ingestion observability).


AI Agents — first-class via MCP

OpenDQV ships a built-in Model Context Protocol server, so Claude Desktop, Cursor, and any other MCP-compatible agent can discover contracts, validate records, and explain failures through tool calls the agent explicitly declares — no hallucinated compliance, no invented rules.
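Registration works like any other MCP server. A hypothetical Claude Desktop entry, purely for orientation (the `opendqv mcp` command and the `opendqv` server key are assumptions; docs/mcp.md has the real invocation):

```json
{
  "mcpServers": {
    "opendqv": {
      "command": "opendqv",
      "args": ["mcp"]
    }
  }
}
```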

Watch the 4-minute MCP demo

4-minute demo: Claude Desktop uses two MCP servers — OpenDQV for validation, Marmot for catalog lineage — to check a menu item against ppds_menu_item for Natasha's Law allergen compliance, stating which tool calls it makes and why. (Backup: download the MP4 from the repo)

For tool reference, write guardrails, remote/enterprise mode, and the Marmot composition pattern, see docs/mcp.md.

Reserved agent_id prefix. The prefix OpenDQV_SA_ is reserved for OpenDQV-owned system traffic — smoke probes, demos, MCP self-tests, perf harnesses. The pattern is OpenDQV_SA_[Category]_[Scope] (e.g. OpenDQV_SA_smoke_v240, OpenDQV_SA_probe_persona_b). Customer-facing metrics endpoints (/api/v1/stats, /api/v1/agents, MCP get_quality_metrics, MCP list_agents) suppress these by default so tenant views stay clean of dev/test traffic. Pass include_system=true to surface them for diagnostics — each row carries an is_system_agent flag.
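The suppression logic is simple to mirror client-side; a sketch, assuming only the documented prefix convention:

```python
def is_system_agent(agent_id: str) -> bool:
    # OpenDQV_SA_ is the reserved prefix for OpenDQV-owned system traffic.
    return agent_id.startswith("OpenDQV_SA_")

def visible_agents(agents, include_system=False):
    # Mirrors the default metrics behaviour: suppress system agents unless asked.
    return [a for a in agents if include_system or not is_system_agent(a)]

agents = ["OpenDQV_SA_smoke_v240", "sf_prod_sync", "OpenDQV_SA_probe_persona_b"]
print(visible_agents(agents))                            # ['sf_prod_sync']
print(len(visible_agents(agents, include_system=True)))  # 3
```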


Install

I have... Command
Python 3.11+ git clone https://github.com/OpenDQV/OpenDQV.git && cd OpenDQV && bash install.sh
Docker git clone https://github.com/OpenDQV/OpenDQV.git && cd OpenDQV && cp .env.example .env && docker compose up -d
Just the SDK/CLI pip install opendqv then opendqv init to bootstrap contracts
None of the above Beginner setup guide →

install.sh creates a virtual environment, installs dependencies, and launches the onboarding wizard. Docker pulls ghcr.io/opendqv/opendqv:latest — no build step required.

⚠️ AUTH_MODE=open (the default) has no authentication. Set AUTH_MODE=token and a strong SECRET_KEY in .env before any non-local deployment. See SECURITY.md.
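A minimal hardened `.env` along those lines (the SECRET_KEY value is a placeholder; generate your own):

```shell
# .env — minimal non-local hardening (see SECURITY.md)
AUTH_MODE=token
# Generate a strong key, e.g.: python -c "import secrets; print(secrets.token_urlsafe(48))"
SECRET_KEY=replace-with-a-generated-value
```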


Your First Validation

1. Write a contract — drop a YAML file in your contracts directory (run opendqv init --all to copy the 43 bundled contracts, or opendqv init for a single starter):

contract:
  name: order
  version: "1.0"
  owner: "Data Governance"
  status: active
  rules:
    - name: valid_email
      type: regex
      field: email
      pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
      severity: error
      error_message: "Invalid email format"
    - name: amount_positive
      type: min
      field: amount
      min: 0.01
      severity: error
      error_message: "Order amount must be positive"
    - name: status_valid
      type: allowed_values
      field: status
      allowed_values: [pending, confirmed, shipped, cancelled]
      severity: error
      error_message: "Invalid order status"
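Before deploying, the email pattern above can be sanity-checked locally (same expression, written as a Python raw string so the YAML `\\s` escapes become `\s`):

```python
import re

# Same pattern as the valid_email rule in the order contract.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

assert EMAIL.match("alice@example.com")
assert not EMAIL.match("not-an-email")       # no @ or domain dot
assert not EMAIL.match("two@@example.com")   # consecutive @ rejected
assert not EMAIL.match("a b@example.com")    # whitespace rejected
print("email pattern OK")
```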

2. Reload contracts:

curl -X POST http://localhost:8000/api/v1/contracts/reload

3. Send a bad record — OpenDQV rejects it:

curl -s -X POST http://localhost:8000/api/v1/validate \
  -H "Content-Type: application/json" \
  -d '{"contract": "order", "record": {"email": "not-an-email", "amount": -5, "status": "unknown"}}'
{
  "valid": false,
  "errors": [
    {"field": "email",  "rule": "valid_email",    "message": "Invalid email format",        "severity": "error"},
    {"field": "amount", "rule": "amount_positive", "message": "Order amount must be positive", "severity": "error"},
    {"field": "status", "rule": "status_valid",    "message": "Invalid order status",        "severity": "error"}
  ],
  "contract": "order",
  "version": "1.0"
}

4. Fix the record — it passes:

curl -s -X POST http://localhost:8000/api/v1/validate \
  -H "Content-Type: application/json" \
  -d '{"contract": "order", "record": {"email": "alice@example.com", "amount": 49.99, "status": "pending"}}'
{"valid": true, "errors": [], "warnings": [], "contract": "order", "version": "1.0"}

The customer contract ships pre-seeded if you want to skip step 1. The quickstart guide walks through authoring, lifecycle, and batch validation.
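For intuition, the three rules above can be mimicked in a few lines of plain Python. This is an illustration of the per-field error shape, not the OpenDQV engine:

```python
import re

def validate_order(record):
    # Mirrors the order contract's three rules and the /api/v1/validate
    # error shape (illustration only).
    errors = []
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", str(record.get("email", ""))):
        errors.append({"field": "email", "rule": "valid_email",
                       "message": "Invalid email format", "severity": "error"})
    if not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0.01:
        errors.append({"field": "amount", "rule": "amount_positive",
                       "message": "Order amount must be positive", "severity": "error"})
    if record.get("status") not in {"pending", "confirmed", "shipped", "cancelled"}:
        errors.append({"field": "status", "rule": "status_valid",
                       "message": "Invalid order status", "severity": "error"})
    return {"valid": not errors, "errors": errors}

bad = {"email": "not-an-email", "amount": -5, "status": "unknown"}
good = {"email": "alice@example.com", "amount": 49.99, "status": "pending"}
print(validate_order(bad)["valid"])   # False
print(validate_order(good)["valid"])  # True
```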


Rules

Type What it checks
not_empty Field is present and non-empty
regex Field matches (or does not match) a pattern. Built-ins: builtin:email, builtin:uuid, builtin:ipv4, builtin:url
min / max / range Numeric bounds
min_length / max_length String length
date_format Parseable date/datetime. Falls back through common formats if no explicit format is set
allowed_values Value must be in a fixed list
lookup Value must appear in a local file or HTTP endpoint (with TTL cache)
compare Cross-field: field op compare_to — supports gt, lt, gte, lte, eq, neq, and today/now sentinels
required_if / forbidden_if Conditional: required or forbidden when another field equals a value
checksum Check-digit integrity: IBAN, GTIN/GS1, NHS, ISIN, LEI, VIN, CPF, ISRC
unique No duplicates within a batch (batch mode only)
cross_field_range Value must be between two other fields in the same record
field_sum Sum of named fields must equal a target (within optional tolerance)
geospatial_bounds Lat/lon pair within a bounding box
date_diff Difference between two date fields within a range
age_match Declared age consistent with date-of-birth field
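The checksum rules are plain check-digit arithmetic. The IBAN case, for example, is the ISO 13616 mod-97 test; a sketch of the algorithm (not OpenDQV's implementation):

```python
def iban_ok(iban: str) -> bool:
    # ISO 13616 mod-97: move the first four chars to the end, map letters
    # A..Z to 10..35, and the resulting integer must be congruent to 1 mod 97.
    s = iban.replace(" ", "").upper()
    if not s.isalnum() or len(s) < 5:
        return False
    rearranged = s[4:] + s[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

print(iban_ok("GB82 WEST 1234 5698 7654 32"))  # True  (standard example IBAN)
print(iban_ok("GB82 WEST 1234 5698 7654 33"))  # False (one digit altered)
```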

Rules have severity: error (blocks the record) or severity: warning (flags but allows). Any rule can include a condition block to apply it only when another field equals a given value.
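The semantics are easy to state in code; a sketch, where the condition keys are illustrative (see docs/rules/ for the exact YAML syntax):

```python
def apply_rule(record, rule):
    # Severity semantics described above: 'error' blocks, 'warning' flags.
    # A condition block gates the rule on another field's value.
    cond = rule.get("condition")
    if cond and record.get(cond["field"]) != cond["equals"]:
        return None  # rule not applicable to this record
    if not rule["check"](record.get(rule["field"])):
        return {"field": rule["field"], "severity": rule["severity"]}
    return None

# Hypothetical rule: vat_number is expected only when country is GB.
rule = {
    "field": "vat_number",
    "severity": "warning",
    "check": lambda v: bool(v),
    "condition": {"field": "country", "equals": "GB"},
}
print(apply_rule({"country": "FR"}, rule))  # None (condition not met)
print(apply_rule({"country": "GB"}, rule))  # {'field': 'vat_number', 'severity': 'warning'}
```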

Full reference: docs/rules/


How it compares

A mature data governance programme operates across three layers, each with a distinct job:

Layer Purpose Tools
1. Write-time enforcement Prevent bad data from entering any system OpenDQV
2. Catalog / governance / stewardship Ownership, glossary, lineage, policy, stewardship workflows Alation, Atlan, Collibra, Purview, DataHub, Marmot
3. Pipeline testing / observability Detect drift, freshness issues, residual quality after ingestion Great Expectations, Soda Core, dbt tests, Monte Carlo

OpenDQV Core owns layer one. Your catalog handles layer two, your pipeline tools handle layer three.

Great Expectations / Soda / dbt OpenDQV
When After data lands (in warehouse/lake) Before data is written (at the door)
Where Data pipelines, batch jobs Source system integration points
Model Scan data at rest Validate data in flight
Latency Minutes to hours (batch) Milliseconds (API call)
Who calls it Data engineers Data engineers, developers, CRM admins

They're complementary. Use Great Expectations to monitor your warehouse. Use OpenDQV to stop bad data from getting there in the first place.


Contracts

43 production-ready contracts ship inside the opendqv package covering GDPR, HIPAA, SOX, MiFID II, UK Building Safety Act, Martyn's Law, Natasha's Law, Ofcom Online Safety Act, EU DORA, and 20+ other regulatory frameworks across UK, EU, and US. pip install opendqv gives you all of them — opendqv list works with zero configuration.

See docs/compliance-contracts.md for the full list with regulatory context, or browse opendqv/contracts/ directly. 17 minimal starter templates are in examples/starter_contracts/.


Performance

EC2 c6i.large, 2 workers, 12-rule contract, mixed 50/50 workload: ~482 req/s, p99 ~182 ms. Sizing rule: WEB_CONCURRENCY = number of vCPUs.
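As a back-of-envelope extrapolation of that figure (plain arithmetic, not an additional benchmark claim):

```python
req_per_sec = 482              # measured mixed-workload throughput above
per_day = req_per_sec * 86_400  # seconds per day
per_month = per_day * 30
print(f"{per_day:,} req/day")     # 41,644,800
print(f"{per_month:,} req/month")  # 1,249,344,000
```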

See docs/benchmark_throughput.md for full platform comparison, methodology, and monthly volume extrapolation.


Documentation

Quickstart Build your first contract in 15 minutes
Rules Reference All rule types with parameters and examples
Compliance Contracts 43 contracts with regulatory context
API Reference REST endpoints, SDK, GraphQL, webhooks
Security Deployment checklist, threat model, RBAC
Production Deployment Token auth, TLS, Docker Compose, hardening
Integrations Salesforce, Kafka, Snowflake, dbt, Databricks, MCP, and more
All docs → 76 documentation files

API Stability

OpenDQV is in Beta as of 2.0.0. The following stability commitments apply to the v2.x series:

  • REST API endpoints — paths, request bodies, and response shapes are stable within v2.x. Backwards-incompatible changes require a major version bump and follow a deprecation cycle (one minor release of warnings before removal).
  • YAML contract format — the contract schema (rules, fields, types) is stable within v2.x. New rule types may be added; existing rules will not change semantics without a deprecation cycle.
  • Python SDK — OpenDQVClient, AsyncOpenDQVClient, and LocalValidator public method signatures are stable within v2.x. Internal helpers (prefixed _) are not covered.
  • MCP tools — tool names and parameters are stable within v2.x.
  • Security fixes — backported to the latest 2.x line on a best-effort basis.

Known limitations in v2.2.x

  • Rule null handling is inconsistent. Most format rules fail when the target field is missing; a few (max_length, allowed_values) pass silently; field_sum and ratio_check coerce missing operands to 0. Single-record and batch paths disagree in a few cases. See docs/rules/core_rules.md for the full matrix and the safe pattern to use today. v2.3.0 will make this consistent (loud-by-default with an optional: true opt-out).
  • Unknown rule types pass silently at runtime. A typo in type: (e.g. min_lenght) is caught by opendqv lint but not by the engine — a typo'd rule is a disabled rule. Always lint before deploy. v2.3.0 will reject unknown types at contract load.
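Until then, one defensive pattern is to pair a silent-pass rule with an explicit not_empty on the same field (an illustration of the idea; docs/rules/core_rules.md documents the recommended pattern):

```yaml
rules:
  - name: status_present        # fails loudly if the field is missing
    type: not_empty
    field: status
    severity: error
  - name: status_valid          # allowed_values alone passes silently on a missing field
    type: allowed_values
    field: status
    allowed_values: [pending, confirmed, shipped, cancelled]
    severity: error
```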

Contributing

See CONTRIBUTING.md for setup instructions, coding guidelines, and how to submit changes.

License

MIT — see LICENSE.

Acknowledgements

Led by Sunny Sharma, BGMS Consultants Ltd. The vision, the architecture, every contract, and every design decision in this repository are directed by a human who believes data quality is a write-time responsibility.

OpenDQV is built with a hybrid team. Sunny leads — carbon and silicon. Three AI collaborators execute: Claude Sonnet 4.6 (primary developer), Claude Opus 4.6 (strategic auditor), and Grok (market intelligence). All answer to the same ethos: trust is easier to build than to repair.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

opendqv-2.3.18.tar.gz (286.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

opendqv-2.3.18-py3-none-any.whl (345.2 kB)

Uploaded Python 3

File details

Details for the file opendqv-2.3.18.tar.gz.

File metadata

  • Download URL: opendqv-2.3.18.tar.gz
  • Upload date:
  • Size: 286.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for opendqv-2.3.18.tar.gz
Algorithm Hash digest
SHA256 345701a13d6a98d39aee53b94060d19c7d6f1b7fc47837f7d149ab18084049be
MD5 bcf879e68e837a8132e1f4fe43514d1d
BLAKE2b-256 16228a6b0e947a7b97278640a84e775c2ac5ee39eec72c789289d2a955b59145


Provenance

The following attestation bundles were made for opendqv-2.3.18.tar.gz:

Publisher: publish.yml on OpenDQV/OpenDQV

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file opendqv-2.3.18-py3-none-any.whl.

File metadata

  • Download URL: opendqv-2.3.18-py3-none-any.whl
  • Upload date:
  • Size: 345.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for opendqv-2.3.18-py3-none-any.whl
Algorithm Hash digest
SHA256 fbed64447ac391c6f19ae4f9deed05629bbba6a571ea375bfd93b60a017a95ac
MD5 24fa363d0ea11754755bd72dd977312e
BLAKE2b-256 a8121930b526f77003a248c58428b8d676090d02ba47165ee7dfae675ec91000


Provenance

The following attestation bundles were made for opendqv-2.3.18-py3-none-any.whl:

Publisher: publish.yml on OpenDQV/OpenDQV

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
