DuckDB-native data contract validator: validate datasets against YAML-defined contracts

ducktor 🦆

DuckDB-native data contract validator.

Define what your data must look like in a YAML file. Run one command. Get a pass or fail.

No server. No account. No boilerplate. Just DuckDB.

pip install ducktor
ducktor validate orders_contract.yaml
  orders → data/orders.parquet  FAILED

  Check                              Status    Detail
  ────────────────────────────────── ───────── ──────────────────────────────
  order_id :: not_null               PASS
  order_id :: unique                 PASS
  status :: allowed_values           PASS
  amount :: not_null                 PASS
  amount :: min[0.0]                 PASS
  amount :: max[100000.0]            FAIL      3 row(s) violated
  created_at :: freshness[<=48h]     FAIL      got 73.2, expected <= 48

  7 checks  |  5 passed  |  2 failed

Why ducktor

                 ducktor              Great Expectations     dbt tests       Soda Cloud
──────────────── ──────────────────── ────────────────────── ─────────────── ─────────────
Setup            pip install          Data Context + config  dbt only        Cloud account
Source           Any DuckDB-readable  Connectors             dbt models      Connectors
Contract format  YAML                 Python classes         YAML (limited)  YAML + cloud
CI-friendly      ✅ exit code         ✅                     ✅              Paid
Show SQL         ✅ always            ✗                      ✗               ✗
Local-first      ✅                   ✗                      ✗               ✗

Install

pip install ducktor

Requires Python 3.10+. DuckDB is bundled, so there are no extra installs.


Quickstart

1. Profile your data (generate a starter contract)

ducktor profile data/orders.parquet --output orders_contract.yaml

This scans your file and infers column types, null rates, value ranges, and allowed values.
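
To make the inference step concrete, here is a minimal sketch of the kind of per-column statistics a profiler could derive. It is plain Python over in-memory rows, not ducktor's implementation (which runs on DuckDB). The function `profile_column` and the `max_distinct` cutoff are hypothetical names chosen for illustration.

```python
from typing import Any


def profile_column(values: list[Any], max_distinct: int = 10) -> dict[str, Any]:
    """Summarise one column: null rate, value range, candidate allowed values."""
    non_null = [v for v in values if v is not None]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    profile: dict[str, Any] = {
        "nullable": len(non_null) < len(values),
        "null_rate": round(null_rate, 4),
    }
    # Only numeric columns get a min/max range (bool is excluded on purpose).
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if non_null and is_numeric:
        profile["min"] = min(non_null)
        profile["max"] = max(non_null)
    # Low-cardinality string columns become allowed_values candidates.
    is_text = all(isinstance(v, str) for v in non_null)
    distinct = sorted(set(non_null), key=str)
    if non_null and is_text and len(distinct) <= max_distinct:
        profile["allowed_values"] = distinct
    return profile


rows = [
    {"order_id": 1, "status": "pending", "amount": 12.5},
    {"order_id": 2, "status": "shipped", "amount": None},
    {"order_id": 3, "status": "pending", "amount": 80.0},
    {"order_id": 4, "status": "shipped", "amount": 3.0},
]
profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
print(profile["amount"])
```

An inferred profile like this is only a starting point, which is why the next step is to tighten the generated contract by hand.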

2. Tweak the contract

version: 1
name: orders
source:
  type: parquet
  path: data/orders.parquet

columns:
  order_id:
    type: INTEGER
    nullable: false
    unique: true
  status:
    type: VARCHAR
    nullable: false
    allowed_values: [pending, shipped, delivered, cancelled]
  amount:
    type: DOUBLE
    nullable: false
    min: 0.0
    max: 100000.0
  created_at:
    type: TIMESTAMP
    nullable: false

dataset:
  min_rows: 1000
  max_null_rate:
    amount: 0.0
    status: 0.05
  freshness:
    column: created_at
    max_age_hours: 48

3. Validate

ducktor validate orders_contract.yaml

4. Diff contracts before deploying changes

ducktor diff orders_v1.yaml orders_v2.yaml
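
As a rough illustration of what a contract diff has to compare, the sketch below classifies columns as added, removed, or changed between two versions. It works on already-parsed `columns` mappings as plain dicts. `diff_columns` is a hypothetical helper, and which of these changes ducktor treats as breaking is not specified here.

```python
def diff_columns(old: dict, new: dict) -> dict[str, list[str]]:
    """Compare the `columns` mappings of two contract versions."""
    changes: dict[str, list[str]] = {"added": [], "removed": [], "changed": []}
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            changes["added"].append(name)
        elif name not in new:
            changes["removed"].append(name)
        elif old[name] != new[name]:
            changes["changed"].append(name)
    return changes


# Two illustrative contract versions (made-up data, not from a real project).
v1 = {
    "order_id": {"type": "INTEGER", "nullable": False},
    "amount": {"type": "DOUBLE", "max": 100000.0},
    "status": {"type": "VARCHAR"},
}
v2 = {
    "order_id": {"type": "INTEGER", "nullable": False},
    "amount": {"type": "DOUBLE", "max": 50000.0},
    "currency": {"type": "VARCHAR"},
}
changes = diff_columns(v1, v2)
print(changes)
```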

Contract Reference

Source types

source:
  type: parquet   # local .parquet file
  path: data/orders.parquet

# or
source:
  type: csv
  path: data/orders.csv

# or
source:
  type: json
  path: data/orders.json

# or: S3 / GCS / R2 (requires httpfs)
source:
  type: s3
  path: s3://my-bucket/orders/2026-06-28.parquet

# or: Postgres
source:
  type: postgres
  path: postgresql://user:pass@host/dbname::public.orders

Column checks

Check           YAML key                                     Description
─────────────── ──────────────────────────────────────────── ────────────────────────────────────
Type assertion  type: INTEGER                                Column must be castable to this type
Not null        nullable: false                              Zero nulls allowed
Unique          unique: true                                 All non-null values must be distinct
Minimum         min: 0.0                                     No values below this
Maximum         max: 100000.0                                No values above this
Allowed values  allowed_values: [a, b, c]                    Only these values permitted
Pattern         pattern: "^[A-Z]{2}\\d{4}$"                  All values must match regex
Custom SQL      custom_sql: "amount > 0 AND amount < total"  Expression must be true for all rows
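
The pattern check is the easiest one to get wrong, so here is a small sketch of what the example regex accepts. In a double-quoted YAML string, `\\d` is an escaped backslash, so the regex the engine sees is `^[A-Z]{2}\d{4}$`. The sketch uses Python's `re` module purely for illustration (DuckDB has its own regex engine), and skipping nulls is an assumption, on the reasoning that nulls are covered by `nullable`.

```python
import re

# The regex from the table, after YAML unescapes "\\d" to "\d".
PATTERN = re.compile(r"^[A-Z]{2}\d{4}$")

values = ["AB1234", "ab1234", "XY99", "ZZ0000", None]

# Assumption: nulls are ignored here and left to the not-null check.
violations = [v for v in values if v is not None and not PATTERN.search(v)]
print(violations)
```

Here `ab1234` fails because the letters are lowercase, and `XY99` fails because it has two digits instead of four.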

Dataset checks

Check      YAML key                                    Description
────────── ─────────────────────────────────────────── ────────────────────────────────────────────
Min rows   min_rows: 1000                              Dataset must have at least N rows
Max rows   max_rows: 10000000                          Dataset must have at most N rows
Null rate  max_null_rate: {col: 0.05}                  Column null rate must not exceed threshold
Freshness  freshness: {column: ts, max_age_hours: 48}  Most recent timestamp must be within N hours
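
The arithmetic behind the null-rate and freshness checks can be sketched in a few lines. This is plain Python over in-memory values, not the SQL ducktor generates. The helper names and the fixed `now` are assumptions made so the example is deterministic.

```python
from datetime import datetime, timedelta, timezone


def null_rate(values: list) -> float:
    """Fraction of entries that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def freshness_age_hours(timestamps: list[datetime], now: datetime) -> float:
    """Age in hours of the most recent timestamp, rounded to one decimal."""
    latest = max(timestamps)
    return round((now - latest).total_seconds() / 3600, 1)


# A fixed "now" keeps the example deterministic.
now = datetime(2026, 6, 28, 12, 0, tzinfo=timezone.utc)
created_at = [now - timedelta(hours=100), now - timedelta(hours=73, minutes=12)]
status = ["pending", None, "shipped", "delivered"]

age = freshness_age_hours(created_at, now)
rate = null_rate(status)
print(f"freshness: got {age}, expected <= 48 -> {'PASS' if age <= 48 else 'FAIL'}")
print(f"null rate: got {rate}, expected <= 0.05 -> {'PASS' if rate <= 0.05 else 'FAIL'}")
```

Only the newest timestamp matters for freshness: one recent row is enough to pass, however old the rest of the data is.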

CLI Reference

# Validate a contract
ducktor validate orders_contract.yaml

# Validate with JSON output (for CI)
ducktor validate orders_contract.yaml --output json

# Override source path at runtime
ducktor validate orders_contract.yaml --source s3://bucket/orders/2026-06-28.parquet

# Profile a source and generate a starter contract
ducktor profile data/orders.parquet
ducktor profile data/orders.parquet --output orders_contract.yaml
ducktor profile data/orders.csv --type csv

# Diff two contracts
ducktor diff orders_v1.yaml orders_v2.yaml
ducktor diff orders_v1.yaml orders_v2.yaml --output json

Exit codes:

  • 0: all checks passed (or no breaking changes for diff)
  • 1: one or more checks failed (or breaking changes detected)
  • 2: parse or engine error (bad YAML, file not found, etc.)
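
When calling the CLI from a script, the three exit codes can be mapped to explicit outcomes instead of being treated as a plain success/failure flag. The sketch below assumes `ducktor` is on the `PATH`. `ExitCode`, `describe`, and `run_validate` are hypothetical helpers, not part of the ducktor package.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    PASSED = 0  # all checks passed
    FAILED = 1  # one or more checks failed
    ERROR = 2   # parse or engine error


MESSAGES = {
    ExitCode.PASSED: "all checks passed",
    ExitCode.FAILED: "one or more checks failed",
    ExitCode.ERROR: "parse or engine error",
}


def describe(returncode: int) -> str:
    """Translate a ducktor exit code into a human-readable outcome."""
    try:
        return MESSAGES[ExitCode(returncode)]
    except ValueError:
        return f"unexpected exit code {returncode}"


def run_validate(contract_path: str) -> str:
    """Run `ducktor validate` and describe the result (requires ducktor installed)."""
    completed = subprocess.run(["ducktor", "validate", contract_path], check=False)
    return describe(completed.returncode)


print(describe(1))
```

Separating code 1 from code 2 lets a pipeline alert on bad data differently from a broken contract file or a missing source.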

Python Library

from ducktor import validate

# Simple
result = validate("orders_contract.yaml")
print(result.passed)       # True / False
print(result.summary)      # {"total": 9, "passed": 8, "failed": 1}

# With source override
result = validate(
    "orders_contract.yaml",
    source="s3://prod-bucket/orders/2026-06-28.parquet",
)

# Inspect individual checks
for check in result.failed_checks:
    print(f"FAILED: {check.name}")
    print(f"  detail: {check.detail}")
    print(f"  sql:    {check.sql}")   # exact SQL that ran

Using in Airflow

from airflow.operators.python import PythonOperator
from ducktor import validate

def validate_orders(**context):
    result = validate(
        "contracts/orders_contract.yaml",
        source=f"s3://bucket/orders/{context['ds']}.parquet",
    )
    if not result.passed:
        failed = [c.name for c in result.failed_checks]
        raise ValueError(f"Contract failed: {failed}")

# Declare the task inside your DAG definition (for example, a `with DAG(...)` block).
validate_task = PythonOperator(
    task_id="validate_orders",
    python_callable=validate_orders,
)

Using in Prefect

from prefect import task
from ducktor import validate

@task
def validate_orders(partition: str):
    result = validate(
        "contracts/orders_contract.yaml",
        source=f"s3://bucket/orders/{partition}.parquet",
    )
    if not result.passed:
        raise RuntimeError(f"{result.summary['failed']} checks failed")
    return result.summary

CI / CD

GitHub Actions

name: Data Contract Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install ducktor
      - run: ducktor validate contracts/orders_contract.yaml

JSON output for downstream steps

      - run: ducktor validate contracts/orders_contract.yaml --output json > validation.json
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        with:
          name: validation-report
          path: validation.json

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: ducktor
        name: Validate data contracts
        entry: ducktor validate
        args: [contracts/orders_contract.yaml]
        language: system
        pass_filenames: false

How it works

Every check compiles down to a DuckDB SQL query. You can always see exactly what ran:

from ducktor import validate

result = validate("orders_contract.yaml")
for check in result.checks:
    print(f"{check.name}: {check.status.value}")
    print(f"  {check.sql}")

Example SQL for a not_null check:

SELECT COUNT(*) FROM read_parquet('data/orders.parquet') WHERE order_id IS NULL

Zero violating rows = PASS. No magic, no hidden logic.
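
To illustrate the "one check, one query" idea, here is a minimal sketch that turns a column spec into violation-counting SQL strings. Only the `not_null` query shape comes from the example above. The other query shapes, the `compile_checks` and `sql_literal` helpers, and the unquoted column identifiers are assumptions for illustration, not ducktor's actual compiler.

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def compile_checks(column: str, spec: dict, source: str) -> dict[str, str]:
    """Map each configured check to a query that counts violating rows."""
    base = f"SELECT COUNT(*) FROM {source} WHERE "
    queries: dict[str, str] = {}
    if spec.get("nullable") is False:
        queries[f"{column} :: not_null"] = base + f"{column} IS NULL"
    if "min" in spec:
        queries[f"{column} :: min[{spec['min']}]"] = (
            base + f"{column} < {sql_literal(spec['min'])}"
        )
    if "max" in spec:
        queries[f"{column} :: max[{spec['max']}]"] = (
            base + f"{column} > {sql_literal(spec['max'])}"
        )
    if "allowed_values" in spec:
        allowed = ", ".join(sql_literal(v) for v in spec["allowed_values"])
        queries[f"{column} :: allowed_values"] = base + f"{column} NOT IN ({allowed})"
    return queries


source = "read_parquet('data/orders.parquet')"
order_id_queries = compile_checks("order_id", {"nullable": False}, source)
amount_queries = compile_checks("amount", {"min": 0.0, "max": 100000.0}, source)
for name, sql in {**order_id_queries, **amount_queries}.items():
    print(f"{name}\n  {sql}")
```

Every query has the same contract: it returns the number of violating rows, so a result of zero means the check passes.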


Contributing

See CONTRIBUTING.md.

git clone https://github.com/yourusername/ducktor
cd ducktor
pip install -e ".[dev]"
pytest

License

MIT
