Data Quality eXcellence

Project description

DQX - Data Quality Excellence

Transform data validation into mathematical expressions. Find issues before they find you.

Write quality checks as code, not SQL. Get instant feedback on millions of rows.

Why DQX?

Before: Complex SQL that's hard to maintain

WITH metrics AS (
  SELECT
    COUNT(*) as total,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) as null_prices,
    AVG(price) as avg_price
  FROM orders
)
SELECT
  CASE
    WHEN null_prices::float / total > 0.05 THEN 'FAIL: Too many nulls'
    WHEN avg_price < 10 THEN 'FAIL: Prices too low'
    ELSE 'PASS'
  END as result
FROM metrics;

After: Clear, testable validation logic

@check(name="Order quality")
def validate_orders(mp: MetricProvider, ctx: Context) -> None:
    null_rate = mp.null_count("price") / mp.num_rows()
    ctx.assert_that(null_rate).where(name="Price completeness").is_leq(0.05)
    ctx.assert_that(mp.average("price")).where(name="Price reasonableness").is_geq(10.0)

Quick Start

pip install dqx

Then validate your data:

import pyarrow as pa
import sympy as sp
from dqx.api import check, VerificationSuite, MetricProvider, Context
from dqx.common import ResultKey
from dqx.datasource import DuckRelationDataSource
from dqx.orm.repositories import InMemoryMetricDB


# Your business rules as code
@check(name="Revenue integrity")
def validate_revenue(mp: MetricProvider, ctx: Context) -> None:
    # Catch calculation errors
    calculated = mp.sum("price") * mp.sum("quantity")
    reported = mp.sum("revenue")
    error_rate = sp.Abs(calculated - reported) / reported

    ctx.assert_that(error_rate).where(
        name="Revenue calculation accuracy", severity="P0"
    ).is_lt(
        0.001
    )  # Less than 0.1% error


# Load your data
data = pa.Table.from_pydict(
    {"price": [10.5, 20.0, 15.5], "quantity": [2, 1, 3], "revenue": [21.0, 20.0, 46.5]}
)

# Run validation
db = InMemoryMetricDB()
suite = VerificationSuite([validate_revenue], db, "Daily validation")
suite.run([DuckRelationDataSource.from_arrow(data)], ResultKey())
# ✓ Revenue integrity: OK

What DQX Solves

Real examples from production data pipelines:

📊 Data Completeness

"Are critical fields populated?"

@check(name="Customer data quality")
def check_completeness(mp: MetricProvider, ctx: Context) -> None:
    # Flag if more than 5% of customer IDs are missing
    null_rate = mp.null_count("customer_id") / mp.num_rows()
    ctx.assert_that(null_rate).where(
        name="Customer ID completeness", severity="P0"
    ).is_lt(0.05)

📈 Trend Monitoring

"Did revenue drop unexpectedly?"

@check(name="Revenue stability")
def monitor_trends(mp: MetricProvider, ctx: Context) -> None:
    # Alert on >20% daily revenue changes
    daily_change = mp.sum("revenue") / mp.sum("revenue", lag=1)
    ctx.assert_that(daily_change).where(
        name="Daily revenue stability", severity="P0"
    ).is_between(0.8, 1.2)

    # Track week-over-week growth
    wow_change = mp.sum("revenue") / mp.sum("revenue", lag=7)
    ctx.assert_that(wow_change).where(name="Weekly revenue trend").is_geq(
        0.95
    )  # Allow 5% decline

🔍 Data Integrity

"Are there duplicate transactions?"

@check(name="Transaction integrity")
def check_integrity(mp: MetricProvider, ctx: Context) -> None:
    # No duplicate transaction IDs allowed
    ctx.assert_that(mp.duplicate_count(["transaction_id"])).where(
        name="Transaction uniqueness", severity="P0"
    ).is_eq(0)

    # Validate business rules
    ctx.assert_that(mp.minimum("quantity")).where(
        name="Positive quantities only"
    ).is_positive()

⚖️ Cross-Dataset Validation

"Do staging and production match?"

@check(name="Environment consistency", datasets=["staging", "production"])
def compare_environments(mp: MetricProvider, ctx: Context) -> None:
    # Ensure staging matches production
    prod_total = mp.sum("revenue", dataset="production")
    staging_total = mp.sum("revenue", dataset="staging")

    ctx.assert_that(prod_total).where(name="Prod-staging revenue match").is_eq(
        staging_total, tol=0.01
    )  # 1% tolerance

Core Concepts

Just three simple ideas:

Checks = Your validation logic (decorated functions)
Metrics = What you measure (sum, average, null_count)
Assertions = Your rules (is_between, is_positive, is_eq)

Common Patterns

Validate Percentages

# Check null rate
null_rate = mp.null_count("email") / mp.num_rows()
ctx.assert_that(null_rate).where(name="Email completeness").is_lt(0.1)  # <10%

# Check category distribution
fraud_rate = mp.count_values("status", "fraud") / mp.num_rows()
ctx.assert_that(fraud_rate).where(name="Fraud rate", severity="P0").is_lt(0.001)

# Check cardinality
unique_users = mp.unique_count("user_id")
total_orders = mp.num_rows()
ctx.assert_that(unique_users / total_orders).where(name="User diversity").is_gt(
    0.3
)  # >30% unique users

Monitor Complex Metrics

# Standard deviation over time
tax_volatility = mp.ext.stddev(mp.average("tax_rate"), lag=1, n=30)
ctx.assert_that(tax_volatility).where(name="Tax rate stability").is_lt(0.05)

# Multi-step calculations
conversion_rate = mp.count_values("converted", True) / mp.count_values("visited", True)
ctx.assert_that(conversion_rate).where(name="Conversion rate").is_between(0.02, 0.10)

Detect Anomalies

# Spike detection
max_payment = mp.maximum("payment_amount")
avg_payment = mp.average("payment_amount")
spike_ratio = max_payment / avg_payment

ctx.assert_that(spike_ratio).where(name="Payment spike detection", severity="P1").is_lt(
    100
)  # Max should be <100x average

Development & Tooling

Setup

git clone <repo>
cd dqx
./bin/setup-dev-env.sh  # One-time setup

Daily Workflow

# Run tests with coverage
uv run pytest --cov=dqx

# Check code quality (auto-fixes many issues)
uv run hooks

# Commit with conventional format
uv run cz commit  # Interactive commit helper

# Clean up temporary files
uv run cleanup

Key Tools

uv - Fast Python package & environment manager
pytest - Testing with 100% coverage requirement
ruff - Lightning-fast Python linter
mypy - Static type checking
commitizen - Conventional commit enforcement

Quick Reference

Metrics Available

Metric	Description	Example
`num_rows()`	Total row count	Basic count
`null_count(col)`	Null values in column	Data completeness
`sum(col)`	Sum of values	Totals
`average(col)`	Mean value	Averages
`minimum(col)` / `maximum(col)`	Min/max values	Range checks
`duplicate_count([cols])`	Duplicate rows	Uniqueness
`count_values(col, value)`	Count specific values	Categories
`unique_count(col)`	Distinct values	Cardinality

Assertions Available

Assertion	Description	Example
`is_eq(value, tol)`	Equals (with tolerance)	Exact match
`is_between(min, max)`	In range (inclusive)	Valid ranges
`is_positive()` / `is_zero()`	Sign checks	Amounts
`is_gt(val)` / `is_geq(val)`	Greater than (or equal)	Thresholds
`is_lt(val)` / `is_leq(val)`	Less than (or equal)	Limits

Time-based Analysis

# Yesterday's value
mp.sum("revenue", lag=1)

# Last week's value
mp.sum("revenue", lag=7)

# Day-over-day change
mp.ext.day_over_day(mp.sum("revenue"))

# Week-over-week change
mp.ext.week_over_week(mp.sum("revenue"))

Installation Options

# Basic installation
pip install dqx

# With PostgreSQL support
pip install "dqx[postgres]"

# Development setup
pip install "dqx[dev]"

Learn More

Examples - Real-world validation scenarios
Design Document - Architecture decisions
Plugin System - Extend DQX functionality

License

MIT License. See LICENSE for details.

DQX makes data quality a first-class concern. Stop writing SQL. Start validating data.

Project details

Release history Release notifications | RSS feed

0.5.23

Mar 11, 2026

0.5.22

Mar 11, 2026

0.5.21

Mar 11, 2026

0.5.20

Mar 2, 2026

0.5.18

Feb 28, 2026

0.5.17

Jan 29, 2026

0.5.16

Jan 28, 2026

0.5.15

Jan 21, 2026

0.5.14

Jan 19, 2026

0.5.13

Jan 19, 2026

0.5.12

Jan 19, 2026

0.5.11

Dec 26, 2025

0.5.10

Nov 12, 2025

0.5.9

Nov 4, 2025

0.5.8

Nov 3, 2025

0.5.7

Nov 3, 2025

0.5.6

Nov 3, 2025

0.5.5

Oct 30, 2025

0.5.4

Oct 29, 2025

0.5.3

Oct 29, 2025

This version

0.5.2

Oct 29, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dqlib-0.5.2.tar.gz (70.0 kB view details)

Uploaded Oct 29, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

dqlib-0.5.2-py3-none-any.whl (82.2 kB view details)

Uploaded Oct 29, 2025 Python 3

File details

Details for the file dqlib-0.5.2.tar.gz.

File metadata

Download URL: dqlib-0.5.2.tar.gz
Upload date: Oct 29, 2025
Size: 70.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dqlib-0.5.2.tar.gz
Algorithm	Hash digest
SHA256	`cda0b04d99fa7eb3831a03dee84d858c49dfdba9381aebcefccdafb8282b7a6c`
MD5	`ce377676317d6190fa5e6b354d64a811`
BLAKE2b-256	`05e69ab21aac31649db61532cecd12c354ec36b1ffd133122284ab57f5307b6d`

See more details on using hashes here.

File details

Details for the file dqlib-0.5.2-py3-none-any.whl.

File metadata

Download URL: dqlib-0.5.2-py3-none-any.whl
Upload date: Oct 29, 2025
Size: 82.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dqlib-0.5.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24627feb09338320b6e5cecf7731f4d83766a70c524cc0fcbd0d36500df922a2`
MD5	`5281f47c6511db7a0c55f1baf7448075`
BLAKE2b-256	`6f995c0b0ed84708611b0e48ee475c6b7ee20dd3c3dc5008f7ea95e778d18377`

See more details on using hashes here.

dqlib 0.5.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

DQX - Data Quality Excellence

Why DQX?

Before: Complex SQL that's hard to maintain

After: Clear, testable validation logic

Quick Start

What DQX Solves

📊 Data Completeness

📈 Trend Monitoring

🔍 Data Integrity

⚖️ Cross-Dataset Validation

Core Concepts

Common Patterns

Validate Percentages

Monitor Complex Metrics

Detect Anomalies

Development & Tooling

Setup

Daily Workflow

Key Tools

Quick Reference

Metrics Available

Assertions Available

Time-based Analysis

Installation Options

Learn More

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes