Skip to main content

Data Quality eXcellence

Project description

DQX - Data Quality Excellence

Transform data validation into mathematical expressions. Find issues before they find you.

Python License Code style: black Documentation Status

Write quality checks as code, not SQL. Get instant feedback on millions of rows.

Why DQX?

Before: Complex SQL that's hard to maintain

WITH metrics AS (
  SELECT
    COUNT(*) as total,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) as null_prices,
    AVG(price) as avg_price
  FROM orders
)
SELECT
  CASE
    WHEN null_prices::float / total > 0.05 THEN 'FAIL: Too many nulls'
    WHEN avg_price < 10 THEN 'FAIL: Prices too low'
    ELSE 'PASS'
  END as result
FROM metrics;

After: Clear, testable validation logic

@check(name="Order quality")
def validate_orders(mp: MetricProvider, ctx: Context) -> None:
    null_rate = mp.null_count("price") / mp.num_rows()
    ctx.assert_that(null_rate).where(name="Price completeness").is_leq(0.05)
    ctx.assert_that(mp.average("price")).where(name="Price reasonableness").is_geq(10.0)

Quick Start

pip install dqx

Then validate your data:

import pyarrow as pa
import sympy as sp
from dqx.api import check, VerificationSuite, MetricProvider, Context
from dqx.common import ResultKey
from dqx.datasource import DuckRelationDataSource
from dqx.orm.repositories import InMemoryMetricDB


# Your business rules as code
@check(name="Revenue integrity")
def validate_revenue(mp: MetricProvider, ctx: Context) -> None:
    # Catch calculation errors
    calculated = mp.sum("price") * mp.sum("quantity")
    reported = mp.sum("revenue")
    error_rate = sp.Abs(calculated - reported) / reported

    ctx.assert_that(error_rate).where(
        name="Revenue calculation accuracy", severity="P0"
    ).is_lt(
        0.001
    )  # Less than 0.1% error


# Load your data
data = pa.Table.from_pydict(
    {"price": [10.5, 20.0, 15.5], "quantity": [2, 1, 3], "revenue": [21.0, 20.0, 46.5]}
)

# Run validation
db = InMemoryMetricDB()
suite = VerificationSuite([validate_revenue], db, "Daily validation")
suite.run([DuckRelationDataSource.from_arrow(data)], ResultKey())
# ✓ Revenue integrity: OK

What DQX Solves

Real examples from production data pipelines:

📊 Data Completeness

"Are critical fields populated?"

@check(name="Customer data quality")
def check_completeness(mp: MetricProvider, ctx: Context) -> None:
    # Flag if more than 5% of customer IDs are missing
    null_rate = mp.null_count("customer_id") / mp.num_rows()
    ctx.assert_that(null_rate).where(
        name="Customer ID completeness", severity="P0"
    ).is_lt(0.05)

📈 Trend Monitoring

"Did revenue drop unexpectedly?"

@check(name="Revenue stability")
def monitor_trends(mp: MetricProvider, ctx: Context) -> None:
    # Alert on >20% daily revenue changes
    daily_change = mp.sum("revenue") / mp.sum("revenue", lag=1)
    ctx.assert_that(daily_change).where(
        name="Daily revenue stability", severity="P0"
    ).is_between(0.8, 1.2)

    # Track week-over-week growth
    wow_change = mp.sum("revenue") / mp.sum("revenue", lag=7)
    ctx.assert_that(wow_change).where(name="Weekly revenue trend").is_geq(
        0.95
    )  # Allow 5% decline

🔍 Data Integrity

"Are there duplicate transactions?"

@check(name="Transaction integrity")
def check_integrity(mp: MetricProvider, ctx: Context) -> None:
    # No duplicate transaction IDs allowed
    ctx.assert_that(mp.duplicate_count(["transaction_id"])).where(
        name="Transaction uniqueness", severity="P0"
    ).is_eq(0)

    # Validate business rules
    ctx.assert_that(mp.minimum("quantity")).where(
        name="Positive quantities only"
    ).is_positive()

⚖️ Cross-Dataset Validation

"Do staging and production match?"

@check(name="Environment consistency", datasets=["staging", "production"])
def compare_environments(mp: MetricProvider, ctx: Context) -> None:
    # Ensure staging matches production
    prod_total = mp.sum("revenue", dataset="production")
    staging_total = mp.sum("revenue", dataset="staging")

    ctx.assert_that(prod_total).where(name="Prod-staging revenue match").is_eq(
        staging_total, tol=0.01
    )  # 1% tolerance

Core Concepts

Just three simple ideas:

  1. Checks = Your validation logic (decorated functions)
  2. Metrics = What you measure (sum, average, null_count)
  3. Assertions = Your rules (is_between, is_positive, is_eq)

Common Patterns

Validate Percentages

# Check null rate
null_rate = mp.null_count("email") / mp.num_rows()
ctx.assert_that(null_rate).where(name="Email completeness").is_lt(0.1)  # <10%

# Check category distribution
fraud_rate = mp.count_values("status", "fraud") / mp.num_rows()
ctx.assert_that(fraud_rate).where(name="Fraud rate", severity="P0").is_lt(0.001)

# Check cardinality
unique_users = mp.unique_count("user_id")
total_orders = mp.num_rows()
ctx.assert_that(unique_users / total_orders).where(name="User diversity").is_gt(
    0.3
)  # >30% unique users

Monitor Complex Metrics

# Standard deviation over time
tax_volatility = mp.ext.stddev(mp.average("tax_rate"), lag=1, n=30)
ctx.assert_that(tax_volatility).where(name="Tax rate stability").is_lt(0.05)

# Multi-step calculations
conversion_rate = mp.count_values("converted", True) / mp.count_values("visited", True)
ctx.assert_that(conversion_rate).where(name="Conversion rate").is_between(0.02, 0.10)

Detect Anomalies

# Spike detection
max_payment = mp.maximum("payment_amount")
avg_payment = mp.average("payment_amount")
spike_ratio = max_payment / avg_payment

ctx.assert_that(spike_ratio).where(name="Payment spike detection", severity="P1").is_lt(
    100
)  # Max should be <100x average

Development & Tooling

Setup

git clone <repo>
cd dqx
./bin/setup-dev-env.sh  # One-time setup

Daily Workflow

# Run tests with coverage
uv run pytest --cov=dqx

# Check code quality (auto-fixes many issues)
uv run hooks

# Commit with conventional format
uv run cz commit  # Interactive commit helper

# Clean up temporary files
uv run cleanup

Key Tools

  • uv - Fast Python package & environment manager
  • pytest - Testing with 100% coverage requirement
  • ruff - Lightning-fast Python linter
  • mypy - Static type checking
  • commitizen - Conventional commit enforcement

Quick Reference

Metrics Available

Metric Description Example
num_rows() Total row count Basic count
null_count(col) Null values in column Data completeness
sum(col) Sum of values Totals
average(col) Mean value Averages
minimum(col) / maximum(col) Min/max values Range checks
duplicate_count([cols]) Duplicate rows Uniqueness
count_values(col, value) Count specific values Categories
unique_count(col) Distinct values Cardinality

Assertions Available

Assertion Description Example
is_eq(value, tol) Equals (with tolerance) Exact match
is_between(min, max) In range (inclusive) Valid ranges
is_positive() / is_zero() Sign checks Amounts
is_gt(val) / is_geq(val) Greater than (or equal) Thresholds
is_lt(val) / is_leq(val) Less than (or equal) Limits

Time-based Analysis

# Yesterday's value
mp.sum("revenue", lag=1)

# Last week's value
mp.sum("revenue", lag=7)

# Day-over-day change
mp.ext.day_over_day(mp.sum("revenue"))

# Week-over-week change
mp.ext.week_over_week(mp.sum("revenue"))

Installation Options

# Basic installation
pip install dqx

# With PostgreSQL support
pip install "dqx[postgres]"

# Development setup
pip install "dqx[dev]"

Learn More

License

MIT License. See LICENSE for details.


DQX makes data quality a first-class concern. Stop writing SQL. Start validating data.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dqlib-0.5.2.tar.gz (70.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dqlib-0.5.2-py3-none-any.whl (82.2 kB view details)

Uploaded Python 3

File details

Details for the file dqlib-0.5.2.tar.gz.

File metadata

  • Download URL: dqlib-0.5.2.tar.gz
  • Upload date:
  • Size: 70.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dqlib-0.5.2.tar.gz
Algorithm Hash digest
SHA256 cda0b04d99fa7eb3831a03dee84d858c49dfdba9381aebcefccdafb8282b7a6c
MD5 ce377676317d6190fa5e6b354d64a811
BLAKE2b-256 05e69ab21aac31649db61532cecd12c354ec36b1ffd133122284ab57f5307b6d

See more details on using hashes here.

File details

Details for the file dqlib-0.5.2-py3-none-any.whl.

File metadata

  • Download URL: dqlib-0.5.2-py3-none-any.whl
  • Upload date:
  • Size: 82.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dqlib-0.5.2-py3-none-any.whl
Algorithm Hash digest
SHA256 24627feb09338320b6e5cecf7731f4d83766a70c524cc0fcbd0d36500df922a2
MD5 5281f47c6511db7a0c55f1baf7448075
BLAKE2b-256 6f995c0b0ed84708611b0e48ee475c6b7ee20dd3c3dc5008f7ea95e778d18377

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page