
DQX - Data Quality Excellence

Transform data validation into mathematical expressions. Find issues before they find you.


Write quality checks as code, not SQL. Get instant feedback on millions of rows.

Why DQX?

Before: Complex SQL that's hard to maintain

WITH metrics AS (
  SELECT
    COUNT(*) as total,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) as null_prices,
    AVG(price) as avg_price
  FROM orders
)
SELECT
  CASE
    WHEN null_prices::float / total > 0.05 THEN 'FAIL: Too many nulls'
    WHEN avg_price < 10 THEN 'FAIL: Prices too low'
    ELSE 'PASS'
  END as result
FROM metrics;

After: Clear, testable validation logic

@check(name="Order quality")
def validate_orders(mp: MetricProvider, ctx: Context) -> None:
    null_rate = mp.null_count("price") / mp.num_rows()
    ctx.assert_that(null_rate).where(name="Price completeness").is_leq(0.05)
    ctx.assert_that(mp.average("price")).where(name="Price reasonableness").is_geq(10.0)

Quick Start

Note: Install the package as dqlib but import it as dqx

# Install from PyPI
pip install dqlib

Then validate your data:

import pyarrow as pa
import sympy as sp
from dqx.api import check, VerificationSuite, MetricProvider, Context
from dqx.common import ResultKey
from dqx.datasource import DuckRelationDataSource
from dqx.orm.repositories import InMemoryMetricDB


# Your business rules as code
@check(name="Revenue integrity")
def validate_revenue(mp: MetricProvider, ctx: Context) -> None:
    # Catch calculation errors
    calculated = mp.sum("price") * mp.sum("quantity")
    reported = mp.sum("revenue")
    error_rate = sp.Abs(calculated - reported) / reported

    ctx.assert_that(error_rate).where(
        name="Revenue calculation accuracy", severity="P0"
    ).is_lt(
        0.001
    )  # Less than 0.1% error


# Load your data
data = pa.Table.from_pydict(
    {"price": [10.5, 20.0, 15.5], "quantity": [2, 1, 3], "revenue": [21.0, 20.0, 46.5]}
)

# Run validation
db = InMemoryMetricDB()
suite = VerificationSuite([validate_revenue], db, "Daily validation")
suite.run([DuckRelationDataSource.from_arrow(data)], ResultKey())
# ✓ Revenue integrity: OK

What DQX Solves

Real examples from production data pipelines:

📊 Data Completeness

"Are critical fields populated?"

@check(name="Customer data quality")
def check_completeness(mp: MetricProvider, ctx: Context) -> None:
    # Flag if more than 5% of customer IDs are missing
    null_rate = mp.null_count("customer_id") / mp.num_rows()
    ctx.assert_that(null_rate).where(
        name="Customer ID completeness", severity="P0"
    ).is_lt(0.05)

📈 Trend Monitoring

"Did revenue drop unexpectedly?"

@check(name="Revenue stability")
def monitor_trends(mp: MetricProvider, ctx: Context) -> None:
    # Alert on >20% daily revenue changes
    daily_change = mp.sum("revenue") / mp.sum("revenue", lag=1)
    ctx.assert_that(daily_change).where(
        name="Daily revenue stability", severity="P0"
    ).is_between(0.8, 1.2)

    # Track week-over-week growth
    wow_change = mp.sum("revenue") / mp.sum("revenue", lag=7)
    ctx.assert_that(wow_change).where(name="Weekly revenue trend").is_geq(
        0.95
    )  # Allow 5% decline

🔍 Data Integrity

"Are there duplicate transactions?"

@check(name="Transaction integrity")
def check_integrity(mp: MetricProvider, ctx: Context) -> None:
    # No duplicate transaction IDs allowed
    ctx.assert_that(mp.duplicate_count(["transaction_id"])).where(
        name="Transaction uniqueness", severity="P0"
    ).is_eq(0)

    # Validate business rules
    ctx.assert_that(mp.minimum("quantity")).where(
        name="Positive quantities only"
    ).is_positive()

⚖️ Cross-Dataset Validation

"Do staging and production match?"

@check(name="Environment consistency", datasets=["staging", "production"])
def compare_environments(mp: MetricProvider, ctx: Context) -> None:
    # Ensure staging matches production
    prod_total = mp.sum("revenue", dataset="production")
    staging_total = mp.sum("revenue", dataset="staging")

    ctx.assert_that(prod_total).where(name="Prod-staging revenue match").is_eq(
        staging_total, tol=0.01
    )  # 1% tolerance

Core Concepts

Just three simple ideas:

  1. Checks = Your validation logic (decorated functions)
  2. Metrics = What you measure (sum, average, null_count)
  3. Assertions = Your rules (is_between, is_positive, is_eq)
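These three ideas compose: a check computes metrics over a dataset and asserts rules on them. Stripped of the DQX machinery, the pattern can be sketched in plain Python (the code below is illustrative only, not the DQX API):

```python
# Illustrative sketch of the check/metric/assertion pattern, without DQX.
rows = [
    {"price": 10.5},
    {"price": None},
    {"price": 15.5},
    {"price": 20.0},
]

# Metrics: what you measure
num_rows = len(rows)
null_count = sum(1 for r in rows if r["price"] is None)

# Assertion: your rule (fail if more than 30% of prices are missing)
null_rate = null_count / num_rows
assert null_rate <= 0.3, f"Price completeness failed: {null_rate:.0%} null"
```

In DQX, the same logic is declared through MetricProvider and Context so the framework can evaluate, name, and report each assertion instead of raising a bare AssertionError.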

Common Patterns

Validate Percentages

# Check null rate
null_rate = mp.null_count("email") / mp.num_rows()
ctx.assert_that(null_rate).where(name="Email completeness").is_lt(0.1)  # <10%

# Check category distribution
fraud_rate = mp.count_values("status", "fraud") / mp.num_rows()
ctx.assert_that(fraud_rate).where(name="Fraud rate", severity="P0").is_lt(0.001)

# Check cardinality
unique_users = mp.unique_count("user_id")
total_orders = mp.num_rows()
ctx.assert_that(unique_users / total_orders).where(name="User diversity").is_gt(
    0.3
)  # >30% unique users

Monitor Complex Metrics

# Standard deviation over time
tax_volatility = mp.ext.stddev(mp.average("tax_rate"), lag=1, n=30)
ctx.assert_that(tax_volatility).where(name="Tax rate stability").is_lt(0.05)

# Multi-step calculations
conversion_rate = mp.count_values("converted", True) / mp.count_values("visited", True)
ctx.assert_that(conversion_rate).where(name="Conversion rate").is_between(0.02, 0.10)

Detect Anomalies

# Spike detection
max_payment = mp.maximum("payment_amount")
avg_payment = mp.average("payment_amount")
spike_ratio = max_payment / avg_payment

ctx.assert_that(spike_ratio).where(name="Payment spike detection", severity="P1").is_lt(
    100
)  # Max should be <100x average

Development & Tooling

Setup

git clone <repo>
cd dqx
./bin/setup-dev-env.sh  # One-time setup

Daily Workflow

# Run tests with coverage
uv run pytest --cov=dqx

# Check code quality (auto-fixes many issues)
uv run hooks

# Commit with conventional format
uv run cz commit  # Interactive commit helper

# Clean up temporary files
uv run cleanup

Key Tools

  • uv - Fast Python package & environment manager
  • pytest - Testing with 100% coverage requirement
  • ruff - Lightning-fast Python linter
  • mypy - Static type checking
  • commitizen - Conventional commit enforcement

Quick Reference

Metrics Available

| Metric | Description | Example |
| --- | --- | --- |
| num_rows() | Total row count | Basic count |
| null_count(col) | Null values in column | Data completeness |
| sum(col) | Sum of values | Totals |
| average(col) | Mean value | Averages |
| minimum(col) / maximum(col) | Min/max values | Range checks |
| duplicate_count([cols]) | Duplicate rows | Uniqueness |
| count_values(col, value) | Count specific values | Categories |
| unique_count(col) | Distinct values | Cardinality |

Assertions Available

| Assertion | Description | Example |
| --- | --- | --- |
| is_eq(value, tol) | Equals (with tolerance) | Exact match |
| is_between(min, max) | In range (inclusive) | Valid ranges |
| is_positive() / is_zero() | Sign checks | Amounts |
| is_gt(val) / is_geq(val) | Greater than (or equal) | Thresholds |
| is_lt(val) / is_leq(val) | Less than (or equal) | Limits |
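The tol parameter of is_eq, together with the "1% tolerance" comment in the cross-dataset example, suggests a relative comparison. A plain-Python sketch of that semantics (this is an assumption about the behavior, not DQX's actual implementation):

```python
def approx_eq(actual: float, expected: float, tol: float) -> bool:
    """Relative-tolerance equality: |actual - expected| <= tol * |expected|.
    Assumed semantics for illustration only, not DQX's implementation."""
    return abs(actual - expected) <= tol * abs(expected)

print(approx_eq(100.5, 100.0, 0.01))  # within 1% -> True
print(approx_eq(102.0, 100.0, 0.01))  # 2% off -> False
```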

Time-based Analysis

# Yesterday's value
mp.sum("revenue", lag=1)

# Last week's value
mp.sum("revenue", lag=7)

# Day-over-day change
mp.ext.day_over_day(mp.sum("revenue"))

# Week-over-week change
mp.ext.week_over_week(mp.sum("revenue"))
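If day_over_day is a simple ratio of today's metric to the lag-1 value (an assumption based on the lag examples above, not confirmed by the source), it reduces to:

```python
# Hypothetical sketch: day-over-day ratio from a daily series.
daily_revenue = {"2024-01-01": 1000.0, "2024-01-02": 1200.0}

def day_over_day(series: dict[str, float], today: str, yesterday: str) -> float:
    """Ratio of today's value to yesterday's (assumed semantics)."""
    return series[today] / series[yesterday]

print(day_over_day(daily_revenue, "2024-01-02", "2024-01-01"))  # 1.2
```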

Installation Options

Note: The package is published on PyPI as dqlib but imported as dqx in your code.

# Basic installation
pip install dqlib

# With PostgreSQL support
pip install "dqlib[postgres]"

# Development setup
pip install "dqlib[dev]"

After installation, import the package as:

import dqx

License

MIT License. See LICENSE for details.


DQX makes data quality a first-class concern. Stop writing SQL. Start validating data.
