Skip to main content

Data Quality checks using SQL

Project description

SQLDQ

SQLDQ is a Data Quality Testing library that keeps it simple and flexible.

Simply use SQL to define your checks.

Support

You can run data quality checks on:

  • In-memory:
    • Pandas (.from_duckdb)
    • Polars (.from_duckdb)
    • Pyspark (.from_pyspark), requires pyspark>=3.4.0
  • Remotely, only results are collected:
    • Postgres (.from_postgresql)
    • AWS Athena (.from_athena)
  • Everything else supported by DuckDB

Installation

pip install sqldq / uv pip install sqldq

You also need to install the corresponding libraries for your backend of choice, .e.g. duckdb when using .from_duckdb. sqldq`s error messages will also inform you about missing dependencies.

Examples

To see all of its features and examples for all supported backends, see the demo folder.

The basic workflow is as follows:

from sqldq import SQLDQ
import duckdb
import polars as pl

# Sample data
df_users = pl.DataFrame({
    "user_id": [1, 2, 2],          # Duplicate user_id=2
    "age": [25, 150, 45],          # Age 150 is an unplausible outlier
    "email": ["user1@example.com",
              "user2@example.com",
              "invalid-email"],    # Invalid email
})

# Connect via DuckDB
con = duckdb.connect()
con.register("users", df_users)

dq = SQLDQ.from_duckdb(connection=con)

# Define DQ checks
dq = (
    dq.add_check(
        name="check_duplicate_user_id",
        failure_rows_query="""
            WITH duplicate_users AS (
                SELECT user_id, COUNT(*) AS count
                FROM users
                GROUP BY user_id
            )
                SELECT user_id
                FROM duplicate_users
                WHERE count > 1""")
    .add_check(
        name="check_invalid_email",
        failure_rows_query="""
            SELECT user_id
            FROM users
            WHERE email NOT LIKE '%_@__%.__%'
        """)
    .add_check(
        name="check_age_outlier",
        failure_rows_query="""
            SELECT user_id, age
            FROM users
            WHERE age NOT BETWEEN 0 AND 120"""))

# Run checks
result = dq.execute()

# Report on results
report = result.report(include_rows=True,
                       include_summary_header=True,
                       fail_only=True)
print(report)

# Control flow
if result.has_failures():
    print("Checks failed. here we can take custom actions.")

Development

# open devcontainer
uv sync
source .venv/bin/activate
make check

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqldq-0.2.6.tar.gz (13.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sqldq-0.2.6-py3-none-any.whl (12.4 kB view details)

Uploaded Python 3

File details

Details for the file sqldq-0.2.6.tar.gz.

File metadata

  • Download URL: sqldq-0.2.6.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for sqldq-0.2.6.tar.gz
Algorithm Hash digest
SHA256 e954689d67a31a0f15697eced6dfb52bca298d4dfd9a269c84ab789f764d6f42
MD5 bdd6fa3b3aa7170c75c5704ba4d341e2
BLAKE2b-256 83688605bd73d563b237524054a461daa40c6d115d79f0b350fb00e8bb7da14e

See more details on using hashes here.

File details

Details for the file sqldq-0.2.6-py3-none-any.whl.

File metadata

  • Download URL: sqldq-0.2.6-py3-none-any.whl
  • Upload date:
  • Size: 12.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for sqldq-0.2.6-py3-none-any.whl
Algorithm Hash digest
SHA256 c3012358fe5e8d118d97848d31abcf0b1ca5e4d15c65475ad7512edb15a8ffe9
MD5 67708c4dfa9f3f75d0d65618b9e19e0c
BLAKE2b-256 54f6bb07259e60d96834926b5845998d61cebf9cffdc7d067908b36cec7ce6e7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page