Data Quality checks using SQL

Project description

SQLDQ

SQLDQ is a Data Quality Testing library that keeps it simple and flexible.

Simply use SQL to define your checks.

Support

You can run data quality checks on:

  • In-memory:
    • Pandas (.from_duckdb)
    • Polars (.from_duckdb)
    • Pyspark (.from_pyspark), requires pyspark>=3.4.0
  • Remotely, only results are collected:
    • Postgres (.from_postgresql)
    • AWS Athena (.from_athena)
  • Everything else supported by DuckDB

Installation

pip install sqldq
# or with uv
uv pip install sqldq

You also need to install the corresponding library for your backend of choice, e.g. duckdb when using .from_duckdb. sqldq's error messages will also inform you about missing dependencies.

Examples

To see all of its features and examples for all supported backends, see the demo folder.

The basic workflow is as follows:

from sqldq import SQLDQ
import duckdb
import polars as pl

# Sample data
df_users = pl.DataFrame({
    "user_id": [1, 2, 2],          # Duplicate user_id=2
    "age": [25, 150, 45],          # Age 150 is an implausible outlier
    "email": ["user1@example.com",
              "user2@example.com",
              "invalid-email"],    # Invalid email
})

# Connect via DuckDB
con = duckdb.connect()
con.register("users", df_users)

dq = SQLDQ.from_duckdb(connection=con)

# Define DQ checks
dq = (
    dq.add_check(
        name="check_duplicate_user_id",
        failure_rows_query="""
            WITH duplicate_users AS (
                SELECT user_id, COUNT(*) AS count
                FROM users
                GROUP BY user_id
            )
            SELECT user_id
            FROM duplicate_users
            WHERE count > 1""")
    .add_check(
        name="check_invalid_email",
        failure_rows_query="""
            SELECT user_id
            FROM users
            WHERE email NOT LIKE '%_@__%.__%'
        """)
    .add_check(
        name="check_age_outlier",
        failure_rows_query="""
            SELECT user_id, age
            FROM users
            WHERE age NOT BETWEEN 0 AND 120"""))

# Run checks
result = dq.execute()

# Report on results
report = result.report(include_rows=True,
                       include_summary_header=True,
                       fail_only=True)
print(report)

# Control flow
if result.has_failures():
    print("Checks failed. Here we can take custom actions.")
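The core idea behind each check — a failure-rows query that returns the offending rows, with an empty result meaning the check passes — can be sketched with nothing but the standard library's sqlite3. This is a hypothetical standalone illustration of the pattern, not SQLDQ's API; the duplicate check is written in the shorter GROUP BY ... HAVING form, which is equivalent to the CTE version above:

```python
import sqlite3

# In-memory database with the same sample data as above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER, age INTEGER, email TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, 25, "user1@example.com"),
        (2, 150, "user2@example.com"),
        (2, 45, "invalid-email"),
    ],
)

# Each check is just a query; it passes when it returns no rows.
checks = {
    "check_duplicate_user_id": """
        SELECT user_id FROM users
        GROUP BY user_id HAVING COUNT(*) > 1""",
    "check_invalid_email": """
        SELECT user_id FROM users
        WHERE email NOT LIKE '%_@__%.__%'""",
    "check_age_outlier": """
        SELECT user_id, age FROM users
        WHERE age NOT BETWEEN 0 AND 120""",
}

for name, query in checks.items():
    failures = con.execute(query).fetchall()
    status = "FAIL" if failures else "PASS"
    print(f"{name}: {status} {failures}")
```

The LIKE pattern '%_@__%.__%' requires at least one character before the @, at least two after it, a dot, and at least two more characters after the dot — a deliberately rough plausibility filter, not full RFC-compliant email validation.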

Development

# open devcontainer
uv sync
source .venv/bin/activate
make check

