SQLDQ

SQLDQ is a data quality check library that keeps things simple: each check is a SQL query that selects the rows failing it.

Support

You can run data quality checks on:

  • In-memory:
    • Pandas (.from_duckdb)
    • Polars (.from_duckdb)
    • Pyspark (.from_pyspark)
  • Remotely, where only the results are collected:
    • Postgres (.from_postgresql)
    • AWS Athena (.from_athena)
  • Everything else supported by DuckDB
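
One way to read "only the results are collected" is that the check query runs inside the database and just the failing rows travel back to the client. The sketch below illustrates that idea with the standard-library sqlite3 module standing in for a remote database. The orders table and its columns are hypothetical, and this is not SQLDQ's own implementation.

```python
import sqlite3

# sqlite3 stands in for a remote database purely for illustration (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

# 10,000 hypothetical rows, two of which carry an invalid negative amount.
rows = [(i, 10.0) for i in range(1, 10_001)]
rows[41] = (42, -5.0)
rows[998] = (999, -1.0)
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# The filter is evaluated by the database engine; the client receives
# only the rows that fail the check, not the whole table.
total_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
failing_rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE amount < 0 ORDER BY order_id"
).fetchall()

print(f"{len(failing_rows)} of {total_rows} rows were transferred")
```

With a real remote backend the same principle keeps network traffic proportional to the number of failures rather than to the table size.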

Examples

The demo folder covers all features and has examples for every supported backend.

The basic workflow is as follows:

from sqldq import SQLDQ
import duckdb
import polars as pl

# Sample data
df_users = pl.DataFrame({
    "user_id": [1, 2, 2],     # user_id 2 is duplicated
    "age": [25, 150, 45],     # Age 150 is an outlier
    "email": ["user1@example.com",
              "user2@example.com",
              "invalid-email"],  # Invalid email
})

# Connect via DuckDB
con = duckdb.connect()
con.register("users", df_users)

dq = SQLDQ.from_duckdb(connection=con)

# Define DQ checks
dq = (
    dq.add_check(
        name="check_duplicate_user_id",
        failure_rows_query="""
            WITH duplicate_users AS (
                SELECT user_id, COUNT(*) AS count
                FROM users
                GROUP BY user_id
            )
            SELECT user_id
            FROM duplicate_users
            WHERE count > 1
        """)
    .add_check(
        name="check_invalid_email",
        failure_rows_query="""
            SELECT user_id
            FROM users
            WHERE email NOT LIKE '%_@__%.__%'
        """)
    .add_check(
        name="check_age_outlier",
        failure_rows_query="""
            SELECT user_id, age
            FROM users
            WHERE age NOT BETWEEN 0 AND 120"""))

# Run checks
result = dq.execute()

# Report on results
report = result.report(include_rows=True,
                       include_summary_header=True,
                       fail_only=True)
print(report)

# Control flow: react to failed checks
if result.has_failures():
    print("Checks failed. Custom actions can be taken here.")
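
The workflow above rests on one convention, as suggested by the failure_rows_query parameter: a check passes when its query returns no rows. The sketch below mimics that convention with the standard library only, so it runs without SQLDQ or DuckDB installed. The run_checks helper, the checks dictionary and the extra check_missing_user_id check are hypothetical names introduced for illustration. They are not part of SQLDQ's API.

```python
import sqlite3

# Same sample data as above, loaded into sqlite3 so the sketch has no
# third-party dependencies (an assumption made only for this illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, 25, "user1@example.com"),
        (2, 150, "user2@example.com"),
        (2, 45, "invalid-email"),
    ],
)

# Hypothetical registry: check name -> query selecting the offending rows.
checks = {
    "check_duplicate_user_id":
        "SELECT user_id FROM users GROUP BY user_id HAVING COUNT(*) > 1",
    "check_invalid_email":
        "SELECT user_id FROM users WHERE email NOT LIKE '%_@__%.__%'",
    "check_age_outlier":
        "SELECT user_id, age FROM users WHERE age NOT BETWEEN 0 AND 120",
    "check_missing_user_id":
        "SELECT user_id FROM users WHERE user_id IS NULL",
}


def run_checks(connection, checks):
    """Run every query and return {check name: list of failing rows}."""
    return {
        name: connection.execute(sql).fetchall()
        for name, sql in checks.items()
    }


failures = run_checks(conn, checks)

# Keep only checks that returned rows, similar in spirit to fail_only=True.
failed_only = {name: rows for name, rows in failures.items() if rows}

for name, rows in failed_only.items():
    print(f"{name}: {len(rows)} failing row(s): {rows}")

# Control flow: a pipeline could stop, alert or quarantine data here.
if failed_only:
    print("Checks failed.")
```

Because every check is plain SQL, adding a rule means adding one query, and the failing rows themselves are available for reporting or debugging.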
