Data Quality checks using SQL
Project description
SQLDQ
SQLDQ is a Data Quality Testing library that keeps it simple and flexible.
Simply use SQL to define your checks.
Support
You can run data quality checks on:
- In-memory:
- Pandas (
.from_duckdb) - Polars (
.from_duckdb) - Pyspark (
.from_pyspark), requirespyspark>=3.4.0
- Pandas (
- Remotely, only results are collected:
- Postgres (
.from_postgresql) - AWS Athena (
.from_athena)
- Postgres (
- Everything else supported by DuckDB
Installation
pip install sqldq / uv pip install sqldq
You also need to install the corresponding libraries for your backend of choice, .e.g. duckdb when using .from_duckdb. sqldq`s error messages will also inform you about missing dependencies.
Examples
To see all of its features and examples for all supported backends, see the demo folder.
The basic workflow is as follows:
from sqldq import SQLDQ
import duckdb
import polars as pl
# Sample data
df_users = pl.DataFrame({
"user_id": [1, 2, 2], # Duplicate user_id=2
"age": [25, 150, 45], # Age 150 is an unplausible outlier
"email": ["user1@example.com",
"user2@example.com",
"invalid-email"], # Invalid email
})
# Connect via DuckDB
con = duckdb.connect()
con.register("users", df_users)
dq = SQLDQ.from_duckdb(connection=con)
# Define DQ checks
dq = (
dq.add_check(
name="check_duplicate_user_id",
failure_rows_query="""
WITH duplicate_users AS (
SELECT user_id, COUNT(*) AS count
FROM users
GROUP BY user_id
)
SELECT user_id
FROM duplicate_users
WHERE count > 1""")
.add_check(
name="check_invalid_email",
failure_rows_query="""
SELECT user_id
FROM users
WHERE email NOT LIKE '%_@__%.__%'
""")
.add_check(
name="check_age_outlier",
failure_rows_query="""
SELECT user_id, age
FROM users
WHERE age NOT BETWEEN 0 AND 120"""))
# Run checks
result = dq.execute()
# Report on results
report = result.report(include_rows=True,
include_summary_header=True,
fail_only=True)
print(report)
# Control flow
if result.has_failures():
print("Checks failed. here we can take custom actions.")
Development
# open devcontainer
uv sync
source .venv/bin/activate
make check
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sqldq-0.2.6.tar.gz.
File metadata
- Download URL: sqldq-0.2.6.tar.gz
- Upload date:
- Size: 13.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e954689d67a31a0f15697eced6dfb52bca298d4dfd9a269c84ab789f764d6f42
|
|
| MD5 |
bdd6fa3b3aa7170c75c5704ba4d341e2
|
|
| BLAKE2b-256 |
83688605bd73d563b237524054a461daa40c6d115d79f0b350fb00e8bb7da14e
|
File details
Details for the file sqldq-0.2.6-py3-none-any.whl.
File metadata
- Download URL: sqldq-0.2.6-py3-none-any.whl
- Upload date:
- Size: 12.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c3012358fe5e8d118d97848d31abcf0b1ca5e4d15c65475ad7512edb15a8ffe9
|
|
| MD5 |
67708c4dfa9f3f75d0d65618b9e19e0c
|
|
| BLAKE2b-256 |
54f6bb07259e60d96834926b5845998d61cebf9cffdc7d067908b36cec7ce6e7
|