SQLDQ
SQLDQ is a Data Quality Check library that keeps it simple.
Support
You can run data quality checks on:
- In-memory:
  - Pandas (`.from_duckdb`)
  - Polars (`.from_duckdb`)
  - Pyspark (`.from_pyspark`)
- Remotely, only results are collected:
  - Postgres (`.from_postgresql`)
  - AWS Athena (`.from_athena`)
- Everything else supported by DuckDB
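To make "only results are collected" concrete, here is a minimal sketch of the idea: the check query runs inside the database engine, and only the rows it returns come back to Python. It uses Python's built-in `sqlite3` as a stand-in engine and does not call SQLDQ at all. The `collect_failures` helper and the sample table are hypothetical and exist only for illustration.

```python
import sqlite3


def collect_failures(conn, failure_rows_query):
    """Run the check inside the engine and fetch only the failing rows."""
    return conn.execute(failure_rows_query).fetchall()


# Stand-in for a remote database (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 25), (2, 150), (3, 45)],
)

# The full table never leaves the engine; only the offending rows do.
failures = collect_failures(
    conn,
    "SELECT user_id, age FROM users WHERE age NOT BETWEEN 0 AND 120",
)
print(failures)
```

With a large remote table, the data transferred is proportional to the number of failing rows rather than to the table size.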
Examples
To see all of its features and examples for all supported backends, see the demo folder.
The basic workflow is as follows:

    from sqldq import SQLDQ
    import duckdb
    import polars as pl

    # Sample data
    df_users = pl.DataFrame({
        "user_id": [1, 2, 2],  # Duplicate user_id 2
        "age": [25, 150, 45],  # Age 150 is an outlier
        "email": ["user1@example.com",
                  "user2@example.com",
                  "invalid-email"],  # Invalid email
    })

    # Connect via DuckDB
    con = duckdb.connect()
    con.register("users", df_users)
    dq = SQLDQ.from_duckdb(connection=con)

    # Define DQ checks
    dq = (
        dq.add_check(
            name="check_duplicate_user_id",
            failure_rows_query="""
                WITH duplicate_users AS (
                    SELECT user_id, COUNT(*) AS count
                    FROM users
                    GROUP BY user_id
                )
                SELECT user_id
                FROM duplicate_users
                WHERE count > 1""")
        .add_check(
            name="check_invalid_email",
            failure_rows_query="""
                SELECT user_id
                FROM users
                WHERE email NOT LIKE '%_@__%.__%'
                """)
        .add_check(
            name="check_age_outlier",
            failure_rows_query="""
                SELECT user_id, age
                FROM users
                WHERE age NOT BETWEEN 0 AND 120"""))

    # Run checks
    result = dq.execute()

    # Report on results
    report = result.report(include_rows=True,
                           include_summary_header=True,
                           fail_only=True)
    print(report)

    # Control flow
    if result.has_failures():
        print("Checks failed. Here we can take custom actions.")
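One possible custom action is to stop a pipeline by raising an exception that carries the report. The sketch below assumes only what the example above shows: a result object with `has_failures()` and `report(...)`. The `DataQualityError` class, the `raise_on_failures` helper, and the `_StubResult` stand-in are hypothetical names and are not part of SQLDQ. The stub lets the sketch run without the library installed.

```python
class DataQualityError(RuntimeError):
    """Raised when at least one data quality check fails (hypothetical)."""


def raise_on_failures(result):
    """Stop a pipeline when the checks report failures.

    `result` is any object exposing has_failures() and report(...),
    mirroring the calls used in the example above.
    """
    if result.has_failures():
        report = result.report(
            include_rows=True,
            include_summary_header=True,
            fail_only=True,
        )
        raise DataQualityError(str(report))


class _StubResult:
    """Stand-in for a real result so the sketch runs without SQLDQ."""

    def has_failures(self):
        return True

    def report(self, include_rows, include_summary_header, fail_only):
        return "check_age_outlier: 1 failing row"


try:
    raise_on_failures(_StubResult())
except DataQualityError as exc:
    message = str(exc)
    print(f"Pipeline stopped: {message}")
```

In a real pipeline, the stub would be replaced by the object returned from `dq.execute()`.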
Download files
Source Distribution
- sqldq-0.2.4.tar.gz (13.0 kB)
Built Distribution
- sqldq-0.2.4-py3-none-any.whl (12.1 kB)
File details
Details for the file sqldq-0.2.4.tar.gz.
File metadata
- Download URL: sqldq-0.2.4.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 891aca621ad3f14837d39a8f1f0f3488e74828356c9b21328ca9aea4e42f86c5 |
| MD5 | f689eb48a8ca6b7f5d9ff2d803faf730 |
| BLAKE2b-256 | b68e5b79ba19823a7a0f1ee1d362d320678567af288756840303c02cb60fade8 |
File details
Details for the file sqldq-0.2.4-py3-none-any.whl.
File metadata
- Download URL: sqldq-0.2.4-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25169670754b7d879927aecdc1e6e16559dbc3ac9e0b713ae72eb51a95d48baf |
| MD5 | dae34532278fae815a90bcdc867e7634 |
| BLAKE2b-256 | 75aef865bc9a0b9e699e07859799c4e19aa5f8592f06494219fff396a7626d5e |