Data quality checks for Databricks — with a built-in notebook UI

DashDQ — Data Quality for Databricks

DashDQ is a Databricks-native data quality library. It provides an interactive notebook wizard and a Python API to run 60+ production-ready checks on Unity Catalog tables — built entirely on PySpark and the Databricks SDK. No Great Expectations or external DQ frameworks required.

Part of the Dashlibs suite for Databricks — built by Darshan Shah.

Features

60+ native checks — completeness, accuracy, integrity, consistency — pure PySpark, no external DQ library
Interactive wizard (configure()) — 3-tab notebook UI powered by ipywidgets
GE-compatible naming — check names follow Great Expectations conventions
DQX-inspired checks — freshness, email/URL/IPv4/UUID validation, custom SQL filter
Cross-table checks — foreign key validation, referential integrity, row count comparison across tables
Composite key checks — multi-column PK and uniqueness validation
Custom SQL filter — write a SQL WHERE clause; rows matching it are marked FAILED
Flexible output — Delta table, Databricks Volume (JSON/CSV), or DataFrame
Summary output — 1 row per check × column with pass/fail counts, %, column coverage, and metadata headers
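
The custom SQL filter inverts the usual reading of a WHERE clause: rows that match are the failures. The sketch below illustrates that semantics with the standard-library sqlite3 module standing in for Spark SQL. The table `t`, the helper `custom_sql_filter_counts`, and the sample rows are hypothetical and not part of DashDQ.

```python
import sqlite3

def custom_sql_filter_counts(rows, sql_filter):
    """Return (total, failed, passed); rows matching sql_filter count as failed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (customer_id INTEGER, annual_income_aed REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    # The filter is interpolated as-is, so it must come from a trusted author.
    failed = conn.execute(f"SELECT COUNT(*) FROM t WHERE {sql_filter}").fetchone()[0]
    conn.close()
    return total, failed, total - failed

sample_rows = [(1, 50000.0), (2, -10.0), (3, 250000000.0), (4, 120000.0)]
total, failed, passed = custom_sql_filter_counts(
    sample_rows,
    "annual_income_aed < 0 OR annual_income_aed > 100000000",
)
print(total, failed, passed)
```

Writing the filter as a description of bad rows keeps it short: here two of the four sample rows match, so they are the two failures.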

Install

# Inside a Databricks notebook
%pip install dash-dq
dbutils.library.restartPython()

# Locally (Python 3.9+)
pip install dash-dq

Quickstart

Option 1 — 2-cell wizard

# Cell 1: open the configuration wizard
import dashdq
config = dashdq.configure()

The wizard opens below the cell. Enter your table name → Load Columns → add checks → choose output → Save Config.

# Cell 2: run checks (after clicking Save Config above)
report = dashdq.run_checks(config)

Option 2 — all-in-one

import dashdq
dashdq.launch()

Option 3 — pure Python API (no UI)

import dashdq

config = {
    "source": {"table": "catalog.schema.dim_customer"},
    "metadata": {
        "data_owner":      "Jane Smith",
        "data_steward":    "John Doe",
        "business_domain": "Finance",
        "description":     "Customer master data quality run",
    },
    "checks": [
        # Completeness
        {"check_name": "expect_column_values_to_not_be_null",
         "column": "customer_id", "threshold_pct": 100.0, "params": {}},

        # Accuracy — range
        {"check_name": "expect_column_values_to_be_between",
         "column": "age", "threshold_pct": 99.0,
         "params": {"min_value": 18, "max_value": 120}},

        # Accuracy — format
        {"check_name": "expect_column_values_to_be_valid_email",
         "column": "email_address", "threshold_pct": 98.0, "params": {}},

        # Accuracy — freshness
        {"check_name": "expect_column_values_to_be_not_older_than_n_days",
         "column": "last_updated", "threshold_pct": 100.0,
         "params": {"n_days": 7}},

        # Accuracy — custom SQL filter
        {"check_name": "expect_column_values_to_pass_custom_sql_filter",
         "column": "annual_income_aed", "threshold_pct": 100.0,
         "params": {"sql_filter": "annual_income_aed < 0 OR annual_income_aed > 100000000"}},

        # Integrity — primary key
        {"check_name": "expect_primary_key_to_be_valid",
         "column": "_COMPOUND_", "threshold_pct": 100.0,
         "params": {"columns": ["customer_id"]}},

        # Integrity — foreign key
        {"check_name": "expect_column_values_to_exist_in_reference_table",
         "column": "branch_id", "threshold_pct": 100.0,
         "params": {
             "reference_table": "catalog.schema.dim_branch",
             "reference_column": "branch_id",
         }},

        # Consistency — table level
        {"check_name": "expect_table_row_count_to_be_between",
         "column": "_TABLE_LEVEL_", "threshold_pct": 100.0,
         "params": {"min_value": 10000, "max_value": 100000}},
    ],
    "output": {
        "type": "delta",
        "delta_table": "catalog.schema.dq_results",
    },
}

report = dashdq.run_checks(config)
report.display()
print(report.summary())
# {'total_checks': 8, 'passed': 8, 'failed': 0, 'pass_rate_pct': 100.0}
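
Each entry in "checks" repeats the same four keys, so a small builder can cut down on typos. The following is a minimal sketch, assuming only the dictionary shape shown above; `make_check` is a hypothetical helper and not a DashDQ function.

```python
def make_check(check_name, column, threshold_pct=100.0, **params):
    """Build one check entry in the shape used by the config above."""
    if not 0.0 <= threshold_pct <= 100.0:
        raise ValueError(f"threshold_pct must be within 0-100, got {threshold_pct}")
    return {
        "check_name": check_name,
        "column": column,
        "threshold_pct": float(threshold_pct),
        "params": params,  # empty dict when the check takes no parameters
    }

checks = [
    make_check("expect_column_values_to_not_be_null", "customer_id"),
    make_check("expect_column_values_to_be_between", "age", 99.0,
               min_value=18, max_value=120),
    make_check("expect_primary_key_to_be_valid", "_COMPOUND_",
               columns=["customer_id"]),
]
print(checks[1])
```

The resulting list can be placed under the "checks" key of the config dictionary.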

Screenshots

Note: Replace these placeholders with screenshots from your Databricks workspace.

Tab 1 — Source & Metadata

Enter a fully-qualified table name (e.g. ai_innovation_gold_dev.sdh.dim_customer), click Load Columns to pull the column list from Spark, and optionally fill in Data Owner, Data Steward, Business Domain, and Description fields.

Tab 2 — Checks Builder

Select a column, pick a check from the dropdown (grouped by DQ dimension), set a pass threshold %, fill in parameters (fields appear dynamically), and click ＋ Add Check. A live table shows all configured checks with coloured dimension badges.

Tab 3 — Output

Choose between DataFrame only, Delta Table, Volume — JSON, or Volume — CSV. The filename is auto-suggested as dq_{table}_{date}.

Results

One row per check × column combination, with a summary banner showing total pass/fail counts and pass rate %.
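
The numbers in the summary banner can be derived from the result rows alone. A minimal sketch, assuming each row carries a `status` of PASS or FAIL as in the output schema; `summarize` is a hypothetical stand-in, and how DashDQ rounds the rate or treats an empty run is not stated in this README.

```python
def summarize(result_rows):
    """Aggregate per-check statuses into the four summary figures."""
    total = len(result_rows)
    passed = sum(1 for row in result_rows if row["status"] == "PASS")
    # Assumption: an empty run reports a 0.0 pass rate.
    rate = round(100.0 * passed / total, 2) if total else 0.0
    return {
        "total_checks": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate_pct": rate,
    }

example_rows = [
    {"check_name": "expect_column_values_to_not_be_null", "status": "PASS"},
    {"check_name": "expect_column_values_to_be_between", "status": "PASS"},
    {"check_name": "expect_column_values_to_be_valid_email", "status": "FAIL"},
    {"check_name": "expect_primary_key_to_be_valid", "status": "PASS"},
]
print(summarize(example_rows))
```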

To add screenshots: run dashdq.configure() in your Databricks workspace, take screenshots of each tab, and save them to docs/screenshots/.

Output schema

Column	Type	Description
`table_name`	string	Fully qualified source table
`column_name`	string	Column checked (`_TABLE_LEVEL_` or `_COMPOUND_` for multi-column checks)
`check_name`	string	GE-style check name
`dq_dimension`	string	Completeness / Accuracy / Integrity / Consistency
`total_rows`	int	Total rows evaluated
`passed_rows`	int	Rows that passed
`failed_rows`	int	Rows that failed
`passed_pct`	float	`passed_rows / total_rows × 100`
`threshold_pct`	float	Minimum pass % required (FAIL if below)
`status`	string	`PASS` or `FAIL`
`check_params`	string	JSON string of check parameters used
`run_timestamp`	string	ISO timestamp of the run
`data_owner`	string	Optional metadata header
`data_steward`	string	Optional metadata header
`business_domain`	string	Optional metadata header
`table_description`	string	Optional metadata header
`columns_checked`	int	Distinct columns covered by checks in this run
`total_columns`	int	Total columns in the source table
`column_coverage_pct`	float	`columns_checked / total_columns × 100`
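
The derived columns follow directly from the counts. The sketch below reproduces the two formulas and the threshold rule from the table; `evaluate_check` and `column_coverage_pct` are hypothetical helpers, and the handling of zero-row or zero-column tables is an assumption.

```python
def evaluate_check(total_rows, failed_rows, threshold_pct):
    """Compute passed_rows, passed_pct and status for one result row."""
    passed_rows = total_rows - failed_rows
    # Assumption: an empty table counts as 100% passed; DashDQ may differ.
    passed_pct = 100.0 * passed_rows / total_rows if total_rows else 100.0
    # FAIL only when the pass percentage is below the threshold.
    status = "PASS" if passed_pct >= threshold_pct else "FAIL"
    return {"passed_rows": passed_rows, "passed_pct": passed_pct, "status": status}

def column_coverage_pct(columns_checked, total_columns):
    """columns_checked / total_columns x 100, guarding against zero columns."""
    return 100.0 * columns_checked / total_columns if total_columns else 0.0

print(evaluate_check(total_rows=200, failed_rows=3, threshold_pct=99.0))
print(column_coverage_pct(columns_checked=8, total_columns=20))
```

With 3 failures in 200 rows the pass rate is 98.5%, which is below a 99% threshold, so the check fails; with 2 failures it lands exactly on the threshold and passes.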

Check catalog (60+)

Completeness (5)

Check	Params	Description
`expect_column_values_to_not_be_null`	—	Values must not be null
`expect_column_values_to_be_null`	—	Values must be null
`expect_column_values_to_not_be_null_or_empty`	—	Not null AND not empty/whitespace
`expect_column_null_count_to_be_between`	min_value, max_value	Null count in range
`expect_column_null_proportion_to_be_between`	min_value, max_value	Null proportion (0–1) in range
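
The difference between the null checks is easiest to see on a small sample. This is a plain-Python sketch of the row-level and aggregate rules described above, not the PySpark implementation; the helper names are hypothetical.

```python
def is_null_or_empty(value):
    """True for None, empty strings, and whitespace-only strings."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def null_proportion(values):
    """Fraction of values that are None, on a 0-1 scale."""
    if not values:
        return 0.0  # assumption: an empty column has no nulls
    return sum(1 for v in values if v is None) / len(values)

sample_values = ["a", None, "  ", "", "b"]
null_or_empty_failures = [v for v in sample_values if is_null_or_empty(v)]
proportion = null_proportion(sample_values)
in_range = 0.0 <= proportion <= 0.25  # min_value=0.0, max_value=0.25
print(len(null_or_empty_failures), proportion, in_range)
```

Only one value is null, but three fail the stricter not-null-or-empty rule because blank and whitespace-only strings count as missing.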

Accuracy — value (13)

Check	Params	Description
`expect_column_values_to_be_between`	min_value, max_value	Values in numeric range
`expect_column_values_to_not_be_between`	min_value, max_value	Values outside numeric range
`expect_column_values_to_be_in_set`	value_set	Values in allowed list
`expect_column_values_to_not_be_in_set`	value_set	Values not in forbidden list
`expect_column_values_to_equal`	value	All values == constant
`expect_column_values_to_not_equal`	value	No values == constant
`expect_column_values_to_be_not_less_than`	min_value	Values >= min
`expect_column_values_to_be_not_greater_than`	max_value	Values <= max
`expect_column_values_to_be_positive`	—	Values > 0
`expect_column_values_to_be_negative`	—	Values < 0
`expect_column_values_to_be_non_negative`	—	Values >= 0
`expect_column_values_to_be_increasing`	—	Non-decreasing order
`expect_column_values_to_be_decreasing`	—	Non-increasing order

Accuracy — string & pattern (15)

Check	Params	Description
`expect_column_values_to_not_be_empty_string`	—	Not blank/whitespace
`expect_column_values_to_match_regex`	regex	Matches regex
`expect_column_values_to_not_match_regex`	regex	Does not match regex
`expect_column_values_to_match_regex_list`	regex_list	Matches any regex in list
`expect_column_values_to_match_like_pattern`	like_pattern	SQL LIKE pattern (%, _)
`expect_column_values_to_not_match_like_pattern`	like_pattern	Does not match LIKE pattern
`expect_column_value_lengths_to_be_between`	min_value, max_value	String length in range
`expect_column_value_lengths_to_equal`	value	Exact string length
`expect_column_values_to_be_of_type`	type_	dtype contains type string
`expect_column_values_to_be_in_type_list`	type_list	dtype in list
`expect_column_values_to_be_valid_email`	—	Valid email address
`expect_column_values_to_be_valid_url`	—	Valid HTTP/HTTPS URL
`expect_column_values_to_be_valid_ipv4`	—	Valid IPv4 address
`expect_column_values_to_be_valid_uuid`	—	Valid UUID
`expect_column_values_to_be_json_parseable`	—	Valid JSON string
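
For a rough idea of what the format checks accept, the sketch below uses only the Python standard library. The patterns DashDQ actually applies are not documented here, so the email regex and every helper name are assumptions; in particular, `uuid.UUID` is more lenient than a strict hyphenated pattern.

```python
import ipaddress
import json
import re
import uuid
from urllib.parse import urlparse

# Assumption: a deliberately simple email shape (local@domain.tld).
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_email(value):
    return EMAIL_RE.fullmatch(value) is not None

def is_valid_url(value):
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def is_valid_ipv4(value):
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

def is_valid_uuid(value):
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

def is_json_parseable(value):
    try:
        json.loads(value)
        return True
    except ValueError:
        return False

print(is_valid_email("jane@example.com"), is_valid_ipv4("256.1.1.1"))
```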

Accuracy — date & time (7)

Check	Params	Description
`expect_column_values_to_match_strftime_format`	strftime_format	Date format match
`expect_column_values_to_be_dateutil_parseable`	—	Parseable as a date (multiple formats)
`expect_column_values_to_not_be_in_future`	—	Date is not in the future
`expect_column_values_to_be_not_older_than_n_days`	n_days	Date within last N days
`expect_column_values_to_not_be_in_near_future`	n_days	Not within next N days
`expect_column_data_to_be_fresh`	n_minutes	Latest value within N minutes of now
`expect_column_values_to_pass_custom_sql_filter`	sql_filter	Rows matching WHERE clause = FAILED
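
The date checks differ in whether they judge every row or only the latest value. A minimal sketch with the standard datetime module, assuming timezone-aware timestamps and inclusive boundaries; the function names are hypothetical and the reference time is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone

def is_not_older_than_n_days(value, n_days, now):
    """Row-level: the value lies within the last n_days."""
    return value >= now - timedelta(days=n_days)

def is_not_in_future(value, now):
    """Row-level: the value is not later than now."""
    return value <= now

def is_data_fresh(values, n_minutes, now):
    """Column-level: only the latest value has to be recent."""
    return max(values) >= now - timedelta(minutes=n_minutes)

now = datetime(2026, 6, 29, 12, 0, tzinfo=timezone.utc)
last_updated = [now - timedelta(days=1), now - timedelta(days=10)]

row_results = [is_not_older_than_n_days(v, 7, now) for v in last_updated]
print(row_results, is_data_fresh(last_updated, 60, now))
```

The 10-day-old row fails the 7-day rule on its own, while the freshness check looks at the newest value only and fails here because that value is a full day old.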

Accuracy — aggregate (11)

Check	Params	Description
`expect_column_mean_to_be_between`	min_value, max_value	Mean in range
`expect_column_median_to_be_between`	min_value, max_value	Median in range
`expect_column_stdev_to_be_between`	min_value, max_value	Std deviation in range
`expect_column_max_to_be_between`	min_value, max_value	Max in range
`expect_column_min_to_be_between`	min_value, max_value	Min in range
`expect_column_sum_to_be_between`	min_value, max_value	Sum in range
`expect_column_most_common_value_to_be_in_set`	value_set	Mode in allowed set
`expect_column_quantile_value_to_be_between`	quantile, min_value, max_value	Quantile in range
`expect_column_distinct_values_to_be_in_set`	value_set	All distinct values in set
`expect_column_distinct_values_to_contain_set`	value_set	Distinct values include all of set
`expect_column_distinct_values_to_equal_set`	value_set	Distinct values == set exactly
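
The three distinct-value checks correspond to subset, superset, and equality on sets. The sketch below shows the distinction in plain Python; the function names are hypothetical, and how DashDQ treats nulls in these comparisons is not stated, so the sample contains none.

```python
def distinct_values_in_set(values, value_set):
    """Every distinct value is allowed (subset)."""
    return set(values) <= set(value_set)

def distinct_values_contain_set(values, value_set):
    """Every expected value appears at least once (superset)."""
    return set(values) >= set(value_set)

def distinct_values_equal_set(values, value_set):
    """Distinct values match the expected set exactly."""
    return set(values) == set(value_set)

country_codes = ["AE", "SA", "AE"]
print(distinct_values_in_set(country_codes, ["AE", "SA", "QA"]))
print(distinct_values_contain_set(country_codes, ["AE", "SA", "QA"]))
print(distinct_values_equal_set(country_codes, ["AE", "SA"]))
```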

Integrity (10)

Check	Params	Description
`expect_column_values_to_be_unique`	—	No duplicates
`expect_column_unique_value_count_to_be_between`	min_value, max_value	Distinct count in range
`expect_column_proportion_of_unique_values_to_be_between`	min_value, max_value	Uniqueness ratio in range
`expect_column_pair_values_to_be_equal`	column_b	colA == colB row-wise
`expect_column_pair_values_a_to_be_greater_than_b`	column_b	colA > colB row-wise
`expect_column_pair_values_to_be_in_set`	column_b, valid_pairs	(colA, colB) pairs in allowed set
`expect_compound_columns_to_be_unique`	columns	Multi-column combination unique
`expect_primary_key_to_be_valid`	columns	PK: not null + unique (single or composite)
`expect_column_values_to_exist_in_reference_table`	reference_table, reference_column	Foreign key check
`expect_referential_integrity`	reference_table, reference_column, check_orphans	Full RI, optionally bidirectional
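
The primary key and foreign key rules can be sketched on a list of dictionaries. This illustrates the rules as described in the table (not null plus unique, and existence in a reference column); it is not the PySpark implementation. Flagging every member of a duplicate group and treating a null foreign key as a failure are assumptions, and the helper names are hypothetical.

```python
from collections import Counter

def primary_key_failures(rows, columns):
    """Indexes of rows whose key has a null part or is duplicated."""
    keys = [tuple(row.get(col) for col in columns) for row in rows]
    counts = Counter(keys)
    return [
        i for i, key in enumerate(keys)
        if any(part is None for part in key) or counts[key] > 1
    ]

def foreign_key_failures(rows, column, reference_values):
    """Indexes of rows whose value is absent from the reference column."""
    reference = set(reference_values)
    return [i for i, row in enumerate(rows) if row.get(column) not in reference]

customers = [
    {"customer_id": 1, "branch_id": "A"},
    {"customer_id": 1, "branch_id": "B"},
    {"customer_id": 1, "branch_id": "A"},
    {"customer_id": None, "branch_id": "C"},
]
print(primary_key_failures(customers, ["customer_id", "branch_id"]))
print(foreign_key_failures(customers, "branch_id", ["A", "B"]))
```

With the composite key, rows 0 and 2 collide and row 3 has a null part; row 1 is valid because the pair (1, "B") is unique even though `customer_id` alone repeats.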

Consistency — table level (10)

Check	Params	Description
`expect_table_row_count_to_be_between`	min_value, max_value	Row count in range
`expect_table_row_count_to_equal`	value	Exact row count
`expect_table_column_count_to_be_between`	min_value, max_value	Column count in range
`expect_table_column_count_to_equal`	value	Exact column count
`expect_column_to_exist`	—	Column exists in table
`expect_table_columns_to_match_set`	column_set	Column names == expected set
`expect_table_columns_to_match_ordered_list`	column_list	Columns in exact order
`expect_multicolumn_sum_to_equal`	columns, sum_value	Sum across columns == value
`expect_table_row_count_to_equal_other_table`	reference_table	Row count == other table's count
`expect_table_schema_to_match`	expected_schema	Schema (names + types) match
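
The column and schema checks differ mainly in whether order and types matter. A minimal sketch of the three comparisons, assuming the schema is given as (name, type) pairs; the format DashDQ expects for `expected_schema` is not documented in this README, and the function names are hypothetical.

```python
def columns_match_set(actual_columns, column_set):
    """Same column names, order ignored."""
    return set(actual_columns) == set(column_set)

def columns_match_ordered_list(actual_columns, column_list):
    """Same column names in the same order."""
    return list(actual_columns) == list(column_list)

def schema_matches(actual_schema, expected_schema):
    """Same names and types; both arguments are (name, type) pairs."""
    return dict(actual_schema) == dict(expected_schema)

actual = [("customer_id", "bigint"), ("email_address", "string")]
actual_names = [name for name, _ in actual]

print(columns_match_set(actual_names, ["email_address", "customer_id"]))
print(columns_match_ordered_list(actual_names, ["email_address", "customer_id"]))
print(schema_matches(actual, [("customer_id", "int"), ("email_address", "string")]))
```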

DQ Dimensions

Dimension	Focus
Completeness	Are all required values present?
Accuracy	Are values correct, valid, and within expected ranges?
Integrity	Are relationships and constraints maintained?
Consistency	Is data consistent across columns, tables, and time?

Contributing

PRs require review — direct pushes to main are blocked on all Dashlibs repos.

git clone https://github.com/dash-libs/dash-dq
cd dash-dq
pip install -e ".[dev]"
pytest tests/ -v

License

Apache 2.0 — see LICENSE.

Project details

Maintainers: dash-libs

Release history

Version	Released
0.1.16	Jun 29, 2026 (this version)
0.1.15	Jun 26, 2026
0.1.14	Jun 26, 2026
0.1.12	Jun 26, 2026

Download files

Source Distribution

dash_dq-0.1.15.tar.gz (866.5 kB), uploaded Jun 26, 2026, Source

Built Distribution

dash_dq-0.1.15-py3-none-any.whl (34.4 kB), uploaded Jun 26, 2026, Python 3

File details

Details for the file dash_dq-0.1.15.tar.gz.

File metadata

Download URL: dash_dq-0.1.15.tar.gz
Upload date: Jun 26, 2026
Size: 866.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dash_dq-0.1.15.tar.gz
Algorithm	Hash digest
SHA256	`01c26e78304c53e1785d31d967d9165a44b59db2b427c018406b5243124db7d0`
MD5	`3cedda415b6a57b96715d1aadd370832`
BLAKE2b-256	`4bcd13eab89ef331ad993a5536b5c4c010e231eac8e3aef13d251084b9c12d4a`

Provenance

The following attestation bundles were made for dash_dq-0.1.15.tar.gz:

Publisher: release.yml on dash-libs/dash-dq

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dash_dq-0.1.15.tar.gz
- Subject digest: 01c26e78304c53e1785d31d967d9165a44b59db2b427c018406b5243124db7d0
- Sigstore transparency entry: 1967711981
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: dash-libs/dash-dq@375eb537728519d8ec8c0c56c42774ecd0bccd02
- Branch / Tag: refs/heads/main
- Owner: https://github.com/dash-libs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@375eb537728519d8ec8c0c56c42774ecd0bccd02
- Trigger Event: workflow_dispatch

File details

Details for the file dash_dq-0.1.15-py3-none-any.whl.

File metadata

Download URL: dash_dq-0.1.15-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 34.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dash_dq-0.1.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f061169029dcfaf306c718be625f57eda468b2b4edbd418f72c401e4efebc8e0`
MD5	`2a9d3be11bcc90b802582d2027c529c6`
BLAKE2b-256	`fd898475fc7816b90eccd254a76c07880396da44edf942bf4a16dd7804bcbd6c`

Provenance

The following attestation bundles were made for dash_dq-0.1.15-py3-none-any.whl:

Publisher: release.yml on dash-libs/dash-dq

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dash_dq-0.1.15-py3-none-any.whl
- Subject digest: f061169029dcfaf306c718be625f57eda468b2b4edbd418f72c401e4efebc8e0
- Sigstore transparency entry: 1967712058
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: dash-libs/dash-dq@375eb537728519d8ec8c0c56c42774ecd0bccd02
- Branch / Tag: refs/heads/main
- Owner: https://github.com/dash-libs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@375eb537728519d8ec8c0c56c42774ecd0bccd02
- Trigger Event: workflow_dispatch
