A declarative PySpark framework for row- and aggregate-level data quality validation.

SparkDQ — Data Quality Validation for Apache Spark

SparkDQ is a lightweight data quality framework built natively for PySpark. You describe what valid data looks like — declaratively via YAML/JSON or through a type-safe Python API — and it validates your DataFrame at row and aggregate level in a single pass.

Its defining trait is what it leaves out. SparkDQ is intentionally small in scope and low in complexity — a focused set of checks and a single-pass engine, with no metadata store, orchestration layer, or profiling engine to operate. For most pipelines, that is exactly enough, and the reduced complexity is a feature in itself. That focus is what sets it apart: no JVM bridge like PyDeequ, no complexity overhead like Great Expectations, and no platform lock-in like Databricks dqx.

One dependency. No wrappers. No bloat.

Quickstart

Declarative: checks are passed as dicts, which can be loaded from anywhere (YAML files, JSON, databases, or APIs).

from pyspark.sql import SparkSession
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = CheckSet()
check_set.add_checks_from_dicts([
    {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]},
])

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%
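Because the declarative API takes plain dicts, check definitions can live outside the code. The following is a minimal sketch using only the standard-library `json` module. The `load_check_dicts` helper and the `CHECKS_JSON` document are hypothetical names introduced for illustration, and the key validation is based only on the keys visible in the example above (`check`, `check-id`); SparkDQ may well validate these itself.

```python
import json

# Hypothetical JSON document; in practice it could come from a file, a database, or an API.
CHECKS_JSON = """
[
  {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]}
]
"""


def load_check_dicts(text: str) -> list[dict]:
    """Parse a JSON array of check definitions and apply a light sanity check."""
    checks = json.loads(text)
    if not isinstance(checks, list):
        raise ValueError("expected a JSON array of check definitions")
    for entry in checks:
        if not isinstance(entry, dict):
            raise ValueError("each check definition must be a JSON object")
        # Keys taken from the quickstart example; other checks may need more fields.
        missing = {"check", "check-id"} - set(entry)
        if missing:
            raise ValueError(f"check definition is missing keys: {sorted(missing)}")
    return checks


checks = load_check_dicts(CHECKS_JSON)
print(checks[0]["check-id"])

# With SparkDQ installed, the list can be handed over as in the quickstart:
# check_set = CheckSet()
# check_set.add_checks_from_dicts(checks)
```

Keeping the parsing step separate from the SparkDQ call makes it easy to swap the source (file, database row, HTTP response) without touching the validation code.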

Python-native — full type safety and IDE autocompletion:

from pyspark.sql import SparkSession
from sparkdq.checks import NullCheckConfig
from sparkdq.core import Severity
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = (
    CheckSet()
    .add_check(
        NullCheckConfig(
            check_id="no-null-name",
            columns=["name"],
            severity=Severity.CRITICAL,
        )
    )
)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%
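The summary figures can be reproduced by hand. The sketch below is a plain-Python illustration of what a row-level null check on `name` counts for the three example rows; it is not SparkDQ's implementation. Two of three rows is 66.67% unrounded, so the 67.00% shown in the summary would be consistent with the ratio being rounded to two decimals before it is formatted as a percentage. That is an inference from the printed output, not documented behavior.

```python
rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]

# A row fails the null check when the checked column is None.
failed = sum(1 for row in rows if row["name"] is None)
total = len(rows)
passed = total - failed

exact_rate = passed / total
rounded_rate = round(exact_rate, 2)  # assumption: ratio is rounded before formatting

print(f"Total: {total}, passed: {passed}, failed: {failed}")
print(f"Exact pass rate:   {exact_rate:.2%}")    # 66.67%
print(f"Rounded pass rate: {rounded_rate:.2%}")  # 67.00%
```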

SparkDQ ships with 30+ built-in checks across null validation, numeric ranges, string patterns, date boundaries, schema enforcement, uniqueness, and referential integrity.

🚀 See the official documentation to learn more.

Installation

For Local Development / Standalone Clusters

Install with PySpark included:

pip install "sparkdq[spark]"

For Databricks / Managed Platforms

Install without PySpark (runtime provided by platform):

pip install sparkdq

The framework supports Python 3.11+ and is fully tested with PySpark 3.5.x. SparkDQ will automatically check for PySpark availability on import and provide clear error messages if PySpark is missing in your environment.
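If you want to check for PySpark yourself before importing SparkDQ, for example in a deployment smoke test, the standard library is enough. This is a sketch of one common way to write such a guard, not SparkDQ's actual import check; `pyspark_available` and `require_pyspark` are hypothetical helper names.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if the pyspark package can be located, without importing it."""
    return importlib.util.find_spec("pyspark") is not None


def require_pyspark() -> None:
    """Raise an ImportError with an actionable message when PySpark is missing."""
    if not pyspark_available():
        raise ImportError(
            "PySpark was not found in this environment. "
            'Install it with: pip install "sparkdq[spark]"'
        )


print("PySpark available:", pyspark_available())
```

Using `find_spec` avoids starting up PySpark just to learn whether it is installed.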

Why SparkDQ?

Small on purpose: A focused scope and low complexity — quick to learn, hard to misconfigure, and enough for most pipelines
Extensible by design: Add custom checks via a simple plugin system — no changes to the core required
Declarative or Pythonic: YAML/JSON configs or type-safe Python — your choice
Severity-aware: Distinguish between hard failures (CRITICAL) and soft constraints (WARNING)
Row-level and aggregate: Validate individual records and entire datasets in a single pass
Minimal footprint: Only Pydantic required — PySpark is provided by your platform
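To make the severity distinction concrete, here is a small conceptual model in plain Python. It is a sketch under stated assumptions, not SparkDQ code: the `Severity` enum below is a stand-in for `sparkdq.core.Severity`, the `email` column is invented for the example, and the rule that a WARNING violation flags a row without failing it is an assumption drawn from the "hard failures" versus "soft constraints" wording above.

```python
from enum import Enum


class Severity(Enum):  # stand-in for illustration, not sparkdq.core.Severity
    CRITICAL = "critical"
    WARNING = "warning"


rows = [
    {"id": 1, "name": "Alice", "email": None},
    {"id": 2, "name": None, "email": "b@example.com"},
    {"id": 3, "name": "Bob", "email": "c@example.com"},
]

# Each entry: (column that must not be null, severity of a violation).
checks = [("name", Severity.CRITICAL), ("email", Severity.WARNING)]


def classify(row: dict) -> str:
    """Return 'failed', 'warning', or 'passed' for a single row."""
    violated = {severity for column, severity in checks if row[column] is None}
    if Severity.CRITICAL in violated:
        return "failed"  # assumed: a critical violation fails the record
    if Severity.WARNING in violated:
        return "warning"  # assumed: a warning is reported but does not fail the record
    return "passed"


outcomes = [classify(row) for row in rows]
print(outcomes)  # ['warning', 'failed', 'passed']
```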

Let’s Build Better Data Together

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.

Release history

Version	Release date
0.12.1	Jul 4, 2026 (this version)
0.12.0	Jun 7, 2026
0.11.4	Jun 4, 2026
0.11.3	May 29, 2026
0.11.2	May 29, 2026
0.11.1	May 29, 2026
0.11.0	Aug 9, 2025
0.10.0	Jun 1, 2025
0.9.0	May 28, 2025
0.8.1	May 27, 2025
0.8.0	May 27, 2025
0.7.1	May 25, 2025
0.7.0	May 21, 2025
0.6.1	May 19, 2025
0.6.0	May 17, 2025
0.5.2	May 13, 2025
0.5.1	May 12, 2025
0.5.0	May 11, 2025
0.4.0	May 1, 2025
0.3.0	May 1, 2025
0.2.1	Apr 29, 2025
0.1.2	Apr 27, 2025
0.1.1	Apr 27, 2025
0.1.0	Apr 27, 2025

Download files

Source Distribution

sparkdq-0.12.1.tar.gz (279.5 kB)

Uploaded Jul 4, 2026 Source

Built Distribution

sparkdq-0.12.1-py3-none-any.whl (95.6 kB)

Uploaded Jul 4, 2026 Python 3

File details

Details for the file sparkdq-0.12.1.tar.gz.

File metadata

Download URL: sparkdq-0.12.1.tar.gz
Upload date: Jul 4, 2026
Size: 279.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"13","id":"trixie","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sparkdq-0.12.1.tar.gz
Algorithm	Hash digest
SHA256	`cbffbceceadf0690be803b21b789ad3bef4bbee7ea0eaa1af98b14dc8265e9d5`
MD5	`1dc06d7ca4fe1aa8b89a21a6457c5a46`
BLAKE2b-256	`bef95511607913b9e0419187f58629b57bc7d196cfb37a549a02e27bdd081fa6`
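
A downloaded archive can be checked against the SHA256 digest listed above with the standard-library `hashlib` module. This is a minimal sketch; the `sha256_of` and `verify` helpers are hypothetical names, and the local path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for sparkdq-0.12.1.tar.gz.
EXPECTED_SHA256 = "cbffbceceadf0690be803b21b789ad3bef4bbee7ea0eaa1af98b14dc8265e9d5"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()


archive = Path("sparkdq-0.12.1.tar.gz")  # assumed download location
if archive.exists():
    print("hash matches:", verify(archive, EXPECTED_SHA256))
else:
    print(f"{archive} not found; download it first")
```

For installs driven by a requirements file, pip's hash-checking mode (`--require-hashes`) performs the same comparison automatically.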

File details

Details for the file sparkdq-0.12.1-py3-none-any.whl.

File metadata

Download URL: sparkdq-0.12.1-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 95.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Debian GNU/Linux","version":"13","id":"trixie","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sparkdq-0.12.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bc01e27c8ea3762a0e639d590780f00fff6489c39537f02f0438db94df0b1377`
MD5	`38931fac89f7ff62f30965f1077f87d9`
BLAKE2b-256	`b812785f27bc35198462127506aad80a85535fde7d7a4792f95437caa60601c5`
