A declarative PySpark framework for row- and aggregate-level data quality validation.

SparkDQ — Data Quality Validation for Apache Spark

Most data quality frameworks weren’t designed with PySpark in mind. They aren’t Spark-native and often lack proper support for declarative pipelines, so you end up building custom wrappers around them just to fit them into production workflows. That adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing, so you can’t react dynamically or fail early when data issues occur.

SparkDQ takes a different approach. It’s built specifically for PySpark, so you can define and run data quality checks directly inside your Spark pipelines, using Python. Whether you're validating incoming data, verifying outputs before persistence, or enforcing assumptions in your dataflow, SparkDQ helps you catch issues early without adding complexity.

🚀 See the official documentation to learn more.

Quickstart Examples

SparkDQ lets you define checks either using a Python-native interface or via declarative configuration (e.g. YAML, JSON, or database-driven). Regardless of how you define them, all checks are added to a CheckSet — which you pass to the validation engine. That’s it! Choose the style that fits your use case, and SparkDQ takes care of the rest.

Python-Native Approach

from pyspark.sql import SparkSession

from sparkdq.checks import NullCheckConfig
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
])

# Define checks using the Python-native interface (no external config needed)
check_set = CheckSet()
check_set.add_check(NullCheckConfig(check_id="my-null-check", columns=["name"]))

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())

Declarative Approach

from pyspark.sql import SparkSession

from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

# Declarative configuration via dictionary
# Could be loaded from YAML, JSON, or any external system
check_definitions = [
    {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]},
]
check_set = CheckSet()
check_set.add_checks_from_dicts(check_definitions)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())

SparkDQ is designed to integrate seamlessly into real-world systems. Instead of relying on a custom DSL or rigid schemas, it accepts plain Python dictionaries for check definitions. This makes it easy to load checks from YAML or JSON files, configuration tables in databases, or even remote APIs — enabling smooth integration into orchestration tools, CI pipelines, and data contract workflows.
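
As a rough illustration of the point above, the sketch below parses check definitions from a JSON document into the plain dictionaries that the declarative example uses. The `load_check_definitions` helper, the `CHECKS_JSON` sample, and the set of required keys are assumptions made for this example (the keys are inferred from the dictionary shown earlier, not taken from SparkDQ's schema); only the standard library is used.

```python
import json

# Hypothetical config document; the keys mirror the dictionary example above.
CHECKS_JSON = """
[
  {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]}
]
"""

# Assumption: inferred from the example above, not from SparkDQ's own schema.
REQUIRED_KEYS = {"check-id", "check"}


def load_check_definitions(text: str) -> list[dict]:
    """Parse a JSON array of check definitions and verify their basic shape."""
    definitions = json.loads(text)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of check definitions")
    for index, definition in enumerate(definitions):
        if not isinstance(definition, dict):
            raise ValueError(f"definition {index} is not a JSON object")
        missing = REQUIRED_KEYS - definition.keys()
        if missing:
            raise ValueError(f"definition {index} is missing keys: {sorted(missing)}")
    return definitions


check_definitions = load_check_definitions(CHECKS_JSON)
print(check_definitions)
```

The resulting list has the same shape as `check_definitions` in the declarative example, so it can be handed to `CheckSet.add_checks_from_dicts` in the same way.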

Installation

Install the latest stable version using pip:

pip install sparkdq

Alternatively, if you're using uv, a fast and modern Python package manager:

uv add sparkdq

The framework supports Python 3.10+ and is fully tested with PySpark 3.5.x. No additional Spark installation is required when running inside environments like Databricks, AWS Glue, or EMR.

Why SparkDQ?

✅ Robust Validation Layer: Clean separation of check definition, execution, and reporting
✅ Declarative or Programmatic: Define checks via config files or directly in Python
✅ Severity-Aware: Built-in distinction between warning and critical violations
✅ Row & Aggregate Logic: Supports both record-level and dataset-wide constraints
✅ Typed & Tested: Built with type safety, testability, and extensibility in mind
✅ Zero Overhead: Pure PySpark, no heavy dependencies
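
To make the row-versus-aggregate and severity distinctions in the list above concrete, here is a conceptual sketch in plain Python. It is not SparkDQ's implementation or API: the `Severity` enum, the `CheckResult` dataclass, and the check names are hypothetical and exist only to show the idea on an in-memory list instead of a Spark DataFrame.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels, used only for this illustration."""
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class CheckResult:
    """Hypothetical result record: one entry per check."""
    check_id: str
    severity: Severity
    passed: bool


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]

# Row-level logic: evaluated per record, so every row gets its own verdict.
row_flags = [row["name"] is not None for row in rows]

# Aggregate-level logic: evaluated once for the dataset as a whole.
min_rows = 5
results = [
    CheckResult("name-not-null", Severity.CRITICAL, all(row_flags)),
    CheckResult("min-row-count", Severity.WARNING, len(rows) >= min_rows),
]

# Severity decides how a failure is treated: critical failures would typically
# stop the pipeline, while warnings are only reported.
critical_failures = [
    r.check_id for r in results if not r.passed and r.severity is Severity.CRITICAL
]
warnings = [
    r.check_id for r in results if not r.passed and r.severity is Severity.WARNING
]

print(row_flags)          # [True, False, True]
print(critical_failures)  # ['name-not-null']
print(warnings)           # ['min-row-count']
```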

Typical Use Cases

SparkDQ is built for modern data platforms that demand trust, transparency, and resilience. It helps teams enforce quality standards early and consistently — across ingestion, transformation, and delivery layers.

Whether you're building a real-time ingestion pipeline or curating a data product for thousands of downstream users, SparkDQ lets you define and execute checks that are precise, scalable, and easy to maintain.

Common Scenarios:

✅ Validating raw ingestion data
✅ Enforcing schema and content rules before persisting to a lakehouse (Delta, Iceberg, Hudi)
✅ Asserting quality conditions before analytics or ML training jobs
✅ Flagging critical violations in batch pipelines via structured summaries and alerts
✅ Driving Data Contracts: Use declarative checks in CI pipelines to catch issues before deployment
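
For the CI scenario in the list above, one possible pre-deployment step is to lint the check definitions themselves before any Spark job runs. The sketch below assumes that `check-id` values should be unique within a definition file; that rule, along with the `duplicate_check_ids` and `lint_exit_code` helpers, is an assumption for illustration and is not taken from SparkDQ.

```python
from collections import Counter

# Sample definitions in the dictionary shape used by the declarative example.
# The second entry deliberately reuses the first entry's check-id.
definitions = [
    {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]},
    {"check-id": "my-null-check", "check": "null-check", "columns": ["id"]},
]


def duplicate_check_ids(definitions: list[dict]) -> list[str]:
    """Return every check-id that appears more than once, sorted."""
    counts = Counter(definition["check-id"] for definition in definitions)
    return sorted(check_id for check_id, count in counts.items() if count > 1)


def lint_exit_code(definitions: list[dict]) -> int:
    """Report duplicates and return a process-style exit code (0 = clean)."""
    duplicates = duplicate_check_ids(definitions)
    for check_id in duplicates:
        print(f"duplicate check-id: {check_id}")
    return 1 if duplicates else 0


exit_code = lint_exit_code(definitions)
print(exit_code)  # 1
```

In a CI job, the returned value could be passed to `sys.exit` so that a non-zero code fails the build.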

Let’s Build Better Data Together

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.

Release history

Version	Release date
0.11.0	Aug 9, 2025
0.10.0	Jun 1, 2025
0.9.0	May 28, 2025
0.8.1	May 27, 2025
0.8.0	May 27, 2025
0.7.1	May 25, 2025
0.7.0	May 21, 2025
0.6.1	May 19, 2025
0.6.0	May 17, 2025
0.5.2	May 13, 2025
0.5.1	May 12, 2025
0.5.0	May 11, 2025 (this version)
0.4.0	May 1, 2025
0.3.0	May 1, 2025
0.2.1	Apr 29, 2025
0.1.2	Apr 27, 2025
0.1.1	Apr 27, 2025
0.1.0	Apr 27, 2025

Download files

Source Distribution

sparkdq-0.5.0.tar.gz (39.3 kB)

Uploaded May 11, 2025 Source

Built Distribution

sparkdq-0.5.0-py3-none-any.whl (65.7 kB)

Uploaded May 11, 2025 Python 3

File details

Details for the file sparkdq-0.5.0.tar.gz.

File metadata

Download URL: sparkdq-0.5.0.tar.gz
Upload date: May 11, 2025
Size: 39.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.17

File hashes

Hashes for sparkdq-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`d8e3a89ce4ac7b5b52a1caa0bae4aafd642a25fd0c98fcda66703842564ede3f`
MD5	`b91ef0b53a224b00b6c91a6e5a1d354c`
BLAKE2b-256	`229c9302ca3223d7335b71bb2ed330cffa78bd874d379c0ac99aa626299b95cc`

File details

Details for the file sparkdq-0.5.0-py3-none-any.whl.

File metadata

Download URL: sparkdq-0.5.0-py3-none-any.whl
Upload date: May 11, 2025
Size: 65.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.17

File hashes

Hashes for sparkdq-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2bb8fdd2c2dcd8f60a550f536b8932dab0a56e6433d0d10316231437171568ed`
MD5	`20630ce08bf08ebfb82846b38483c787`
BLAKE2b-256	`8f1d51f25e54f630ca28dae5eebadd17dd03ba62bff7bdc762cf97143eadb5ca`
