A declarative PySpark framework for row- and aggregate-level data quality validation.

SparkDQ — Data Quality Validation for Apache Spark

Most data quality frameworks weren’t designed with PySpark in mind. They aren’t Spark-native and often lack proper support for declarative pipelines. Instead of integrating seamlessly, they require you to build custom wrappers around them just to fit into production workflows. This adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing — so you can’t react dynamically or fail early when data issues occur.

SparkDQ takes a different approach. It's built specifically for PySpark, so you can define and run data quality checks directly inside your Spark pipelines, using Python. Whether you're validating incoming data, verifying outputs before persistence, or enforcing assumptions in your dataflow, SparkDQ helps you catch issues early without adding complexity.

🚀 See the official documentation to learn more.

Quickstart Examples

Define checks as dictionaries that can be loaded from YAML/JSON files, stored in databases, or generated by APIs — perfect for CI/CD pipelines and data contracts.

from pyspark.sql import SparkSession

from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

# Declarative configuration via dictionary
# Could be loaded from YAML, JSON, or any external system
check_definitions = [
    {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]},
]
check_set = CheckSet()
check_set.add_checks_from_dicts(check_definitions)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
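
Because the definitions are plain dictionaries, they can live outside the code. The following is a minimal sketch, assuming the checks are stored as a JSON document with the same keys used above. The `load_check_definitions` helper and its required-key check are hypothetical names introduced for illustration and are not part of SparkDQ.

```python
import json

# Hypothetical JSON payload; in practice this could be read from a file
# or fetched from a configuration service.
RAW_CONFIG = """
[
  {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]}
]
"""

# Keys every definition in this sketch is expected to carry. This is an
# assumption for illustration; SparkDQ performs its own validation.
REQUIRED_KEYS = {"check-id", "check"}


def load_check_definitions(raw: str) -> list[dict]:
    """Parse a JSON array of check definitions and run a basic sanity check."""
    definitions = json.loads(raw)
    if not isinstance(definitions, list):
        raise ValueError("expected a JSON array of check definitions")
    for definition in definitions:
        missing = REQUIRED_KEYS - definition.keys()
        if missing:
            raise ValueError(
                f"definition {definition!r} is missing {sorted(missing)}"
            )
    return definitions


check_definitions = load_check_definitions(RAW_CONFIG)
```

The resulting list can then be passed to `check_set.add_checks_from_dicts(check_definitions)` in the same way as the inline list in the quickstart.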

Prefer Python-native development? You can also define checks using Python classes, which gives you type safety, IDE autocompletion, and validation of check parameters when the checks are defined rather than when the pipeline runs. See docs for examples of both approaches.

Installation

For Local Development / Standalone Clusters

Install with PySpark included:

pip install "sparkdq[spark]"

For Databricks / Managed Platforms

Install without PySpark (runtime provided by platform):

pip install sparkdq

The framework supports Python 3.10+ and is fully tested with PySpark 3.5.x. SparkDQ will automatically check for PySpark availability on import and provide clear error messages if PySpark is missing in your environment.
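
An availability check of this kind can be approximated with the standard library. The sketch below is not SparkDQ's actual implementation; `require_module` is a hypothetical helper that shows one way to detect a missing dependency and fail with an actionable message.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Raise a descriptive ImportError if `module_name` cannot be located.

    `find_spec` only looks the module up; it does not import it, so the
    check is cheap even for large packages.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Required module '{module_name}' was not found. {install_hint}"
        )
```

In a package that depends on PySpark, a call such as `require_module("pyspark", 'Install it with: pip install "sparkdq[spark]"')` near the top of the package would surface the problem at import time instead of deep inside a pipeline run.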

Why SparkDQ?

  • Robust Validation Layer: Clean separation of check definition, execution, and reporting

  • Declarative or Programmatic: Define checks via config files or directly in Python

  • Severity-Aware: Built-in distinction between warning and critical violations

  • Row & Aggregate Logic: Supports both record-level and dataset-wide constraints

  • Typed & Tested: Built with type safety, testability, and extensibility in mind

  • Zero Overhead: Pure PySpark, no heavy dependencies
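
To make the distinction between row-level and aggregate-level checks, and between warning and critical severities, more concrete, here is a small conceptual sketch in plain Python. It does not use SparkDQ's API; all names (`Severity`, `CheckResult`, `row_check`, `aggregate_check`, `enforce`) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

Row = dict[str, object]


class Severity(Enum):
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class CheckResult:
    check_id: str
    severity: Severity
    passed: bool


def row_check(
    check_id: str,
    severity: Severity,
    predicate: Callable[[Row], bool],
    rows: list[Row],
) -> CheckResult:
    """Row-level: the predicate is evaluated once per record."""
    return CheckResult(check_id, severity, all(predicate(r) for r in rows))


def aggregate_check(
    check_id: str,
    severity: Severity,
    predicate: Callable[[list[Row]], bool],
    rows: list[Row],
) -> CheckResult:
    """Aggregate-level: the predicate sees the dataset as a whole."""
    return CheckResult(check_id, severity, predicate(rows))


def enforce(results: list[CheckResult]) -> list[str]:
    """Raise on failed critical checks; return IDs of failed warning checks."""
    critical = [
        r.check_id
        for r in results
        if not r.passed and r.severity is Severity.CRITICAL
    ]
    if critical:
        raise ValueError(f"critical checks failed: {critical}")
    return [r.check_id for r in results if not r.passed]


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]

results = [
    row_check("name-not-null", Severity.WARNING,
              lambda r: r["name"] is not None, rows),
    aggregate_check("min-row-count", Severity.CRITICAL,
                    lambda rs: len(rs) >= 3, rows),
]

failed_warnings = enforce(results)
```

Here the null name only produces a warning, so processing continues, while a failed row-count check would stop the run. In a real pipeline the same idea would be expressed over Spark DataFrames rather than Python lists.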

Typical Use Cases

SparkDQ is built for modern data platforms that demand trust, transparency, and resilience. It helps teams enforce quality standards early and consistently — across ingestion, transformation, and delivery layers.

  • Data Ingestion: Validate raw data as it enters your platform with schema validation, completeness detection, format validation, and early failure detection

  • Lakehouse Quality: Enforce rules before persisting to storage including Delta/Iceberg/Hudi table validation, partition checks, and data freshness validation

  • ML & Analytics: Assert conditions before model training with feature quality checks, training data validation, bias detection, and model I/O validation

  • Pipeline Monitoring: Flag violations in production workflows through real-time alerts, SLA compliance monitoring, data drift detection, and automated incident response

Let’s Build Better Data Together

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.

Download files

Source Distribution

sparkdq-0.11.0.tar.gz (3.0 MB)

Uploaded Source

Built Distribution

sparkdq-0.11.0-py3-none-any.whl (93.6 kB)

Uploaded Python 3

File details

Details for the file sparkdq-0.11.0.tar.gz.

File metadata

  • Download URL: sparkdq-0.11.0.tar.gz
  • Upload date:
  • Size: 3.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.17

File hashes

Hashes for sparkdq-0.11.0.tar.gz
Algorithm Hash digest
SHA256 42c56522260e29cc3fbe8db1db7b11000cfbcec4bf20d8e52c65b3a15c13cb8c
MD5 a019b550e949a6902074ba71b4f8c8d1
BLAKE2b-256 4bbac06d51c404ba986f9b233acb0952b4ea918014ca26c2dfd7217aabc1e405

File details

Details for the file sparkdq-0.11.0-py3-none-any.whl.

File metadata

  • Download URL: sparkdq-0.11.0-py3-none-any.whl
  • Upload date:
  • Size: 93.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.17

File hashes

Hashes for sparkdq-0.11.0-py3-none-any.whl
Algorithm Hash digest
SHA256 84df991942f63a801d7b674ae577b2d26e7585a35d468d500135c261ab1c79af
MD5 c50a42e01ed75604b6762e938d2de841
BLAKE2b-256 738286ff4f9002fd821242e4dab657665d25a14803ddb4b43bb5ce141aa97f27
