Plug-and-play Data Quality + Unit Testing for PySpark (batch & streaming) with YAML config, profiling, and optional OpenTelemetry hooks.
open-spark-dlh-dq
Open-source, plug-and-play Data Quality for Apache Spark (batch + streaming) with YAML checks, profiling, and OpenTelemetry.
Project Overview
open-spark-dlh-dq is an open-source Python library providing a Data Quality (DQ) framework for Apache Spark.
It supports:
- Batch & Streaming DQ with declarative YAML suites
- Custom checks via Python (dq_check, unit_test)
- CLI execution for datasets in directories or Spark DataFrames
- Inline checks in PySpark scripts
- Format support: Parquet, CSV, Iceberg, Delta, JSON, ORC
- Profiler & OpenTelemetry for observability
Built on PySpark, PyDeequ, and Chispa, this library enables robust data validation pipelines.
Features
- Batch DQ: Validate static datasets using YAML or inline rules.
- Streaming DQ: Apply checks on micro-batches via foreachBatch.
- Custom Checks: Extend with Python functions in user_checks/.
- CLI Tool: Run suites via sparkdq run --yaml <suite.yml>.
- Profiler: Generate summary stats and quantiles.
- OpenTelemetry: Capture spans and traces for test cases.
Repository Structure
open-spark-dlh-dq/
├─ pyproject.toml
├─ README.md
├─ LICENSE
│
├─ sparkdq/
│  ├─ cli/main.py                       # CLI entry point
│  ├─ config/                           # YAML loader, env vars, schema binding
│  ├─ core/                             # Models, registry, Spark session, runner
│  │  └─ validators/                    # Built-in + custom validator classes
│  ├─ profiling/profiler.py             # Profiling utilities
│  ├─ resources/open_spark_dlh_dq.yml   # Default YAML suite
│  ├─ observability/otel.py             # OpenTelemetry integration
│  └─ integrations/streaming.py         # foreachBatch wrapper
│
├─ user_checks/                         # User-defined checks
│  └─ example_checks.py
│
├─ examples/                            # Usage examples
│  ├─ suites/orders_dq.yml
│  ├─ batch_example.py
│  └─ streaming_example.py
│
└─ tests/                               # Unit tests
   ├─ test_yaml_loader.py
   ├─ test_chispa_integration.py
   ├─ test_pydeequ_integration.py
   ├─ test_runner.py
   ├─ test_validators.py
   ├─ test_validator_contracts.py
   └─ test_cli.py
Usage
Run CLI with YAML suite
sparkdq run --yaml ./sparkdq/resources/open_spark_dlh_dq.yml --suite-name orders_dq --format text
Inline checks in PySpark
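from pyspark.sql import SparkSession
from sparkdq.core.runner import run_suite
from sparkdq.config.loader import load_yaml_suite
# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()
suite = load_yaml_suite("./sparkdq/resources/open_spark_dlh_dq.yml")
df = spark.read.parquet("./data/orders")
run_suite(df, suite)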
Streaming example
python examples/streaming_example.py
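Streaming DQ is described above as applying checks to micro-batches via foreachBatch. The sketch below illustrates that general pattern only; it does not use the library's own wrapper in sparkdq/integrations/streaming.py, whose interface is not shown in this README. The names make_batch_validator, checks, and on_failure are hypothetical. The one Spark API it relies on is DataStreamWriter.foreachBatch, which calls a function with (batch_df, batch_id) for each micro-batch.

```python
from typing import Any, Callable, Dict, List

# A check takes one micro-batch and returns True when the batch passes.
Check = Callable[[Any], bool]


def make_batch_validator(
    checks: Dict[str, Check],
    on_failure: Callable[[int, List[str]], None],
) -> Callable[[Any, int], None]:
    """Build a function with the (batch_df, batch_id) signature used by foreachBatch.

    Hypothetical helper for illustration; not part of open-spark-dlh-dq.
    """

    def validate(batch_df: Any, batch_id: int) -> None:
        # Run every named check and collect the names of the ones that fail.
        failed = [name for name, check in checks.items() if not check(batch_df)]
        if failed:
            on_failure(batch_id, failed)

    return validate


# Wiring into a stream (assumes an existing streaming DataFrame `stream_df`
# and a check such as amount_positive(df) -> bool):
#
#   handler = make_batch_validator(
#       {"amount_positive": amount_positive},
#       lambda batch_id, names: print(f"batch {batch_id} failed: {names}"),
#   )
#   query = stream_df.writeStream.foreachBatch(handler).start()
```

Because the validator only calls the checks it is given, it can be exercised in a unit test with plain Python objects standing in for a micro-batch before it is attached to a real stream.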
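Custom Checks
Add Python functions in user_checks/example_checks.py:
from sparkdq.core.registry import dq_check, unit_test

@dq_check("amount_positive")
def amount_positive(df):
    # Passes only when every row has amount > 0; rows with a null amount are dropped by the filter, so they fail the check
    return df.filter(df.amount > 0).count() == df.count()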
Reference them in YAML:
test_cases:
  - name: amount_positive
    type: dq_check
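To make the link between the decorator name and the YAML entry concrete, here is a minimal sketch of a name-based registry. It is an assumption about how such a mechanism could work, not the implementation in sparkdq/core/registry.py: register_check, run_named_checks, and the _CHECKS dictionary are hypothetical, and plain dictionaries stand in for DataFrame rows so the example runs without Spark.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a check name to the function that implements it.
_CHECKS: Dict[str, Callable[[Any], bool]] = {}


def register_check(name: str) -> Callable:
    """Decorator that stores a function under `name` (illustrative stand-in for dq_check)."""

    def decorator(func: Callable[[Any], bool]) -> Callable[[Any], bool]:
        if name in _CHECKS:
            raise ValueError(f"duplicate check name: {name}")
        _CHECKS[name] = func
        return func

    return decorator


def run_named_checks(data: Any, test_cases: List[Dict[str, str]]) -> Dict[str, bool]:
    """Look up each test case by its `name` field and run it against `data`."""
    results = {}
    for case in test_cases:
        name = case["name"]
        if name not in _CHECKS:
            raise KeyError(f"no check registered as {name!r}")
        results[name] = bool(_CHECKS[name](data))
    return results


@register_check("amount_positive")
def amount_positive(rows: List[Dict[str, float]]) -> bool:
    # Lists of dicts stand in for DataFrame rows in this sketch.
    return all(row["amount"] > 0 for row in rows)


# The parsed form of the YAML snippet above.
test_cases = [{"name": "amount_positive", "type": "dq_check"}]
print(run_named_checks([{"amount": 5.0}, {"amount": 2.5}], test_cases))
```

The design point is that the YAML suite only carries names, so a check can be renamed or replaced in Python without touching the suite's structure, as long as the registered name stays the same.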
Profiler & OpenTelemetry
Enable profiling and observability in your pipeline:
from sparkdq.profiling.profiler import profile_df
profile_df(df)
OpenTelemetry spans can be enabled via sparkdq/observability/otel.py.
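The output of profile_df is not documented here, but the profiler is described as producing summary stats and quantiles. As a rough illustration of what those look like, the sketch below uses two standard PySpark DataFrame methods, summary and approxQuantile. The function quick_profile and its parameters are hypothetical and are not part of this library.

```python
def quick_profile(df, column, probabilities=(0.25, 0.5, 0.75), relative_error=0.01):
    """Return (summary, quantiles) for one numeric column of a PySpark DataFrame.

    Hypothetical helper for illustration; not the library's profile_df.
    """
    # DataFrame.summary computes the requested statistics as a small DataFrame.
    summary = df.select(column).summary("count", "mean", "stddev", "min", "max")
    # DataFrame.approxQuantile returns one value per requested probability;
    # a smaller relative_error is more precise but more expensive.
    values = df.approxQuantile(column, list(probabilities), relative_error)
    return summary, dict(zip(probabilities, values))


# Usage, assuming `df` is a PySpark DataFrame with a numeric "amount" column:
#
#   summary, quantiles = quick_profile(df, "amount")
#   summary.show()
#   print(quantiles)
```

approxQuantile trades accuracy for speed through relative_error, which matters on large datasets where exact quantiles would require a full sort.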
Build & Publish
Build for PyPI (Windows)
./build.ps1
Build for PyPI (Linux)
./build.sh
Example Repository to understand how to use
File details
Details for the file open_sparkdq-0.1.10.tar.gz.
File metadata
- Download URL: open_sparkdq-0.1.10.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3f265a2a4152086fa2ec1c9cb009a5d447058643d2bc76cc8c91281100f4365 |
| MD5 | cd5668c457498fb9b01d7ebb9c5c2160 |
| BLAKE2b-256 | 8bf05355ba6fbffbbfb1cfaf111cb03f8b5715c7f04142a67114a8248a2dfeb8 |
File details
Details for the file open_sparkdq-0.1.10-py3-none-any.whl.
File metadata
- Download URL: open_sparkdq-0.1.10-py3-none-any.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3fde613ddcc32a25c77dbff79be7db4760f30cf3fcf18d8841a6a284a89e4533 |
| MD5 | 3b53805a8902cc93e69156bfb7a2516b |
| BLAKE2b-256 | 293a331e333e210090499147f6c225a6acc3bdb3c05bb9de9ca452d3c76e7065 |