# open-spark-dlh-dq

Plug-and-play data quality and unit testing for PySpark (batch and streaming), with YAML-configured checks, profiling, and optional OpenTelemetry hooks.
    open-spark-dlh-dq/
    ├─ pyproject.toml
    ├─ README.md
    ├─ LICENSE
    │
    ├─ sparkdq/
    │  ├─ __init__.py
    │  │
    │  ├─ cli/
    │  │  └─ main.py                    # CLI: `sparkdq run --yaml open_spark_dlh_dq.yml`
    │  │
    │  ├─ config/
    │  │  ├─ loader.py                  # YAML loader (safe_load + FileNotFound)
    │  │  ├─ env.py                     # ENV VAR for spark version & pydeequ jar file
    │  │  └─ schema.py                  # dict → DQSuite + bound validators (type/function/unit_tests)
    │  │
    │  ├─ core/
    │  │  ├─ models.py
    │  │  ├─ registry.py                # decorators + resolve_by_path + normalized keys
    │  │  ├─ spark.py                   # spark session with deequ jar
    │  │  ├─ runner.py                  # calls `validate(df)`; minor robustness
    │  │  ├─ reporter.py                # JSON serialization helpers (optional)
    │  │  │
    │  │  └─ validators/
    │  │     ├─ base.py                 # Validator(name, params, severity?) + `validate(df)`
    │  │     ├─ pydeequ_validators.py   # Built-in validators: not_null, uniqueness, row_count_gt, between
    │  │     ├─ function_validators.py  # Adapters: FunctionValidator, UnitTestValidator
    │  │     ├─ chispa_unit.py          # Optional: chispa helpers (e.g., schema equality)
    │  │     └─ __init__.py             # (optional) import/register built-ins
    │  │
    │  ├─ profiling/
    │  │  └─ profiler.py                # Optional: summary stats + quantiles + top-k
    │  │
    │  ├─ resources/
    │  │  ├─ open_spark_dlh_dq.yml      # Root YAML users edit (source of truth)
    │  │  └─ deequ/
    │  │     └─ deequ-2.0.12-spark-3.3.jar
    │  │
    │  ├─ observability/
    │  │  └─ otel.py                    # Optional: minimal OTel span decorator for future use
    │  │
    │  └─ integrations/
    │     └─ streaming.py               # Optional: foreachBatch wrapper using suite validators
    │
    ├─ user_checks/                     # Users add their DQ/unit-test functions here
    │  ├─ __init__.py
    │  └─ example_checks.py             # Sample @dq_check and @unit_test functions
    │
    ├─ examples/
    │  ├─ suites/
    │  │  └─ orders_dq.yml              # Example suite (alt to root YAML)
    │  ├─ batch_example.py              # Sample: load YAML → run suite
    │  └─ streaming_example.py          # Sample: foreachBatch usage
    │
    └─ tests/
       ├─ test_yaml_loader.py           # Verifies YAML parsing → DQSuite
       ├─ test_runner.py                # Runs suite over small DF
       └─ test_validators.py            # Unit tests for each validator type

### Default Deequ JAR for Spark 3.3

The library auto-configures Deequ for Spark `3.3` by default. Place the jar `deequ-2.0.12-spark-3.3.jar` in one of:

- `C:\tools` (Windows)
- `/opt/tools`
- `/usr/local/share`

Alternatively, set `DEEQU_JAR_PATH` to the full path of the jar file.

### Override for other Spark versions

Set the following environment variables in your script or shell (PowerShell syntax shown):

```powershell
# Use Spark 3.4 with a different Deequ jar
$env:SPARK_VERSION = "3.4"
$env:DEEQU_JAR_PATH = "C:\tools\deequ-2.0.12-spark-3.4.jar"
```

To run the build script on Windows, first allow locally created scripts for the current user:

```powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
./build.ps1
```
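The jar lookup described above can be sketched in plain Python. This is an illustration only, not the library's actual code: `resolve_deequ_jar` is a hypothetical helper, and both the precedence of `DEEQU_JAR_PATH` over the directory search and the `deequ-2.0.12-spark-<version>.jar` naming pattern are assumptions inferred from the examples in this README.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional

# Directories named in this README, in the order they are listed.
DEFAULT_JAR_DIRS = (r"C:\tools", "/opt/tools", "/usr/local/share")
DEFAULT_SPARK_VERSION = "3.3"


def resolve_deequ_jar(
    env: Optional[Mapping[str, str]] = None,
    search_dirs: Iterable[str] = DEFAULT_JAR_DIRS,
) -> Path:
    """Hypothetical helper: return the Deequ jar path or raise FileNotFoundError."""
    env = os.environ if env is None else env
    dirs = list(search_dirs)

    # Assumption: an explicit DEEQU_JAR_PATH wins over the directory search.
    explicit = env.get("DEEQU_JAR_PATH")
    if explicit:
        jar = Path(explicit)
        if jar.is_file():
            return jar
        raise FileNotFoundError(f"DEEQU_JAR_PATH is not a file: {explicit}")

    # Assumption: the file name follows the pattern used in this README.
    spark_version = env.get("SPARK_VERSION", DEFAULT_SPARK_VERSION)
    jar_name = f"deequ-2.0.12-spark-{spark_version}.jar"
    for directory in dirs:
        candidate = Path(directory) / jar_name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{jar_name} not found in: {', '.join(dirs)}")
```

In a Python script, the same override shown for the shell can be applied by assigning `os.environ["SPARK_VERSION"]` and `os.environ["DEEQU_JAR_PATH"]` before the Spark session is created.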