
# open-spark-dlh-dq

Plug-and-play Data Quality and Unit Testing for Apache Spark / PySpark (batch and streaming) with YAML-configured checks, profiling, and optional OpenTelemetry hooks.

```text
open-spark-dlh-dq/
├─ pyproject.toml
├─ README.md
├─ LICENSE
│
├─ sparkdq/
│  ├─ __init__.py
│  │
│  ├─ cli/
│  │  └─ main.py                         # CLI: `sparkdq run --yaml open_spark_dlh_dq.yml`
│  │
│  ├─ config/
│  │  ├─ loader.py                       # YAML loader (safe_load + FileNotFound handling)
│  │  ├─ env.py                          # env vars for the Spark version & PyDeequ jar path
│  │  └─ schema.py                       # dict → DQSuite + bound validators (type/function/unit_tests)
│  │
│  ├─ core/
│  │  ├─ models.py
│  │  ├─ registry.py                     # decorators + resolve_by_path + normalized keys
│  │  ├─ spark.py                        # Spark session with the Deequ jar
│  │  ├─ runner.py                       # calls `validate(df)`; minor robustness
│  │  ├─ reporter.py                     # JSON serialization helpers (optional)
│  │  │
│  │  └─ validators/
│  │     ├─ __init__.py                  # (optional) import/register built-ins
│  │     ├─ base.py                      # Validator(name, params, severity?) + `validate(df)`
│  │     ├─ pydeequ_validators.py        # built-ins: not_null, uniqueness, row_count_gt, between
│  │     ├─ function_validators.py       # adapters: FunctionValidator, UnitTestValidator
│  │     └─ chispa_unit.py               # optional chispa helpers (e.g., schema equality)
│  │
│  ├─ profiling/
│  │  └─ profiler.py                     # optional: summary stats + quantiles + top-k
│  │
│  ├─ resources/
│  │  ├─ open_spark_dlh_dq.yml           # root YAML users edit (source of truth)
│  │  └─ deequ/
│  │     └─ deequ-2.0.12-spark-3.3.jar
│  │
│  ├─ observability/
│  │  └─ otel.py                         # optional: minimal OTel span decorator for future use
│  │
│  └─ integrations/
│     └─ streaming.py                    # optional: foreachBatch wrapper using suite validators
│
├─ user_checks/                          # users add their DQ/unit-test functions here
│  ├─ __init__.py
│  └─ example_checks.py                  # sample @dq_check and @unit_test functions
│
├─ examples/
│  ├─ suites/
│  │  └─ orders_dq.yml                   # example suite (alternative to the root YAML)
│  ├─ batch_example.py                   # sample: load YAML → run suite
│  └─ streaming_example.py               # sample: foreachBatch usage
│
└─ tests/
   ├─ test_yaml_loader.py                # verifies YAML parsing → DQSuite
   ├─ test_runner.py                     # runs a suite over a small DataFrame
   └─ test_validators.py                 # unit tests for each validator type
```
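Checks are declared in the root YAML file. The sketch below shows what a suite *might* look like, wired to the built-in validators named above (`not_null`, `uniqueness`, `row_count_gt`, `between`); the exact field names and layout are assumptions, so treat the bundled `open_spark_dlh_dq.yml` as the authoritative schema:

```yaml
# Illustrative only — the real schema is defined by sparkdq/config/schema.py.
suite: orders_dq
checks:
  - type: not_null
    params: {column: order_id}
    severity: error
  - type: uniqueness
    params: {column: order_id}
  - type: row_count_gt
    params: {threshold: 0}
  - type: between
    params: {column: amount, min: 0, max: 10000}
  # custom function check resolved by dotted path (hypothetical entry)
  - function: user_checks.example_checks.amounts_are_positive
```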


### Default Deequ JAR for Spark 3.3
The library auto-configures Deequ for Spark `3.3` by default. Place the jar `deequ-2.0.12-spark-3.3.jar` in one of:
- `C:\tools` (Windows)
- `/opt/tools`
- `/usr/local/share`

or set `DEEQU_JAR_PATH` to the full path of the jar file.
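The resolution order described above (explicit `DEEQU_JAR_PATH` first, then the well-known directories) can be sketched in plain Python. This is an illustrative sketch only — the function name `find_deequ_jar` and the exact search logic are assumptions, not the library's API; the real logic lives in `sparkdq/config/env.py`:

```python
import os

# Default jar bundled for Spark 3.3 (see sparkdq/resources/deequ/).
DEFAULT_JAR = "deequ-2.0.12-spark-3.3.jar"
# Well-known directories searched when no env var is set.
SEARCH_DIRS = [r"C:\tools", "/opt/tools", "/usr/local/share"]

def find_deequ_jar():
    # An explicit DEEQU_JAR_PATH always wins over the search directories.
    explicit = os.environ.get("DEEQU_JAR_PATH")
    if explicit:
        return explicit
    # Otherwise probe each well-known directory for the default jar.
    for d in SEARCH_DIRS:
        candidate = os.path.join(d, DEFAULT_JAR)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(
        f"{DEFAULT_JAR} not found; set DEEQU_JAR_PATH or place the jar in one of: "
        + ", ".join(SEARCH_DIRS)
    )
```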

### Override for other Spark versions
Set the following environment variables in your shell before starting Spark (PowerShell shown; the original snippet was mislabeled as bash):

```powershell
# Use Spark 3.4 with a different Deequ jar
$env:SPARK_VERSION = "3.4"
$env:DEEQU_JAR_PATH = "C:\tools\deequ-2.0.12-spark-3.4.jar"
```
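The `$env:` syntax above is PowerShell. On Linux or macOS, the equivalent bash/zsh exports would be (the jar path here is illustrative):

```shell
# Use Spark 3.4 with a different Deequ jar (bash/zsh equivalent)
export SPARK_VERSION="3.4"
export DEEQU_JAR_PATH="/opt/tools/deequ-2.0.12-spark-3.4.jar"
```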

On Windows, you may need to relax the PowerShell execution policy before running the project's build script:

```powershell
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
./build.ps1
```



### Download files

Source distribution:

- `open_sparkdq-0.1.9.tar.gz` (1.7 MB)

Built distribution:

- `open_sparkdq-0.1.9-py3-none-any.whl` (1.7 MB, Python 3)

### File details: `open_sparkdq-0.1.9.tar.gz`

- Download URL: open_sparkdq-0.1.9.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | 25ce13f1f361270769af237a7d229281cbe216e948c2dec7cd9e797787200536 |
| MD5         | 8ff3accf36eaddab7eacc70b31d0fc63 |
| BLAKE2b-256 | e36e89b4e209172f63f097dbd02abd3072fab02a938cdaeead36d3c148c3a1c2 |

### File details: `open_sparkdq-0.1.9-py3-none-any.whl`

- Download URL: open_sparkdq-0.1.9-py3-none-any.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9

| Algorithm   | Hash digest |
|-------------|-------------|
| SHA256      | 84d02ef9e9c8790e7a952b48154f332e8200f87d04dc7e4e8671d1343765c793 |
| MD5         | e2c82466712cf2d226249897944af907 |
| BLAKE2b-256 | d2d6bddff095d64486e43d180de9f257d6dd5683ee904dc32c35b5f79a142ea0 |
