zynex
Fast, notebook-first data quality checks for Spark / Databricks
zynex is a lightweight data-quality validation library for Apache Spark, designed specifically for Databricks notebooks.
It provides quick, readable checks for common data issues without requiring schemas, configuration files, or heavy setup.
What zynex does
zynex focuses on a small set of high-signal checks that catch the most common data issues in analytical pipelines:
- Structural issues — duplicate full rows
- Data quality — null ratios per column
- Distribution problems — extreme values and skewed data
- Storage hygiene — small-file detection for Delta tables (metadata only)
The goal is not exhaustive validation, but fast feedback you can trust while working in notebooks.
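Conceptually, the first three checks reduce to simple computations. The sketch below illustrates them in plain Python, independent of Spark; the function names here are illustrative stand-ins, not zynex's API:

```python
from statistics import mean, stdev

def duplicate_row_count(rows):
    """Count full-row duplicates (rows represented as hashable tuples)."""
    return len(rows) - len(set(rows))

def null_ratio(values):
    """Fraction of None entries in a column."""
    return sum(v is None for v in values) / len(values)

def extreme_values(values, threshold_stddev=3.0):
    """Values more than `threshold_stddev` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold_stddev * s]

rows = [(1, "a"), (2, "b"), (2, "b")]
print(duplicate_row_count(rows))                    # 1
print(null_ratio([1, None, 3, None]))               # 0.5
print(extreme_values([10] * 10 + [100], threshold_stddev=2.0))  # [100]
```

Note that a single outlier inflates the standard deviation itself, which is why a lower threshold (or a larger sample) is needed to flag it.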
Installation
pip install zynex
Quick Start
from zynex import zx
zx("schema.table")
API
Primary entry point:
zx(
    source,
    table_name=None,
    render=True,
    cache=False,
    modules=None,
    config=None,
)
Input Modes
1. Validate a catalog table
zx("schema.table")
or with Unity Catalog:
zx("catalog.schema.table")
Behavior:
- Loads table via spark.table(...)
- Runs pre-flight metadata checks (if Delta)
- Runs full data scan
2. Validate a Spark DataFrame
zx(df)
Behavior:
- Skips metadata preflight
- Runs full data scan only
3. Validate a DataFrame with table context
zx(df, table_name="schema.table")
Behavior:
- Uses provided DataFrame
- Uses table metadata for preflight checks
- Avoids re-reading table
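A plausible dispatch over these three modes keys on the input type; this is a sketch of the behavior described above, not zynex's actual implementation:

```python
def zx_dispatch(source, table_name=None):
    """Illustrative input-mode resolution, mirroring the three modes above."""
    if isinstance(source, str):
        # Mode 1: catalog table name -> load it, run preflight, then full scan
        return {"load": source, "preflight": True, "scan": True}
    if table_name is not None:
        # Mode 3: DataFrame plus table context -> metadata preflight, no re-read
        return {"load": None, "preflight": True, "scan": True}
    # Mode 2: bare DataFrame -> full data scan only
    return {"load": None, "preflight": False, "scan": True}

print(zx_dispatch("schema.table"))
# {'load': 'schema.table', 'preflight': True, 'scan': True}
```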
Optional Arguments
render
Default: True
If False, returns a ValidationReport object instead of printing.
report = zx("schema.table", render=False)
cache
Default: False
If True, DataFrame is persisted during validation.
zx("schema.table", cache=True)
Recommended for large datasets, where persisting avoids recomputing the source DataFrame for each rule's scan.
modules
Default: ["core_quality"]
You can explicitly select modules:
zx("schema.table", modules=["core_quality"])
config
Override rule configuration:
zx(
    "schema.table",
    config={
        "extreme_values_threshold_stddev": 2.0
    }
)
Currently supported config keys:
- extreme_values_threshold_stddev (default: 3.0)
- cache (internal, set via argument)
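The config argument overrides these defaults; conceptually it is a shallow merge of user values onto the built-in ones. A sketch of that semantics (not zynex's internals):

```python
# Built-in default, per the supported config keys above.
DEFAULTS = {"extreme_values_threshold_stddev": 3.0}

def resolve_config(config=None):
    """Merge user overrides onto the defaults, leaving unset keys at their defaults."""
    merged = dict(DEFAULTS)
    if config:
        merged.update(config)
    return merged

print(resolve_config({"extreme_values_threshold_stddev": 2.0}))
# {'extreme_values_threshold_stddev': 2.0}
```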
Output Structure
Zynex prints:
- Dataset summary (rows × columns)
- Rule results grouped by:
- OK
- WARNING
- ERROR
- NOT_APPLICABLE
Example:
ZYNEX REPORT
Dataset: 240 000 rows x 10 columns | 0 Errors | 3 Warnings
[WARNING] duplicate_rows
[WARNING] null_ratio
[WARNING] extreme_values
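Grouping rule results into those four buckets can be sketched with stdlib tools; the result tuples here are hypothetical stand-ins for zynex's rule results:

```python
from collections import defaultdict

# Hypothetical (name, status) pairs, shaped like the report above.
results = [("duplicate_rows", "WARNING"), ("null_ratio", "WARNING"),
           ("extreme_values", "WARNING"), ("small_files", "OK")]

by_status = defaultdict(list)
for name, status in results:
    by_status[status].append(name)

for status in ("OK", "WARNING", "ERROR", "NOT_APPLICABLE"):
    print(f"{status}: {by_status.get(status, [])}")
```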
Pre-Flight Behavior
When validating a table:
- Metadata checks run first (e.g., small_files)
- Results are printed immediately
- Validation continues regardless of warnings
Zynex does not block execution.
If fragmentation is detected:
- Recommendation is shown
- User decides whether to run OPTIMIZE
Table Name Errors
If a table name is incorrect:
zx("schema.wrong_name")
Zynex prints:
- Clear error message
- Suggested similar tables (if available)
- Hint to use SHOW TABLES
Validation stops early in this case.
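One common way to produce such suggestions (zynex's exact approach is not documented here) is fuzzy matching against the catalog's table list, e.g. with difflib:

```python
from difflib import get_close_matches

# In a real session these candidates would come from SHOW TABLES.
known_tables = ["schema.users", "schema.user_events", "schema.orders"]

def suggest_tables(wrong_name, candidates, n=3):
    """Return the closest-matching table names for an error hint."""
    return get_close_matches(wrong_name, candidates, n=n, cutoff=0.6)

print(suggest_tables("schema.user", known_tables))
```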
Design Philosophy
Zynex is:
- Spark-native
- Notebook-first
- Advisory (not policy enforcement)
- Lightweight and modular
It is not:
- A data governance framework
- A pipeline orchestrator
- A blocking validation gate
Return Value
If render=False, returns:
ValidationReport
Containing:
- row count
- column count
- rule results
- metrics
- messages
Inspecting the result programmatically
print(report.rows)
print(report.columns)
Iterate over rule results
for r in report.results:
    print(r.name, r.status)
Example:
for r in report.results:
    if r.name == "duplicate_rows":
        print(r.metrics)
Example metrics for duplicate_rows:
{
    "total_rows": 4.0,
    "unique_rows": 3.0,
    "duplicate_rows": 1.0
}
Example metrics for null_ratio:
{
    "total_nulls": 3,
    "per_column": {
        "name": {"nulls": 1},
        "email": {"nulls": 1},
        "age": {"nulls": 1}
    }
}
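Given the metric shapes above, downstream ratios are straightforward to derive. This assumes exactly the structures shown (the total row count comes from the duplicate_rows metrics here, but could equally come from report.rows):

```python
dup = {"total_rows": 4.0, "unique_rows": 3.0, "duplicate_rows": 1.0}
nulls = {"total_nulls": 3, "per_column": {"name": {"nulls": 1},
                                          "email": {"nulls": 1},
                                          "age": {"nulls": 1}}}

# Fraction of rows that are exact duplicates of another row.
duplicate_ratio = dup["duplicate_rows"] / dup["total_rows"]

# Per-column null fraction, derived from the per_column counts.
null_ratios = {col: m["nulls"] / dup["total_rows"]
               for col, m in nulls["per_column"].items()}

print(duplicate_ratio)   # 0.25
print(null_ratios)       # {'name': 0.25, 'email': 0.25, 'age': 0.25}
```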
Requirements
- Spark 3.x
- Databricks or compatible Spark environment
- Delta tables for metadata preflight
Development
# Install in editable mode with dev dependencies
pip install -e .[local,dev]
# Run tests
pytest