
zynex


Fast, notebook-first data quality checks for Spark / Databricks

zynex is a lightweight data-quality validation library for Apache Spark, designed specifically for Databricks notebooks.
It provides quick, readable checks for common data issues without requiring schemas, configuration files, or heavy setup.

What zynex does

zynex focuses on a small set of high-signal checks that catch the most common data issues in analytical pipelines:

  • Structural issues — duplicate full rows
  • Data quality — null ratios per column
  • Distribution problems — extreme values and skewed data
  • Storage hygiene — small-file detection for Delta tables (metadata only)

The goal is not exhaustive validation, but fast feedback you can trust while working in notebooks.

Installation

pip install zynex

Quick Start

from zynex import zx

zx("schema.table")

API

Primary entry point:

zx(
    source,
    table_name=None,
    render=True,
    cache=False,
    modules=None,
    config=None,
)
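
For reference, here is a call that sets every optional argument explicitly (the values are illustrative, not recommendations):

from zynex import zx

report = zx(
    "schema.table",            # catalog table to validate
    render=False,              # return a ValidationReport instead of printing
    cache=True,                # persist the DataFrame during validation
    modules=["core_quality"],  # the default module set, stated explicitly
    config={"extreme_values_threshold_stddev": 2.5},
)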

Input Modes

1. Validate a catalog table

zx("schema.table")

or with Unity Catalog:

zx("catalog.schema.table")

Behavior:

  • Loads table via spark.table(...)
  • Runs pre-flight metadata checks (if Delta)
  • Runs full data scan

2. Validate a Spark DataFrame

zx(df)

Behavior:

  • Skips metadata preflight
  • Runs full data scan only
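
A self-contained sketch of this mode, assuming a notebook where spark is already defined (column names and values are illustrative):

from zynex import zx

# Tiny in-memory DataFrame: one duplicated row and one null value.
df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("b", None)],
    ["name", "age"],
)

zx(df)  # no metadata preflight; full data scan only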

3. Validate a DataFrame with table context

zx(df, table_name="schema.table")

Behavior:

  • Uses provided DataFrame
  • Uses table metadata for preflight checks
  • Avoids re-reading table
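
This mode is useful when you have already transformed a table and want the metadata preflight without a second read. A sketch (the filter column is hypothetical):

from zynex import zx

# Validate a filtered slice while keeping the table's metadata context.
df = spark.table("schema.table").where("event_date >= '2024-01-01'")

zx(df, table_name="schema.table")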

Optional Arguments

render

Default: True

If False, returns a ValidationReport object instead of printing.

report = zx("schema.table", render=False)

cache

Default: False

If True, the DataFrame is persisted during validation.

zx("schema.table", cache=True)

Recommended for large datasets.

modules

Default: ["core_quality"]

You can explicitly select modules:

zx("schema.table", modules=["core_quality"])

config

Override rule configuration:

zx(
    "schema.table",
    config={
        "extreme_values_threshold_stddev": 2.0
    }
)

Currently supported config keys:

  • extreme_values_threshold_stddev (default: 3.0)
  • cache (internal, set via argument)

Output Structure

Zynex prints:

  • Dataset summary (rows × columns)
  • Rule results grouped by:
    • OK
    • WARNING
    • ERROR
    • NOT_APPLICABLE

Example:

ZYNEX REPORT
Dataset: 240 000 rows x 10 columns | 0 Errors | 3 Warnings

[WARNING] duplicate_rows
[WARNING] null_ratio
[WARNING] extreme_values
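
With render=False you can regroup the same results yourself. A minimal sketch, assuming statuses are the plain strings shown above:

from zynex import zx

report = zx("schema.table", render=False)

for status in ("ERROR", "WARNING", "OK", "NOT_APPLICABLE"):
    for r in report.results:
        if r.status == status:
            print(f"[{status}] {r.name}")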

Pre-Flight Behavior

When validating a table:

  • Metadata checks run first (e.g., small_files)
  • Results are printed immediately
  • Validation continues regardless of warnings

Zynex does not block execution.

If fragmentation is detected:

  • Recommendation is shown
  • User decides whether to run OPTIMIZE
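
Acting on that recommendation is a plain Delta operation outside zynex; for example, on Databricks:

# Compact a fragmented Delta table (run only after reviewing the recommendation).
spark.sql("OPTIMIZE schema.table")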

Table Name Errors

If a table name is incorrect:

zx("schema.wrong_name")

Zynex prints:

  • Clear error message
  • Suggested similar tables (if available)
  • Hint to use SHOW TABLES

Validation stops early in this case.
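
Listing the available tables is standard Spark SQL, independent of zynex (schema name illustrative):

# Show tables in the schema to find the correct name.
spark.sql("SHOW TABLES IN schema").show()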

Design Philosophy

Zynex is:

  • Spark-native
  • Notebook-first
  • Advisory (not policy enforcement)
  • Lightweight and modular

It is not:

  • A data governance framework
  • A pipeline orchestrator
  • A blocking validation gate

Return Value

If render=False, returns:

ValidationReport

Containing:

  • row count
  • column count
  • rule results
  • metrics
  • messages

Inspecting the result programmatically

print(report.rows)
print(report.columns)

Iterate over rule results

for r in report.results:
    print(r.name, r.status)

Example:

for r in report.results:
    if r.name == "duplicate_rows":
        print(r.metrics)

Example metrics for duplicate_rows:

{
    "total_rows": 4.0,
    "unique_rows": 3.0,
    "duplicate_rows": 1.0
}

Example metrics for null_ratio:

{
    "total_nulls": 3,
    "per_column": {
        "name": {"nulls": 1},
        "email": {"nulls": 1},
        "age": {"nulls": 1}
    }
}
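
Given the shapes above, you can derive per-column null ratios from the metrics dict. A sketch, with key names taken from the examples:

for r in report.results:
    if r.name == "null_ratio":
        for col, m in r.metrics["per_column"].items():
            print(col, m["nulls"] / report.rows)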

Requirements

  • Python 3.8+
  • Spark 3.x
  • Databricks or a compatible Spark environment
  • Delta tables for the metadata preflight

Development

# Install in editable mode with dev dependencies
pip install -e ".[local,dev]"

# Run tests
pytest
