
Python library for data validation on PySpark DataFrame API.

Project description

cuallee

Meaning good in Aztec (Nahuatl), pronounced: QUAL-E

This library provides an intuitive API to describe data quality checks for PySpark v3.3.0 DataFrames. It is a pure-Python replacement for the pydeequ framework.

I gave up on deequ, as the project no longer seems to be maintained and pydeequ has recurring issues with its callback server.

Advantages

This implementation keeps pace with the latest PySpark API and uses the Observation API to collect metrics at a lower computational cost. When benchmarked against pydeequ, cuallee loads fewer than circa 3k Java classes under the hood and uses remarkably less memory.

cuallee is inspired by the Green Software Foundation principles on the advantages of green software.

Install

pip install cuallee

Checks

Completeness and Uniqueness

from cuallee import Check, CheckLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # example DataFrame with an "id" column

# No nulls and no duplicates on column id
check = Check(CheckLevel.WARNING, "Completeness")
(
    check
    .is_complete("id")
    .is_unique("id")
    .validate(spark, df)
).show()  # returns a pyspark.sql.DataFrame

Date Algebra

# All values of the date column fall between the two bounds
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql("select explode(sequence(to_date('2022-01-01'), to_date('2022-01-10'), interval 1 day)) as date")
assert (
    check.is_between("date", "2022-01-01", "2022-01-10")
    .validate(spark, df)
    .first()
    .status
)
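The same date range can be reasoned about in plain Python, which helps sanity-check the bounds used above (illustration only, not part of cuallee):

```python
from datetime import date, timedelta

start, end = date(2022, 1, 1), date(2022, 1, 10)
# Equivalent of Spark's sequence(...) with a 1-day interval: inclusive on both ends.
dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

assert all(start <= d <= end for d in dates)
print(len(dates))  # 10 dates, so is_between over this range passes
```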

Value Membership

df = spark.createDataFrame([[1, 10], [2, 15], [3, 17]], ["ID", "value"])
check = Check(CheckLevel.WARNING, "is_contained_in_number_test")
check.is_contained_in("value", (10, 15, 20, 25)).validate(spark, df)
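Conceptually, the membership check measures what fraction of rows falls inside the allowed set; 17 is not a member, so the check fails at full coverage. A plain-Python sketch of that computation (not cuallee code):

```python
rows = [(1, 10), (2, 15), (3, 17)]
allowed = {10, 15, 20, 25}

# Fraction of rows whose "value" is in the allowed set.
coverage = sum(value in allowed for _, value in rows) / len(rows)
print(round(coverage, 3))  # 0.667 -- row (3, 17) violates the membership rule
```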

Regular Expressions

df = spark.createDataFrame([[1, "is_blue"], [2, "has_hat"], [3, "is_smart"]], ["ID", "desc"])
check = Check(CheckLevel.WARNING, "matches_regex_test")
check.matches_regex("desc", r"^is.*t$")  # only "is_smart" matches, i.e. 33% of rows
assert check.validate(spark, df).first().status == "FAIL"
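The pass rate behind that FAIL can be reproduced in plain Python: only one of the three descriptions starts with "is" and ends with "t" (illustration only):

```python
import re

rows = ["is_blue", "has_hat", "is_smart"]
pattern = re.compile(r"^is.*t$")

# Fraction of rows matching the pattern anchored at both ends.
coverage = sum(bool(pattern.match(s)) for s in rows) / len(rows)
print(round(coverage, 3))  # 0.333, below 100% coverage, hence FAIL
```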

More...

  • are_complete(*cols)
  • matches_regex(col, regex)
  • is_greater_than(col, val)
  • is_greater_or_equal_than(col, val)
  • is_less_than(col, val)
  • is_less_or_equal_than(col, val)
  • is_equal_than(col, val)
  • has_min(col, val)
  • has_max(col, val)
  • has_std(col, val)
  • has_percentile(col, value, percentile, precision, coverage)
  • is_between(col, i, k)
  • is_between(col, date_1, date_2)
  • has_min_by(col2, col1, value)
  • satisfies(predicate, coverage)

Roadmap

This is a very fresh implementation built on the Observation API introduced in PySpark v3.3.0. The next round of validations on the roadmap covers more practical use cases:

  • between_years(y1, y2)
  • in_business_day(col)
  • in_working_time(col)
  • in_weekend(col)
  • is_in_millions(col)
  • is_in_billions(col)
  • has_entropy(col)
  • has_correlation(col1, col2, value)
  • has_mutual_information(col1, col2)
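As one illustration of what a has_entropy check might measure (its exact semantics are not defined yet, so this is only a sketch), the Shannon entropy of a column's empirical distribution can be computed as:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy([1, 1, 2, 2]))  # 1.0 bit: two equally likely values
```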

Authors

  • Herminio Vazquez
  • Virginie Grosboillot

License

Apache License 2.0. Free for commercial use, modification, distribution, patent use, and private use. Just preserve the copyright and license notices.

