Python library for data validation on PySpark DataFrame API.

Project description

cuallee

Meaning "good" in Aztec (Nahuatl).

This library provides an intuitive API to describe checks for Apache PySpark DataFrames (v3.3.0). It is a pure-Python replacement for the pydeequ framework.

I gave up on deequ because the project does not seem to be maintained, and because of the multiple issues with its callback server.

Advantages

This implementation goes hand in hand with the latest PySpark API and uses the Observation API to collect metrics at a lower computational cost. When benchmarked against pydeequ, cuallee uses fewer than about 3k Java classes underneath and remarkably less memory.

cuallee is inspired by the Green Software Foundation principles on the advantages of green software.
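To illustrate what collecting metrics through the Observation API can look like, here is a minimal sketch in plain PySpark (3.3.0 or later). It is not cuallee's actual implementation: the helper names `collect_metrics` and `completeness`, and the metric names `rows` and `non_null`, are hypothetical and chosen only for this example. The point is that several aggregates are gathered during a single action over the DataFrame instead of one job per metric.

```python
from typing import Dict


def collect_metrics(df, column: str) -> Dict[str, int]:
    """Gather the row count and non-null count of `column` in one pass.

    Hypothetical helper: a sketch of the Observation pattern, not cuallee code.
    """
    # Imported lazily so that `completeness` below works without PySpark installed.
    from pyspark.sql import Observation
    import pyspark.sql.functions as F

    observation = Observation("completeness")
    observed = df.observe(
        observation,
        F.count(F.lit(1)).alias("rows"),           # total number of rows
        F.count(F.col(column)).alias("non_null"),  # count() skips nulls
    )
    observed.count()  # any action triggers the metric collection
    return observation.get  # dict keyed by the aliases above


def completeness(metrics: Dict[str, int]) -> float:
    """Return the fraction of non-null values; an empty dataset counts as complete."""
    rows = metrics["rows"]
    return 1.0 if rows == 0 else metrics["non_null"] / rows
```

With a running session, `completeness(collect_metrics(spark.range(10), "id"))` would evaluate both aggregates in the same job that serves the `count()` action.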

Checks

is_complete

from cuallee import Check, CheckLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Check for nulls in column id
check = Check(CheckLevel.WARNING, "Completeness")
check.is_complete("id").validate(spark, spark.range(10))

is_unique

# Check for unique values in column id
check = Check(CheckLevel.WARNING, "Uniqueness")
check.is_unique("id").validate(spark, spark.range(10))


Download files

Source Distribution

cuallee-0.0.2.tar.gz (2.8 kB)

Built Distribution

cuallee-0.0.2-py3-none-any.whl (1.8 kB, Python 3)
