Python library for data validation on the PySpark DataFrame API.

Project description

cuallee

Meaning good in Aztec (Nahuatl), pronounced: QUAL-E

This library provides an intuitive API to describe checks for Apache PySpark DataFrames (v3.3.0). It is a pure-Python replacement for the pydeequ framework.

I gave up on deequ because the project does not seem to be maintained and because of multiple issues with the callback server.

Advantages

This implementation goes hand in hand with the latest PySpark API and uses the Observation API to collect metrics at a lower computational cost. When benchmarked against pydeequ, cuallee uses fewer than roughly 3k Java classes underneath and remarkably less memory.
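
The cost argument can be illustrated without Spark. The sketch below is plain Python, not the PySpark Observation API: it only shows the idea of gathering several metrics during one traversal of the data instead of one traversal per rule. The function name `collect_metrics` and the row layout are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def collect_metrics(rows: Iterable[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Gather row count, non-null count, min and max for `column` in one pass."""
    total = 0
    non_null = 0
    minimum = None
    maximum = None
    for row in rows:  # the data is traversed exactly once
        total += 1
        value = row.get(column)
        if value is None:
            continue
        non_null += 1
        minimum = value if minimum is None else min(minimum, value)
        maximum = value if maximum is None else max(maximum, value)
    return {"rows": total, "non_null": non_null, "min": minimum, "max": maximum}


# Hypothetical sample rows; an iterator is used to show that one pass is enough.
rows = [{"id": 1, "value": 10}, {"id": 2, "value": None}, {"id": 3, "value": 17}]
metrics = collect_metrics(iter(rows), "value")
print(metrics)
```

Every additional metric costs one more accumulator inside the loop rather than another scan, which is the property the paragraph above attributes to collecting metrics through observations.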

cuallee is inspired by the Green Software Foundation principles on the advantages of green software.

Checks

Completeness and Uniqueness

from cuallee import Check, CheckLevel
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Example data: spark.range yields a single column named "id"
df = spark.range(10)

# No nulls and no duplicates on column id
check = Check(CheckLevel.WARNING, "Completeness")
(
    check
    .is_complete("id")
    .is_unique("id")
    .validate(spark, df)
).show() # validate() returns a pyspark.sql.DataFrame
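
To make the two rules concrete, here is a minimal plain-Python sketch of the ratios they could report. How cuallee computes its pass rate internally is not stated here, so both definitions (non-null values over rows, and distinct non-null values over rows) are assumptions, and the helper names are hypothetical.

```python
from typing import Any, List, Optional


def completeness(values: List[Optional[Any]]) -> float:
    """Fraction of values that are not None."""
    return sum(v is not None for v in values) / len(values)


def uniqueness(values: List[Optional[Any]]) -> float:
    """Distinct non-null values divided by the number of rows (assumed definition)."""
    distinct = {v for v in values if v is not None}
    return len(distinct) / len(values)


ids = [1, 2, 2, None]           # hypothetical column with one null and one duplicate
complete_rate = completeness(ids)
unique_rate = uniqueness(ids)
print(complete_rate, unique_rate)
```

Under these assumptions a column passes a rule only when its ratio reaches the rule's threshold, so the sample column above would fail both checks at a threshold of 1.0.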

Date Algebra

# All dates fall between 2022-01-01 and 2022-01-10
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql("select explode(sequence(to_date('2022-01-01'), to_date('2022-01-10'), interval 1 day)) as date")
assert (
    check.is_between("date", "2022-01-01", "2022-01-10")
    .validate(spark, df)
    .first()
    .status
) == "PASS"
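
The same date range can be reproduced with the standard library to see what the check evaluates. This is only an illustration: treating both bounds as inclusive is an assumption about `is_between`, and the variable names are not part of cuallee.

```python
from datetime import date, timedelta

start, end = date(2022, 1, 1), date(2022, 1, 10)

# The same days that the Spark sequence(...) expression generates (both ends included).
days = [start + timedelta(days=n) for n in range((end - start).days + 1)]

# Assumption: the rule treats its lower and upper bounds as inclusive.
in_range = [start <= d <= end for d in days]
pass_rate = sum(in_range) / len(days)
print(len(days), pass_rate)
```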

Value Membership

df = spark.createDataFrame([[1, 10], [2, 15], [3, 17]], ["ID", "value"])
check = Check(CheckLevel.WARNING, "is_contained_in_number_test")
check.is_contained_in("value", (10, 15, 20, 25)).validate(spark, df)
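
For the data above, the membership rule can be worked out by hand. The sketch below does so in plain Python; reporting the share of matching rows as the pass rate is an assumption about how the result is derived.

```python
values = [10, 15, 17]            # the "value" column from the example
allowed = (10, 15, 20, 25)

# One boolean per row: is the value part of the allowed set?
matches = [v in allowed for v in values]
offending = [v for v in values if v not in allowed]
pass_rate = sum(matches) / len(values)
print(offending, round(pass_rate, 2))
```

Only two of the three rows are contained in the tuple, so with a threshold of 1.0 this check would be expected to fail.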

Regular Expressions

df = spark.createDataFrame([[1, "is_blue"], [2, "has_hat"], [3, "is_smart"]], ["ID", "desc"])
check = Check(CheckLevel.WARNING, "matches_regex_test")
check.matches_regex("desc", r"^is.*t$") # matches only is_smart, 33% of rows
assert check.validate(spark, df).first().status == "FAIL"
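
The 33% figure can be confirmed with Python's `re` module. Spark evaluates the pattern with Java regular expressions, so this is an approximation that holds for a simple anchored pattern like this one rather than a statement about Spark's engine.

```python
import re

pattern = re.compile(r"^is.*t$")
descriptions = ["is_blue", "has_hat", "is_smart"]

# Keep only the values that start with "is" and end with "t".
matched = [d for d in descriptions if pattern.search(d)]
pass_rate = len(matched) / len(descriptions)
print(matched, round(pass_rate, 2))
```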

Real Usage

check = Check(CheckLevel.WARNING, "CdsPricing")
(
    check
    .is_complete("BusinessDateTime")
    .is_complete("CMAEntityId")
    .is_complete("CMATicker")
    .is_complete("EntityName")
    .is_complete("Region")
    .is_complete("Seniority")
    .is_complete("Currency")
    .is_complete("RestructuringType")
    .is_complete("InstrumentType")
    .is_complete("Tenor")
    .is_complete("MaturityDate")
    .is_complete("MarketQuotingConvention")
    .is_complete("ObservedDerivedIndicator")
    .is_complete("Coupon")
    .is_complete("MarketRecoveryRate")
    .is_unique("CMATicker")
    .is_contained_in("Seniority", ["Senior", "SeniorLAC", "Subordinated"])
    .is_contained_in("InstrumentType", ["Index", "Single Name", "Tranche"])
    .is_contained_in("MarketQuotingConvention", ["PercentOfPar", "QuoteSpread", "Upfront"])
    .is_contained_in("ObservedDerivedIndicator", ["D", "O"])
    .is_between("Coupon", 25, 500)
    .is_between("MarketRecoveryRate", 0, 100)
    .is_between("Tenor", 0, 30)
    .validate(spark, df)
).show(truncate=False)
+---+----------+--------+----------+-------+------------------------+---------------+------------------------------------------+-----+---------+--------------+------+
|id |date      |time    |check     |level  |column                  |rule           |value                                     |rows |pass_rate|pass_threshold|status|
+---+----------+--------+----------+-------+------------------------+---------------+------------------------------------------+-----+---------+--------------+------+
|1  |2022-09-21|01:05:51|CdsPricing|WARNING|CMATicker               |is_unique      |N/A                                       |42462|0.06     |1.0           |FAIL  |
|2  |2022-09-21|01:05:51|CdsPricing|WARNING|MaturityDate            |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|3  |2022-09-21|01:05:51|CdsPricing|WARNING|MarketRecoveryRate      |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|4  |2022-09-21|01:05:51|CdsPricing|WARNING|InstrumentType          |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|5  |2022-09-21|01:05:51|CdsPricing|WARNING|CMATicker               |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|6  |2022-09-21|01:05:51|CdsPricing|WARNING|Seniority               |is_contained_in|('Senior', 'SeniorLAC', 'Subordinated')   |42462|1.0      |1.0           |PASS  |
|7  |2022-09-21|01:05:51|CdsPricing|WARNING|MarketQuotingConvention |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|8  |2022-09-21|01:05:51|CdsPricing|WARNING|Region                  |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|9  |2022-09-21|01:05:51|CdsPricing|WARNING|Coupon                  |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|10 |2022-09-21|01:05:51|CdsPricing|WARNING|BusinessDateTime        |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|11 |2022-09-21|01:05:51|CdsPricing|WARNING|InstrumentType          |is_contained_in|('Index', 'Single Name', 'Tranche')       |42462|1.0      |1.0           |PASS  |
|12 |2022-09-21|01:05:51|CdsPricing|WARNING|ObservedDerivedIndicator|is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|13 |2022-09-21|01:05:51|CdsPricing|WARNING|Coupon                  |is_between     |(25, 500)                                 |42462|1.0      |1.0           |PASS  |
|14 |2022-09-21|01:05:51|CdsPricing|WARNING|EntityName              |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|15 |2022-09-21|01:05:51|CdsPricing|WARNING|MarketRecoveryRate      |is_between     |(0, 100)                                  |42462|1.0      |1.0           |PASS  |
|16 |2022-09-21|01:05:51|CdsPricing|WARNING|Tenor                   |is_between     |(0, 30)                                   |42462|1.0      |1.0           |PASS  |
|17 |2022-09-21|01:05:51|CdsPricing|WARNING|RestructuringType       |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|18 |2022-09-21|01:05:51|CdsPricing|WARNING|ObservedDerivedIndicator|is_contained_in|('D', 'O')                                |42462|1.0      |1.0           |PASS  |
|19 |2022-09-21|01:05:51|CdsPricing|WARNING|Tenor                   |is_complete    |N/A                                       |42462|1.0      |1.0           |PASS  |
|20 |2022-09-21|01:05:51|CdsPricing|WARNING|MarketQuotingConvention |is_contained_in|('PercentOfPar', 'QuoteSpread', 'Upfront')|42462|1.0      |1.0           |PASS  |
+---+----------+--------+----------+-------+------------------------+---------------+------------------------------------------+-----+---------+--------------+------+
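
A result like the one above is usually consumed programmatically. The sketch below copies three rows from the table into plain dictionaries and picks out the failing rules. The `derive_status` helper is hypothetical: comparing `pass_rate >= pass_threshold` is an assumption that happens to agree with the rows shown, not a documented rule.

```python
from typing import Any, Dict, List

# Three rows copied from the table above (only the columns needed here).
results: List[Dict[str, Any]] = [
    {"id": 1, "column": "CMATicker", "rule": "is_unique",
     "pass_rate": 0.06, "pass_threshold": 1.0, "status": "FAIL"},
    {"id": 2, "column": "MaturityDate", "rule": "is_complete",
     "pass_rate": 1.0, "pass_threshold": 1.0, "status": "PASS"},
    {"id": 13, "column": "Coupon", "rule": "is_between",
     "pass_rate": 1.0, "pass_threshold": 1.0, "status": "PASS"},
]


def derive_status(pass_rate: float, pass_threshold: float) -> str:
    """Assumed relationship between pass_rate, pass_threshold and status."""
    return "PASS" if pass_rate >= pass_threshold else "FAIL"


failures = [(r["column"], r["rule"]) for r in results if r["status"] == "FAIL"]
consistent = all(
    derive_status(r["pass_rate"], r["pass_threshold"]) == r["status"] for r in results
)
print(failures, consistent)
```

With a real result, the same filter could be applied to the rows returned by `validate(spark, df).collect()`, since each row exposes the columns listed in the table header.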

More...

are_complete(*cols)
matches_regex(col, regex)
is_greater_than(col, val)
is_greater_or_equal_than(col, val)
is_less_than(col, val)
is_less_or_equal_than(col, val)
is_equal_than(col, val)
has_min(col, val)
has_max(col, val)
has_std(col, val)
has_percentile(col, value, percentile, precision, coverage)
is_between(col, i, k)
is_between(col, date_1, date_2)
has_min_by(col2, col1, value)
satisfies(predicate, coverage)

Roadmap

This is a very fresh implementation using the Observation API in PySpark v3.3.0. The next round of validations in the roadmap includes more practical use cases:

between_years(y1, y2)
in_business_day(col)
in_working_time(col)
in_weekend(col)
is_in_millions(col)
is_in_billions(col)
has_entropy(col)
has_correlation(col1, col2, value)
has_mutual_information(col1, col2)

Authors:

Herminio Vazquez
Virginie Grosboillot

License

Apache License 2.0. Free for commercial use, modification, distribution, patent use, and private use. Just preserve the copyright and license.

Release history

0.13.1

Jul 14, 2024

0.13.0

Jul 13, 2024

0.12.7

Jul 13, 2024

0.12.6

Jul 13, 2024

0.12.5

Jul 13, 2024

0.12.4

Jul 12, 2024

0.12.3

Jul 8, 2024

0.12.2

Jul 7, 2024

0.12.1

Jul 6, 2024

0.11.1

Jun 29, 2024

0.11.0

Jun 24, 2024

0.10.4

Jun 18, 2024

0.10.3

May 18, 2024

0.10.2

May 11, 2024

0.10.1

Apr 30, 2024

0.10.0

Mar 27, 2024

0.9.2

Mar 23, 2024

0.9.1

Mar 22, 2024

0.9.0

Mar 17, 2024

0.8.8

Mar 7, 2024

0.8.7

Mar 4, 2024

0.8.6

Mar 4, 2024

0.8.5

Feb 11, 2024

0.8.4

Feb 11, 2024

0.8.3

Feb 11, 2024

0.8.2

Feb 11, 2024

0.8.1

Feb 11, 2024

0.8.0

Feb 10, 2024

0.7.8

Feb 7, 2024

0.7.7

Feb 3, 2024

0.7.5

Jan 30, 2024

0.7.4

Jan 28, 2024

0.7.3

Dec 29, 2023

0.7.0

Dec 29, 2023

0.6.1

Oct 28, 2023

0.6.0

Oct 1, 2023

0.5.5

Oct 1, 2023

0.5.4

Sep 30, 2023

0.5.3

Sep 27, 2023

0.5.2

Sep 16, 2023

0.5.1

Sep 9, 2023

0.5.0

Aug 25, 2023

0.4.9

Aug 21, 2023

0.4.8

Aug 21, 2023

0.4.7

Jul 19, 2023

0.4.6

Jul 1, 2023

0.4.5

Jun 10, 2023

0.4.4

Jun 3, 2023

0.4.3

Jun 3, 2023

0.4.2

Jun 3, 2023

0.4.1

May 14, 2023

0.4.0

May 14, 2023

0.3.6

Feb 20, 2023

0.3.5

Feb 20, 2023

0.3.4

Feb 20, 2023

0.3.3

Feb 20, 2023

0.3.2

Feb 20, 2023

0.3.1

Dec 4, 2022

0.3.0

Dec 4, 2022

0.2.5

Nov 26, 2022

0.2.4

Nov 20, 2022

0.2.2

Nov 7, 2022

0.2.1

Nov 3, 2022

0.2.0

Oct 31, 2022

0.1.8

Oct 27, 2022

0.1.7

Oct 25, 2022

0.1.6

Oct 24, 2022

0.1.5

Oct 24, 2022

0.1.4

Oct 23, 2022

0.1.3

Oct 23, 2022

0.1.2

Oct 23, 2022

0.1.1

Oct 23, 2022

0.1.0

Oct 22, 2022

0.0.16

Oct 18, 2022

0.0.15

Oct 17, 2022

0.0.14

Oct 9, 2022

0.0.13

Oct 9, 2022

0.0.12

Oct 9, 2022

0.0.11

Oct 8, 2022

0.0.10

Sep 30, 2022

0.0.9

Sep 23, 2022

0.0.8

Sep 21, 2022

0.0.7

Sep 21, 2022

0.0.6

Sep 21, 2022

0.0.5

Sep 21, 2022

0.0.4

Sep 21, 2022

This version

0.0.3

Sep 20, 2022

0.0.2

Sep 20, 2022

Download files

Source Distribution

cuallee-0.0.3.tar.gz (8.3 kB)

Uploaded Sep 20, 2022 Source

Built Distribution

cuallee-0.0.3-py3-none-any.whl (7.5 kB)

Uploaded Sep 20, 2022 Python 3

Hashes for cuallee-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`763e2cf7467754fb11649ae35ee3eab74ea828314ed8ff039687d53ee25c3bb4`
MD5	`3afe8da2c5d85420274fed8d72e20f60`
BLAKE2b-256	`13c2436ec0958b75caf193b6622ebbcecc0496f8098d7260cf57b8e22b171440`

Hashes for cuallee-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d2dc2659a88b8e233c90bf12ad7adc170f83d551c1dae8658e2c46dda9294afc`
MD5	`5318fe0c096351d17477e1116f2a1770`
BLAKE2b-256	`b3318c29ebb02438bc476bdfc6b6b468ee5982c3a3d7eacba0f33b5bbc60f224`
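
The published digests can be used to verify a downloaded file. The following is a minimal sketch using Python's `hashlib`; the helper names `file_digests` and `matches_published` are hypothetical, and the archive is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digests(path: Path, chunk_size: int = 65536) -> Dict[str, str]:
    """Compute the three digest types listed above, reading the file once."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    blake = hashlib.blake2b(digest_size=32)  # 32 bytes gives BLAKE2b-256
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
            blake.update(chunk)
    return {
        "SHA256": sha256.hexdigest(),
        "MD5": md5.hexdigest(),
        "BLAKE2b-256": blake.hexdigest(),
    }


def matches_published(path: Path, algorithm: str, expected: str) -> bool:
    """Compare one computed digest with the published hex string."""
    return file_digests(path)[algorithm] == expected.strip().lower()


# Assumed location of the downloaded source distribution.
archive = Path("cuallee-0.0.3.tar.gz")
if archive.exists():
    published = "763e2cf7467754fb11649ae35ee3eab74ea828314ed8ff039687d53ee25c3bb4"
    print(matches_published(archive, "SHA256", published))
```

A result of `False` would mean the file differs from the one the digest was published for and should not be installed.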