A tool for regression testing Spark DataFrames in Python

pyspark-regression

pyspark-regression is a concise, no-nonsense library for regression testing between PySpark DataFrames.

For install instructions and API documentation, please visit https://forrest-bajbek.github.io/pyspark-regression/

What is a Regression Test?

A Regression Test ensures that a code change produces only the expected outcomes and introduces no new bugs. These tests are particularly challenging when working with database tables, as the result can be too large to inspect visually. When updating a SQL transformation, Data Engineers must ensure that no rows or columns were unintentionally altered, even if the table has hundreds of columns and billions of rows.

pyspark-regression reduces the complexity of Regression Testing by implementing a clean Python API for running regression tests between DataFrames in Apache Spark.

Example

Consider the following table:

| id | name    | price |
|---:|:--------|------:|
|  1 | Taco    | 3.001 |
|  2 | Burrito |  6.50 |
|  3 | flauta  |  7.50 |

Imagine you are a Data Engineer, and you want to change the underlying ETL so that:

  1. The price for Tacos is rounded to 2 decimal places.
  2. The name for Flautas is capitalized.

You make your changes, and the new table looks like this:

| id | name    | price |
|---:|:--------|------:|
|  1 | Taco    |  3.00 |
|  2 | Burrito |  6.50 |
|  3 | Flauta  |  7.50 |

Running a regression test will help you confirm that the new ETL changed the data as you expected, and changed nothing else.
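
Conceptually, such a test matches old and new rows on the primary key and reports every cell whose value changed. The sketch below shows that idea in plain Python, without Spark. It is not how pyspark-regression is implemented, and the helper name `diff_rows` is hypothetical.

```python
def diff_rows(old_rows, new_rows, pk):
    """Return (pk, column, old_value, new_value) for every changed cell.

    Hypothetical helper for illustration only. Rows are plain dicts, and
    only primary keys present in both tables are compared.
    """
    old_by_pk = {row[pk]: row for row in old_rows}
    new_by_pk = {row[pk]: row for row in new_rows}
    diffs = []
    for key in sorted(old_by_pk.keys() & new_by_pk.keys()):
        old, new = old_by_pk[key], new_by_pk[key]
        for column in old:
            if column != pk and old[column] != new.get(column):
                diffs.append((key, column, old[column], new.get(column)))
    return diffs


old_rows = [
    {"id": 1, "name": "Taco", "price": 3.001},
    {"id": 2, "name": "Burrito", "price": 6.50},
    {"id": 3, "name": "flauta", "price": 7.50},
]
new_rows = [
    {"id": 1, "name": "Taco", "price": 3.00},
    {"id": 2, "name": "Burrito", "price": 6.50},
    {"id": 3, "name": "Flauta", "price": 7.50},
]

diffs = diff_rows(old_rows, new_rows, pk="id")
for diff in diffs:
    print(diff)
```

For the two tables above, the only changed cells are the price of row 1 and the name of row 3, which are exactly the two intended changes. A dictionary-based comparison like this does not scale to billions of rows, which is why the comparison is run in Spark.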

Let's create the old and new tables as dataframes so we can run a Regression Test:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)
from pyspark_regression import RegressionTest

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 1)

schema = StructType(
    [
        StructField("id", IntegerType()),
        StructField("name", StringType()),
        StructField("price", DoubleType()),
    ]
)

# The old data
df_old = spark.createDataFrame(
    [
        (1, 'Taco', 3.001),
        (2, 'Burrito', 6.50),
        (3, 'flauta', 7.50),
    ],
    schema=schema
)

# The new data
df_new = spark.createDataFrame(
    [
        (1, 'Taco', 3.00),  # Corrected price
        (2, 'Burrito', 6.50),
        (3, 'Flauta', 7.50),  # Corrected name
    ],
    schema=schema
)

regression_test = RegressionTest(
    df_old=df_old,
    df_new=df_new,
    pk='id',
)

RegressionTest() returns an object whose properties let you inspect the differences between the two DataFrames. Most notably, the summary property holds a comprehensive analysis in Markdown, which you can print:

>>> print(regression_test.summary)

# Regression Test: df
- run_id: de9bd4eb-5313-4057-badc-7322ee23b83b
- run_time: 2022-05-25 08:53:50.581283

## Result: **FAILURE**.
Printing Regression Report...

### Table stats
- Count records in old df: 3
- Count records in new df: 3
- Count pks in old df: 3
- Count pks in new df: 3

### Diffs
- Columns with diffs: {'name', 'price'}
- Number of records with diffs: 2 (%oT: 66.7%)

 Diff Summary:
| column_name   | data_type   | diff_category        |   count_record | count_record_%oT   |
|:--------------|:------------|:---------------------|---------------:|:-------------------|
| name          | string      | capitalization added |              1 | 33.3%              |
| price         | double      | rounding             |              1 | 33.3%              |

 Diff Samples: (5 samples per column_name, per diff_category, per is_duplicate)
| column_name   | data_type   |   pk | old_value   | new_value   | diff_category        |
|:--------------|:------------|-----:|:------------|:------------|:---------------------|
| name          | string      |    3 | 'flauta'    | 'Flauta'    | capitalization added |
| price         | double      |    1 | 3.001       | 3.0         | rounding             |
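
The diff_category column labels how each value changed. The library's actual classification rules are not described here, so the function below is only a guess at how labels such as `rounding` and `capitalization added` could be derived. The name `categorize_diff` and every rule in it are assumptions made for illustration.

```python
def categorize_diff(old_value, new_value):
    """Guess a coarse label for a changed value (illustrative only)."""
    if old_value == new_value:
        return "no change"
    if isinstance(old_value, str) and isinstance(new_value, str):
        if old_value.lower() == new_value.lower():
            # Same letters, different case: compare uppercase counts.
            old_upper = sum(ch.isupper() for ch in old_value)
            new_upper = sum(ch.isupper() for ch in new_value)
            if new_upper > old_upper:
                return "capitalization added"
            if new_upper < old_upper:
                return "capitalization removed"
            return "capitalization changed"
        return "uncategorized"
    if isinstance(old_value, float) and isinstance(new_value, float):
        # Call it rounding if rounding the old value to some number of
        # decimal places reproduces the new value exactly.
        if any(round(old_value, places) == new_value for places in range(10)):
            return "rounding"
    return "uncategorized"


print(categorize_diff("flauta", "Flauta"))
print(categorize_diff(3.001, 3.0))
```

Under these assumed rules, the two sample rows in the report receive the same labels as in the Diff Summary table.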

The RegressionTest class also provides low-level access to the properties used to build the summary:

>>> print(regression_test.count_record_old) # count of records in df_old
3

>>> print(regression_test.count_record_new) # count of records in df_new
3

>>> print(regression_test.columns_diff) # Columns with diffs
{'name', 'price'}

>>> regression_test.df_diff.filter("column_name = 'price'").show() # Show all diffs for 'price' column
+-----------+---------+---+---------+---------+-------------+
|column_name|data_type| pk|old_value|new_value|diff_category|
+-----------+---------+---+---------+---------+-------------+
|      price|   double|  1|    3.001|      3.0|     rounding|
+-----------+---------+---+---------+---------+-------------+
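
Because these values are ordinary Python objects, they can feed an automated check, for example in a CI job. The sketch below assumes that `columns_diff` is a set of column names, as the printed output above suggests. `check_expected_diffs` is a hypothetical helper and not part of the library.

```python
def check_expected_diffs(columns_diff, expected_columns):
    """Raise AssertionError if any column outside expected_columns changed.

    Hypothetical helper: `columns_diff` is assumed to be a set of column
    names, such as the value printed for regression_test.columns_diff.
    """
    unexpected = set(columns_diff) - set(expected_columns)
    assert not unexpected, f"Unexpected diffs in columns: {sorted(unexpected)}"


# With the example above, the call would be:
#   check_expected_diffs(regression_test.columns_diff, {"name", "price"})
# A literal set stands in for that property here so the sketch runs
# without Spark.
check_expected_diffs({"name", "price"}, expected_columns={"name", "price"})
```

The check passes when every changed column was expected to change, and fails with the list of offending columns otherwise.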
