PySpark Assert
Simple unit testing library for PySpark.
This library is intended for unit testing PySpark code on small DataFrames, with functions similar to those in Pandas' testing module. The API provides two functions, assert_frame_equal and assert_schema_equal, which can be used in tests. The former compares two DataFrames and raises an AssertionError if they are not equal. The latter does the same, but for schemas.
Usage
Let's say we are testing some custom functionality over PySpark using Pytest.
from pyspark.sql import functions as f


def my_function(df):
    """Adds a column z = x + y."""
    return df.withColumn('z', f.col('x') + f.col('y'))
We can simply generate our input and output DataFrames, and compare the result against the expected one.
from pyspark.sql import SparkSession
from pyspark_assert import assert_frame_equal

from my_package import my_function

spark = SparkSession.builder.appName('Test').getOrCreate()


def test_my_function():  # PASSED :)
    input_df = spark.createDataFrame([(1, 2)], ['x', 'y'])
    expected_df = spark.createDataFrame([(1, 2, 3)], ['x', 'y', 'z'])
    output_df = my_function(input_df)
    assert_frame_equal(output_df, expected_df)
This function already calls assert_schema_equal, so there is no need to call it as well, but it can be used on its own when only the resulting schema of an operation needs to be checked. Both have similar APIs:
- Column types can be checked or ignored, in which case only the name will be checked.
- Column nullability can be ignored as well.
- Columns can have metadata, and it can be checked or not.
- Column order can be ignored. Duplicated column names are allowed, but they are tricky to disambiguate, so they are discouraged when column order is not being checked.
- Row order can be ignored (data only).
- Floating point imprecision can be taken into account (data only).
By default, all these checks are performed (types, nullability, metadata, column and row order, and exact float comparison), but each can be turned off by setting the corresponding parameter to False. For example:
assert_frame_equal(
    output_df,
    expected_df,
    check_types=False,
    check_nullable=False,
    check_metadata=False,
    check_column_order=False,
    check_row_order=False,
    check_exact=False,
)
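To make the schema-level switches concrete, here is a minimal pure-Python sketch of what they mean. It is not the library's implementation: the `Field` class and the `schemas_match` helper are hypothetical stand-ins for a PySpark schema and a schema comparison, and only the parameter names are taken from the call above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one column of a PySpark schema."""
    name: str
    dtype: str
    nullable: bool = True
    metadata: Dict[str, Any] = field(default_factory=dict)


def schemas_match(
    left: List[Field],
    right: List[Field],
    check_types: bool = True,
    check_nullable: bool = True,
    check_metadata: bool = True,
    check_column_order: bool = True,
) -> bool:
    """Compare two field lists, skipping whichever attributes are switched off."""

    def key(f: Field) -> tuple:
        # The column name is always compared; the other attributes are optional.
        return (
            f.name,
            f.dtype if check_types else None,
            f.nullable if check_nullable else None,
            # Sort metadata items so the key does not depend on dict ordering.
            tuple(sorted(f.metadata.items())) if check_metadata else None,
        )

    left_keys = [key(f) for f in left]
    right_keys = [key(f) for f in right]
    if not check_column_order:
        # Ignore position by sorting on the column name only. The sort is
        # stable, so duplicated names keep their relative order.
        left_keys.sort(key=lambda k: k[0])
        right_keys.sort(key=lambda k: k[0])
    return left_keys == right_keys


expected = [Field('x', 'bigint'), Field('y', 'bigint', nullable=False)]
actual = [Field('y', 'int'), Field('x', 'bigint')]

# With every check on, the type, nullability and order differences all count.
strict = schemas_match(actual, expected)

# With those three checks off, only the column names have to agree.
relaxed = schemas_match(
    actual,
    expected,
    check_types=False,
    check_nullable=False,
    check_column_order=False,
)
```

Because the sketch sorts on the name alone, two columns with the same name are only told apart by their original position, which is one way to see why duplicated names are discouraged when column order is ignored.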
Motivation
This library was written to avoid the following pattern in unit tests, which can cause several issues.
def test_my_function():
    input_df = spark.createDataFrame([(1, 2)], ['x', 'y'])
    expected_df = spark.createDataFrame([(1, 2, 3)], ['x', 'y', 'z'])
    output_df = my_function(input_df)
    assert output_df.collect() == expected_df.collect()
Some of the issues are:
- Types are not checked. Maybe we want a long column, but the function returns an integer column instead. Since Spark's integer and long values both become a Python int, comparing the results of collect may lead to false positives, and types have to be checked separately. This library checks types in the same call that checks the data.
- Order is not preserved. Group-by operations usually return their result in no particular order, so it is often necessary to show the resulting DataFrame to find out what order the expected data should have, or to order by some kind of primary key. This makes failing tests confusing, since it might not be clear which rows are failing. This library allows DataFrames to be compared in any row order without extra work.
- Floating point comparisons. Operations on floating point numbers usually carry some imprecision, which cannot be captured directly unless we round the values or apply a similar operation. For example, the above test with the famous x = 0.1 and y = 0.2 will fail, since x + y = 0.30000000000000004. This library can take this into account so that the test passes, even if row order is not being checked.
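The row-order and floating point issues can be reproduced without Spark. The sketch below uses plain Python tuples in place of collected rows. `rows_match` is a hypothetical helper written for illustration, not the library's implementation, and its `rel_tol` default is an assumption chosen for this example.

```python
import math
from typing import Sequence, Tuple

Row = Tuple[float, ...]


def rows_match(
    left: Sequence[Row],
    right: Sequence[Row],
    check_row_order: bool = True,
    check_exact: bool = True,
    rel_tol: float = 1e-9,  # assumed tolerance, not taken from the library
) -> bool:
    """Compare numeric rows, optionally ignoring order and tiny float errors."""
    if len(left) != len(right):
        return False
    if not check_row_order:
        # Sorting both sides aligns the rows. This is enough for simple cases
        # like the one below, but it is not a general matching algorithm.
        left, right = sorted(left), sorted(right)

    def same(a: float, b: float) -> bool:
        return a == b if check_exact else math.isclose(a, b, rel_tol=rel_tol)

    return all(
        len(lrow) == len(rrow) and all(same(a, b) for a, b in zip(lrow, rrow))
        for lrow, rrow in zip(left, right)
    )


# Same data, but in a different order and with x + y computed in floating point.
output = [(0.5, 0.25, 0.75), (0.1, 0.2, 0.1 + 0.2)]
expected = [(0.1, 0.2, 0.3), (0.5, 0.25, 0.75)]

# Plain equality, as in the collect-based test, rejects the result.
naive = output == expected

# Ignoring row order and allowing a small tolerance accepts it.
tolerant = rows_match(output, expected, check_row_order=False, check_exact=False)
```

Here `naive` is False while `tolerant` is True, which mirrors the difference between comparing collected rows directly and comparing them with row order and exactness checks turned off.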