Testframework for PySpark DataFrames

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

tiddoloos tomergabay

These details have not been verified by PyPI

Project description

Build Status

pyspark-testframework

The goal of the pyspark-testframework is to provide a simple way to create tests for PySpark DataFrames. The test results are returned in DataFrame format as well.

[!NOTE] From version v3.*.* we changed from a wide-format to a long-format structure for storing test results. This long-format approach makes it easier to:

Filter and analyze specific test results

Add new tests without changing the schema

Perform aggregations across different tests

Export results to other systems

Track when tests were executed

Include actual values that were tested for debugging

Test Results

The framework uses a long-format structure for storing test results. Each test result is stored as a separate row with the following columns:

primary_key: Primary key value as string (e.g., "1", "2", "3")
primary_key_col: Name of the primary key column (e.g., "id")
test_name: Name of the test (e.g., "ValidStreetFormat")
test_col: Name of the column that was tested (e.g., "street")
test_value: The actual value that was tested (e.g., "Rochussenstraat")
test_result: Boolean result of the test (True/False)
test_description: Description of the test
timestamp: UTC timestamp when the test was executed
Additional columns: Any additional context columns specified during initialization (e.g., if you pass context_cols=["street", "house_number"], these columns will be included in the results)

Tutorial

Let's first create an example pyspark DataFrame

The data will contain the primary keys, street names and house numbers of some addresses.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import functions as F

# Initialize Spark session
spark = SparkSession.builder.appName("PySparkTestFrameworkTutorial").getOrCreate()

# Define the schema
schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("street", StringType(), True),
        StructField("house_number", IntegerType(), True),
    ]
)

# Define the data
data = [
    (1, "Rochussenstraat", 27),
    (2, "Coolsingel", 31),
    (3, "%Witte de Withstraat", 27),
    (4, "Lijnbaan", -3),
    (5, None, 13),
]

df = spark.createDataFrame(data, schema)

df.show(truncate=False)

+---+--------------------+------------+
|id |street              |house_number|
+---+--------------------+------------+
|1  |Rochussenstraat     |27          |
|2  |Coolsingel          |31          |
|3  |%Witte de Withstraat|27          |
|4  |Lijnbaan            |-3          |
|5  |null                |13          |
+---+--------------------+------------+

Import and initialize the DataFrameTester

from testframework.dataquality import DataFrameTester

df_tester = DataFrameTester(
    df=df,
    primary_key="id",
    spark=spark,
)

Import configurable tests

from testframework.dataquality.tests import ValidNumericRange, RegexTest

Initialize the RegexTest to test for valid street names

valid_street_format = RegexTest(
    name="ValidStreetFormat",
    pattern=r"^[A-Z][a-zéèáàëï]*([ -][A-Z]?[a-zéèáàëï]*)*$",
)

Run valid_street_format on the street column using the .test() method of DataFrameTester.

df_tester.test(
    col="street",
    test=valid_street_format,
    nullable=False,  # nullable is False, hence null values are converted to False
    description="Street is in valid Dutch street format",
).show(truncate=False)

+-----------+-------------------------+-----------+--------------------+--------------------------------------+--------+
|primary_key|test_name                |test_result|test_value          |test_description                      |test_col|
+-----------+-------------------------+-----------+--------------------+--------------------------------------+--------+
|1          |street__ValidStreetFormat|true       |Rochussenstraat     |Street is in valid Dutch street format|street  |
|2          |street__ValidStreetFormat|true       |Coolsingel          |Street is in valid Dutch street format|street  |
|3          |street__ValidStreetFormat|false      |%Witte de Withstraat|Street is in valid Dutch street format|street  |
|4          |street__ValidStreetFormat|true       |Lijnbaan            |Street is in valid Dutch street format|street  |
|5          |street__ValidStreetFormat|false      |null                |Street is in valid Dutch street format|street  |
+-----------+-------------------------+-----------+--------------------+--------------------------------------+--------+

Run the IntegerString test on the number column

By setting the return_failed_rows parameter to True, we can get only the rows that failed the test.

df_tester.test(
    col="house_number",
    test=ValidNumericRange(
        min_value=1,
    ),
    nullable=False,
    # description="House number is in a valid format" # optional, let's not define it for illustration purposes
    return_failed_rows=True,  # only return the failed rows
).show()

+-----------+--------------------+-----------+----------+-----------------+------------+
|primary_key|           test_name|test_result|test_value| test_description|    test_col|
+-----------+--------------------+-----------+----------+-----------------+------------+
|          4|house_number__Val...|      false|        -3|ValidNumericRange|house_number|
+-----------+--------------------+-----------+----------+-----------------+------------+

Let's take a look at the test results of the DataFrame using the .results attribute.

df_tester.results.show(truncate=False)

+-----------+-------------------------------+-----------+--------------------+--------------------------------------+------------+---------------+-----------------------+
|primary_key|test_name                      |test_result|test_value          |test_description                      |test_col    |primary_key_col|timestamp              |
+-----------+-------------------------------+-----------+--------------------+--------------------------------------+------------+---------------+-----------------------+
|1          |street__ValidStreetFormat      |true       |Rochussenstraat     |Street is in valid Dutch street format|street      |id             |2025-10-13 15:30:53.094|
|2          |street__ValidStreetFormat      |true       |Coolsingel          |Street is in valid Dutch street format|street      |id             |2025-10-13 15:30:53.094|
|3          |street__ValidStreetFormat      |false      |%Witte de Withstraat|Street is in valid Dutch street format|street      |id             |2025-10-13 15:30:53.094|
|4          |street__ValidStreetFormat      |true       |Lijnbaan            |Street is in valid Dutch street format|street      |id             |2025-10-13 15:30:53.094|
|5          |street__ValidStreetFormat      |false      |null                |Street is in valid Dutch street format|street      |id             |2025-10-13 15:30:53.094|
|1          |house_number__ValidNumericRange|true       |27                  |ValidNumericRange                     |house_number|id             |2025-10-13 15:30:53.094|
|2          |house_number__ValidNumericRange|true       |31                  |ValidNumericRange                     |house_number|id             |2025-10-13 15:30:53.094|
|3          |house_number__ValidNumericRange|true       |27                  |ValidNumericRange                     |house_number|id             |2025-10-13 15:30:53.094|
|4          |house_number__ValidNumericRange|false      |-3                  |ValidNumericRange                     |house_number|id             |2025-10-13 15:30:53.094|
|5          |house_number__ValidNumericRange|true       |13                  |ValidNumericRange                     |house_number|id             |2025-10-13 15:30:53.094|
+-----------+-------------------------------+-----------+--------------------+--------------------------------------+------------+---------------+-----------------------+

Custom tests

Sometimes tests are too specific or complex to be covered by the configurable tests. That's why we can create custom tests and add them to the DataFrameTester object.

Let's do this using a custom test which should tests that every house has a bath room. We'll start by creating a new DataFrame with rooms rather than houses.

rooms = [
    (1, 1, "living room"),
    (2, 1, "bathroom"),
    (3, 1, "kitchen"),
    (4, 1, "bed room"),
    (5, 2, "living room"),
    (6, 2, "bed room"),
    (7, 2, "kitchen"),
]

schema_rooms = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("house_id", IntegerType(), True),
        StructField("room", StringType(), True),
    ]
)

room_df = spark.createDataFrame(rooms, schema=schema_rooms)

room_df.show(truncate=False)

+---+--------+-----------+
|id |house_id|room       |
+---+--------+-----------+
|1  |1       |living room|
|2  |1       |bathroom   |
|3  |1       |kitchen    |
|4  |1       |bed room   |
|5  |2       |living room|
|6  |2       |bed room   |
|7  |2       |kitchen    |
+---+--------+-----------+

To create a custom test, we should create a pyspark DataFrame which contains the same primary_key column as the DataFrame to be tested using the DataFrameTester.

Let's create a boolean column that indicates whether the house has a bath room or not.

house_has_bathroom = room_df.groupBy("house_id").agg(
    F.max(F.when(F.col("room") == "bathroom", True).otherwise(False)).alias(
        "has_bathroom"
    )
)

house_has_bathroom.show(truncate=False)

+--------+------------+
|house_id|has_bathroom|
+--------+------------+
|1       |true        |
|2       |false       |
+--------+------------+

We can add this 'custom test' to the DataFrameTester using add_custom_test_result.

In the background, all kinds of data validation checks are done by DataFrameTester to make sure that it fits the requirements to be added to the other test results.

df_tester.add_custom_test_result(
    result=house_has_bathroom.withColumnRenamed("house_id", "id"),
    name="has_bathroom",
    description="House has a bathroom",
    # fillna_value=0, # optional; by default null.
).show(truncate=False)

+-----------+------------+-----------+-----------------------+--------------------+---------------------+---------------+-----------------------+
|primary_key|test_name   |test_result|test_value             |test_description    |test_col             |primary_key_col|timestamp              |
+-----------+------------+-----------+-----------------------+--------------------+---------------------+---------------+-----------------------+
|1          |has_bathroom|true       |__custom__test__value__|House has a bathroom|__custom__test__col__|id             |2025-10-13 15:30:59.902|
|2          |has_bathroom|false      |__custom__test__value__|House has a bathroom|__custom__test__col__|id             |2025-10-13 15:30:59.902|
+-----------+------------+-----------+-----------------------+--------------------+---------------------+---------------+-----------------------+

Despite that the data whether a house has a bath room is not available in the house DataFrame; we can still add the custom test to the DataFrameTester object.

df_tester.results.show(truncate=False)

+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+---------------+-----------------------+
|primary_key|test_name                      |test_result|test_value             |test_description                      |test_col             |primary_key_col|timestamp              |
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+---------------+-----------------------+
|1          |street__ValidStreetFormat      |true       |Rochussenstraat        |Street is in valid Dutch street format|street               |id             |2025-10-13 15:31:20.538|
|2          |street__ValidStreetFormat      |true       |Coolsingel             |Street is in valid Dutch street format|street               |id             |2025-10-13 15:31:20.538|
|3          |street__ValidStreetFormat      |false      |%Witte de Withstraat   |Street is in valid Dutch street format|street               |id             |2025-10-13 15:31:20.538|
|4          |street__ValidStreetFormat      |true       |Lijnbaan               |Street is in valid Dutch street format|street               |id             |2025-10-13 15:31:20.538|
|5          |street__ValidStreetFormat      |false      |null                   |Street is in valid Dutch street format|street               |id             |2025-10-13 15:31:20.538|
|1          |house_number__ValidNumericRange|true       |27                     |ValidNumericRange                     |house_number         |id             |2025-10-13 15:31:20.538|
|2          |house_number__ValidNumericRange|true       |31                     |ValidNumericRange                     |house_number         |id             |2025-10-13 15:31:20.538|
|3          |house_number__ValidNumericRange|true       |27                     |ValidNumericRange                     |house_number         |id             |2025-10-13 15:31:20.538|
|4          |house_number__ValidNumericRange|false      |-3                     |ValidNumericRange                     |house_number         |id             |2025-10-13 15:31:20.538|
|5          |house_number__ValidNumericRange|true       |13                     |ValidNumericRange                     |house_number         |id             |2025-10-13 15:31:20.538|
|1          |has_bathroom                   |true       |__custom__test__value__|House has a bathroom                  |__custom__test__col__|id             |2025-10-13 15:31:20.538|
|2          |has_bathroom                   |false      |__custom__test__value__|House has a bathroom                  |__custom__test__col__|id             |2025-10-13 15:31:20.538|
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+---------------+-----------------------+

We can also get a summary of the test results using the .summary attribute.

df_tester.summary.show(truncate=False)

+-------------------------------+--------------------------------------+---------------------+-------+--------+-----------------+--------+-----------------+---------------+-----------------------+
|test_name                      |test_description                      |test_col             |n_tests|n_passed|percentage_passed|n_failed|percentage_failed|primary_key_col|timestamp              |
+-------------------------------+--------------------------------------+---------------------+-------+--------+-----------------+--------+-----------------+---------------+-----------------------+
|has_bathroom                   |House has a bathroom                  |__custom__test__col__|2      |1       |50.0             |1       |50.0             |id             |2025-10-13 15:31:33.733|
|house_number__ValidNumericRange|ValidNumericRange                     |house_number         |5      |4       |80.0             |1       |20.0             |id             |2025-10-13 15:31:33.733|
|street__ValidStreetFormat      |Street is in valid Dutch street format|street               |5      |3       |60.0             |2       |40.0             |id             |2025-10-13 15:31:33.733|
+-------------------------------+--------------------------------------+---------------------+-------+--------+-----------------+--------+-----------------+---------------+-----------------------+

If you want to see all rows that failed any of the tests, you can use the .failed_tests attribute.

df_tester.failed_tests.show(truncate=False)

+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+
|primary_key|test_name                      |test_result|test_value             |test_description                      |test_col             |
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+
|3          |street__ValidStreetFormat      |false      |%Witte de Withstraat   |Street is in valid Dutch street format|street               |
|5          |street__ValidStreetFormat      |false      |null                   |Street is in valid Dutch street format|street               |
|4          |house_number__ValidNumericRange|false      |-3                     |ValidNumericRange                     |house_number         |
|2          |has_bathroom                   |false      |__custom__test__value__|House has a bathroom                  |__custom__test__col__|
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+

Of course, you can also see all rows that passed all tests using the .passed_tests attribute.

df_tester.passed_tests.show(truncate=False)

+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+
|primary_key|test_name                      |test_result|test_value             |test_description                      |test_col             |
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+
|1          |street__ValidStreetFormat      |true       |Rochussenstraat        |Street is in valid Dutch street format|street               |
|2          |street__ValidStreetFormat      |true       |Coolsingel             |Street is in valid Dutch street format|street               |
|4          |street__ValidStreetFormat      |true       |Lijnbaan               |Street is in valid Dutch street format|street               |
|1          |house_number__ValidNumericRange|true       |27                     |ValidNumericRange                     |house_number         |
|2          |house_number__ValidNumericRange|true       |31                     |ValidNumericRange                     |house_number         |
|3          |house_number__ValidNumericRange|true       |27                     |ValidNumericRange                     |house_number         |
|5          |house_number__ValidNumericRange|true       |13                     |ValidNumericRange                     |house_number         |
|1          |has_bathroom                   |true       |__custom__test__value__|House has a bathroom                  |__custom__test__col__|
+-----------+-------------------------------+-----------+-----------------------+--------------------------------------+---------------------+

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

tiddoloos tomergabay

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

3.1.0

Mar 26, 2026

3.0.0

Oct 14, 2025

2.9.1

Oct 13, 2025

2.9.0

Oct 12, 2025

2.8.0

Sep 16, 2025

2.7.0

Jun 3, 2025

2.6.0

May 15, 2025

2.5.0

Dec 17, 2024

2.4.1

Oct 10, 2024

2.4.0

Oct 3, 2024

2.3.1

Oct 3, 2024

2.3.0

Aug 15, 2024

2.2.3

Aug 13, 2024

2.2.1

Aug 12, 2024

2.1.1

Aug 12, 2024

2.1.0

Aug 12, 2024

2.0.0

Jun 22, 2024

1.1.0

Jun 15, 2024

1.0.1

Jun 15, 2024

1.0.0

Jun 14, 2024

0.2.0

Jun 14, 2024

0.1.0

Jun 14, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyspark_testframework-3.1.0.tar.gz (78.5 kB view details)

Uploaded Mar 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyspark_testframework-3.1.0-py3-none-any.whl (23.8 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file pyspark_testframework-3.1.0.tar.gz.

File metadata

Download URL: pyspark_testframework-3.1.0.tar.gz
Upload date: Mar 26, 2026
Size: 78.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyspark_testframework-3.1.0.tar.gz
Algorithm	Hash digest
SHA256	`97ac5705e6b3f24f2bcfaa55a372f8a15b7835aef127a5bf0dfdf2ce278f92e0`
MD5	`fb0118c977b7e8f68d083f512670dff2`
BLAKE2b-256	`4f3701be8bb899cc5445318938de0a14d625af95d10205dc71eb66e4b6008beb`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyspark_testframework-3.1.0.tar.gz:

Publisher: publish-to-pypi.yml on woonstadrotterdam/pyspark-testframework

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyspark_testframework-3.1.0.tar.gz
- Subject digest: 97ac5705e6b3f24f2bcfaa55a372f8a15b7835aef127a5bf0dfdf2ce278f92e0
- Sigstore transparency entry: 1185902506
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: woonstadrotterdam/pyspark-testframework@dba9ba91a0a92dd5d7cff4071469ae2e6c64c73e
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/woonstadrotterdam
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@dba9ba91a0a92dd5d7cff4071469ae2e6c64c73e
- Trigger Event: push

File details

Details for the file pyspark_testframework-3.1.0-py3-none-any.whl.

File metadata

Download URL: pyspark_testframework-3.1.0-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 23.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyspark_testframework-3.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9c28ee3397ca5fa4838b3fdb899cd40b064f392cf077df5232d73a2e22b7a2ac`
MD5	`a12dd6076496dde699ecfcfe57e602aa`
BLAKE2b-256	`d8339597dba1784d363451d455bd1b68c341afb63613d01052d873bf72ac012d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pyspark_testframework-3.1.0-py3-none-any.whl:

Publisher: publish-to-pypi.yml on woonstadrotterdam/pyspark-testframework

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyspark_testframework-3.1.0-py3-none-any.whl
- Subject digest: 9c28ee3397ca5fa4838b3fdb899cd40b064f392cf077df5232d73a2e22b7a2ac
- Sigstore transparency entry: 1185902511
- Sigstore integration time: Mar 26, 2026
Source repository:
- Permalink: woonstadrotterdam/pyspark-testframework@dba9ba91a0a92dd5d7cff4071469ae2e6c64c73e
- Branch / Tag: refs/tags/v3.1.0
- Owner: https://github.com/woonstadrotterdam
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@dba9ba91a0a92dd5d7cff4071469ae2e6c64c73e
- Trigger Event: push

pyspark-testframework 3.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

pyspark-testframework

Test Results

Tutorial

Custom tests

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance