Pyspark test helper library

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

mrpowers

These details have not been verified by PyPI

Project description

chispa

chispa provides fast PySpark test helper methods that output descriptive error messages.

This library makes it easy to write high quality PySpark code.

Fun fact: "chispa" means Spark in Spanish ;)

Installation

Install the latest version with

pip install chispa

Or if you use Poetry add this library as a development dependency with

poetry add chispa --group dev

Column equality

Suppose you have a function that removes the non-word characters in a string.

def remove_non_word_characters(col):
    return F.regexp_replace(col, "[^\\w\\s]+", "")

Create a SparkSession so you can create DataFrames.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .master("local")
  .appName("chispa")
  .getOrCreate())

Create a DataFrame with a column that contains strings with non-word characters, run the remove_non_word_characters function, and check that all these characters are removed with the chispa assert_column_equality method.

import pytest

from chispa.column_comparer import assert_column_equality
import pyspark.sql.functions as F

def test_remove_non_word_characters_short():
    data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")

Let's write another test that'll fail to see how the descriptive error message lets you easily debug the underlying issue.

Here's the failing test:

def test_remove_non_word_characters_nice_error():
    data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")

Here's the nicely formatted error message:

ColumnsNotEqualError

You can see the matt7 / matt row of data is what's causing the error (note it's highlighted in red). The other rows are colored blue because they're equal.

DataFrame equality

We can also test the remove_non_word_characters method by creating two DataFrames and verifying that they're equal.

Creating two DataFrames is slower and requires more code, but comparing entire DataFrames is necessary for some tests.

from chispa.dataframe_comparer import *

def test_remove_non_word_characters_long():
    source_data = [
        ("jo&&se",),
        ("**li**",),
        ("#::luisa",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)

Let's write another test that'll return an error, so you can see the descriptive error message.

def test_remove_non_word_characters_long_error():
    source_data = [
        ("matt7",),
        ("bill&",),
        ("isabela*",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)

Here's the nicely formatted error message:

DataFramesNotEqualError

Ignore row order

You can easily compare DataFrames, ignoring the order of the rows. The content of the DataFrames is usually what matters, not the order of the rows.

Here are the contents of df1:

+--------+
|some_num|
+--------+
|       1|
|       2|
|       3|
+--------+

Here are the contents of df2:

+--------+
|some_num|
+--------+
|       2|
|       1|
|       3|
+--------+

Here's how to confirm df1 and df2 are equal when the row order is ignored.

assert_df_equality(df1, df2, ignore_row_order=True)

If you don't specify to ignore_row_order then the test will error out with this message:

ignore_row_order_false

The rows aren't ordered by default because sorting slows down the function.

Ignore column order

This section explains how to compare DataFrames, ignoring the order of the columns.

Suppose you have the following df1:

+----+----+
|num1|num2|
+----+----+
|   1|   7|
|   2|   8|
|   3|   9|
+----+----+

Here are the contents of df2:

+----+----+
|num2|num1|
+----+----+
|   7|   1|
|   8|   2|
|   9|   3|
+----+----+

Here's how to compare the equality of df1 and df2, ignoring the column order:

assert_df_equality(df1, df2, ignore_column_order=True)

Here's the error message you'll see if you run assert_df_equality(df1, df2), without ignoring the column order.

ignore_column_order_false

Ignore specific columns

This section explains how to compare DataFrames, ignoring specific columns.

Suppose you have the following df1:

+------------+-------------+
|    name    | clean_name  |
+------------+-------------+
| "matt7"    |   "matt7"    |
| "bill&"    |   "bill"    |
| "isabela*" |   "isabela" |
| "None"     |   "None"    |
+------------+-------------+

Here are the contents of df2:

+------------+-------------+
|    name    | clean_name  |
+------------+-------------+
| "matt7"    |   "matt"    |
| "bill&"    |   "bill"    |
| "isabela*" |   "isabela" |
| "None"     |   "None"    |
+------------+-------------+

Here's how to compare the equality of df1 and df2, ignoring the column clean_name:

assert_df_equality(df1, df2, ignore_columns=["clean_name"])

Here's the error message you'll see if you run assert_df_equality(df1, df2), without ignoring the column clean_name.

ignore_columns_none

Ignore nullability

Each column in a schema has three properties: a name, data type, and nullable property. The column can accept null values if nullable is set to true.

You'll sometimes want to ignore the nullable property when making DataFrame comparisons.

Suppose you have the following df1:

+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+

And this df2:

+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+

You might be surprised to find that in this example, df1 and df2 are not equal and will error out with this message:

nullable_off_error

Examine the code in this contrived example to better understand the error:

def ignore_nullable_property():
    s1 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), True)])
    df1 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s1)
    s2 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), False)])
    df2 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s2)
    assert_df_equality(df1, df2)

You can ignore the nullable property when assessing equality by adding a flag:

assert_df_equality(df1, df2, ignore_nullable=True)

Elements contained within an ArrayType() also have a nullable property, in addition to the nullable property of the column schema. These are also ignored when passing ignore_nullable=True.

Again, examine the following code to understand the error that ignore_nullable=True bypasses:

def ignore_nullable_property_array():
    s1 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), True), True),])
    df1 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s1)
    s2 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), False), True),])
    df2 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s2)
    assert_df_equality(df1, df2)

Allow NaN equality

Python has NaN (not a number) values and two NaN values are not considered equal by default. Create two NaN values, compare them, and confirm they're not considered equal by default.

nan1 = float('nan')
nan2 = float('nan')
nan1 == nan2 # False

pandas considers NaN values to be equal by default, but this library requires you to set a flag to consider two NaN values to be equal.

assert_df_equality(df1, df2, allow_nan_equality=True)

Customize formatting

You can specify custom formats for the printed error messages as follows:

from chispa import FormattingConfig

formats = FormattingConfig(
        mismatched_rows={"color": "light_yellow"},
        matched_rows={"color": "cyan", "style": "bold"},
        mismatched_cells={"color": "purple"},
        matched_cells={"color": "blue"},
    )

assert_basic_rows_equality(df1.collect(), df2.collect(), formats=formats)

or similarly:

from chispa import FormattingConfig, Color, Style

formats = FormattingConfig(
        mismatched_rows={"color": Color.LIGHT_YELLOW},
        matched_rows={"color": Color.CYAN, "style": Style.BOLD},
        mismatched_cells={"color": Color.PURPLE},
        matched_cells={"color": Color.BLUE},
    )

assert_basic_rows_equality(df1.collect(), df2.collect(), formats=formats)

You can also define these formats in conftest.py and inject them via a fixture:

@pytest.fixture()
def chispa_formats():
    return FormattingConfig(
        mismatched_rows={"color": "light_yellow"},
        matched_rows={"color": "cyan", "style": "bold"},
        mismatched_cells={"color": "purple"},
        matched_cells={"color": "blue"},
    )

def test_shows_assert_basic_rows_equality(chispa_formats):
  ...
  assert_basic_rows_equality(df1.collect(), df2.collect(), formats=chispa_formats)

custom_formats

Approximate column equality

We can check if columns are approximately equal, which is especially useful for floating number comparisons.

Here's a test that creates a DataFrame with two floating point columns and verifies that the columns are approximately equal. In this example, values are considered approximately equal if the difference is less than 0.1.

def test_approx_col_equality_same():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 3.37),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)

Here's an example of a test with columns that are not approximately equal.

def test_approx_col_equality_different():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 5.0),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)

This failing test will output a readable error message so the issue is easy to debug.

ColumnsNotEqualError

Approximate DataFrame equality

Let's create two DataFrames and confirm they're approximately equal.

def test_approx_df_equality_same():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.05, "a"),
        (2.13, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)

The assert_approx_df_equality method is smart and will only perform approximate equality operations for floating point numbers in DataFrames. It'll perform regular equality for strings and other types.

Let's perform an approximate equality comparison for two DataFrames that are not equal.

def test_approx_df_equality_different():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.1, "a"),
        (5.0, "b"),
        (3.3, "z"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)

Here's the pretty error message that's outputted:

DataFramesNotEqualError

Schema mismatch messages

DataFrame equality messages peform schema comparisons before analyzing the actual content of the DataFrames. DataFrames that don't have the same schemas should error out as fast as possible.

Let's compare a DataFrame that has a string column an integer column with a DataFrame that has two integer columns to observe the schema mismatch message.

def test_schema_mismatch_message():
    data1 = [
        (1, "a"),
        (2, "b"),
        (3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1, 6),
        (2, 7),
        (3, 8),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "num2"])

    assert_df_equality(df1, df2)

Here's the error message:

SchemasNotEqualError

Supported PySpark / Python versions

chispa currently supports PySpark 2.4+ and Python 3.5+.

Use chispa v0.8.2 if you're using an older Python version.

PySpark 2 support will be dropped when chispa 1.x is released.

Benchmarks

TODO: Need to benchmark these methods vs. the spark-testing-base ones

Developing chispa on your local machine

You are encouraged to clone and/or fork this repo.

This project uses Poetry for packaging and dependency management.

Setup the virtual environment with poetry install
Run the tests with poetry run pytest tests

Studying the codebase is a great way to learn about PySpark!

Contributing

Anyone is encouraged to submit a pull request, open an issue, or submit a bug report.

We're happy to promote folks to be library maintainers if they make good contributions.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

mrpowers

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.12.0

Mar 24, 2026

0.11.1

Apr 12, 2025

0.11.0

Mar 31, 2025

0.10.1

Jul 31, 2024

0.10.0

Feb 20, 2024

0.9.4

Oct 1, 2023

0.9.3

Sep 8, 2023

0.9.2

Feb 21, 2022

0.9.1

Feb 21, 2022

0.9.0

Feb 21, 2022

0.8.3

Feb 18, 2022

0.8.2

Apr 22, 2021

0.8.1

Mar 30, 2021

0.8.0

Mar 26, 2021

0.7.1

Mar 23, 2021

0.7.0

Feb 6, 2021

0.6.0

Jul 8, 2020

0.5.0

Jun 22, 2020

0.4.0

Jun 13, 2020

0.3.0

Jun 9, 2020

0.2.1

Jun 6, 2020

0.2.0

Jun 6, 2020

0.0.2

Mar 24, 2019

0.0.1

Mar 23, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chispa-0.12.0.tar.gz (19.7 kB view details)

Uploaded Mar 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

chispa-0.12.0-py3-none-any.whl (21.0 kB view details)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file chispa-0.12.0.tar.gz.

File metadata

Download URL: chispa-0.12.0.tar.gz
Upload date: Mar 24, 2026
Size: 19.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chispa-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`4dcf4de9481979af77f8d697cb0f27f2599c21244537b4995204cdc529ce9a1b`
MD5	`1dc865c9e6a67f48aa6c16e2e9712cac`
BLAKE2b-256	`806b271e73b92c72ae4c4223a9e98ec5d7c8cb6349f33e33e64a2688116751b7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for chispa-0.12.0.tar.gz:

Publisher: release.yml on MrPowers/chispa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chispa-0.12.0.tar.gz
- Subject digest: 4dcf4de9481979af77f8d697cb0f27f2599c21244537b4995204cdc529ce9a1b
- Sigstore transparency entry: 1172718790
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: MrPowers/chispa@a0fbeb8b21ba515e1101b451a66814411865b767
- Branch / Tag: refs/tags/v0.12.0
- Owner: https://github.com/MrPowers
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a0fbeb8b21ba515e1101b451a66814411865b767
- Trigger Event: release

File details

Details for the file chispa-0.12.0-py3-none-any.whl.

File metadata

Download URL: chispa-0.12.0-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 21.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chispa-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7a04239d5edf2001f3d5b1f349ce6378233684901915e8c984609ea5837f91d`
MD5	`9a6676a7f365de7cdc3dc38b5528ea16`
BLAKE2b-256	`80199a431f9f52cbaa5ba5b0d46f489337f51948cf579cb7eea7137c675fdb73`

See more details on using hashes here.

Provenance

The following attestation bundles were made for chispa-0.12.0-py3-none-any.whl:

Publisher: release.yml on MrPowers/chispa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chispa-0.12.0-py3-none-any.whl
- Subject digest: e7a04239d5edf2001f3d5b1f349ce6378233684901915e8c984609ea5837f91d
- Sigstore transparency entry: 1172718907
- Sigstore integration time: Mar 24, 2026
Source repository:
- Permalink: MrPowers/chispa@a0fbeb8b21ba515e1101b451a66814411865b767
- Branch / Tag: refs/tags/v0.12.0
- Owner: https://github.com/MrPowers
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a0fbeb8b21ba515e1101b451a66814411865b767
- Trigger Event: release

chispa 0.12.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

chispa

Installation

Column equality

DataFrame equality

Ignore row order

Ignore column order

Ignore specific columns

Ignore nullability

Allow NaN equality

Customize formatting

Approximate column equality

Approximate DataFrame equality

Schema mismatch messages

Supported PySpark / Python versions

Benchmarks

Developing chispa on your local machine

Contributing

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance