Non-invasive health checks for Pandas method chains

Pandas Checks

What is it?

Pandas Checks is a Python package for data science and data engineering. It adds non-invasive health checks for Pandas method chains.

It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.

So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.

As Fleetwood Mac says, you would never break the chain.

💡 Tip:
See the full documentation for all the details on the what, why, and how of Pandas Checks.

Installation

Install the package from the command line:

pip install pandas-checks

Then import it in your Python code:

import pandas_checks

It works in Jupyter notebooks, IPython, and Python scripts run from the command line.

Usage

Pandas Checks adds .check methods to Pandas DataFrames and Series.

Say you have a nice function.

import pandas as pd


def clean_iris_data(iris: pd.DataFrame) -> pd.DataFrame:
    """Preprocess data about pretty flowers.

    Args:
        iris: The raw iris dataset.

    Returns:
        The cleaned iris dataset.
    """

    return (
        iris
        .dropna()
        .rename(columns={"FLOWER_SPECIES": "species"})
        .query("species=='setosa'")
    )

But what if you want to make the chain more robust? Or see what's happening to the data as it flows down the pipeline? Or understand why your new iris CSV suddenly makes the cleaned data look weird?

You can add some .check steps.

(
    iris
    .dropna()
    .rename(columns={"FLOWER_SPECIES": "species"})

    # Validate assumptions
    .check.assert_positive(subset=["petal_length", "sepal_length"])

    # Plot the distribution of a column after cleaning
    .check.hist(column="petal_length")

    .query("species=='setosa'")

    # Display the first few rows after cleaning
    .check.head(3)
)

The .check methods will display the following results:

Sample output

The .check methods didn't modify how the iris data is processed by your code. They just let you check the data as it flows down the pipeline. That's the difference between Pandas .head() and Pandas Checks .check.head().
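To make the "non-invasive" idea concrete, here is a minimal sketch in plain Pandas, without Pandas Checks. The helper peek_head is hypothetical and is not part of either library. It prints a preview and then returns the DataFrame it received, which is roughly the behavior described above for .check.head(). The toy iris data is made up for illustration.

```python
import pandas as pd


def peek_head(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Hypothetical helper: print the first n rows, then pass the data along unchanged."""
    print(df.head(n))
    return df


# Made-up data for illustration only
iris = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "setosa"],
        "petal_length": [1.4, 1.3, 6.0, 1.5],
    }
)

# .head(2) truncates the data: the rest of the chain only sees 2 rows
truncated = iris.head(2).query("species == 'setosa'")

# .pipe(peek_head, 2) only displays 2 rows: the rest of the chain sees all 4
inspected = iris.pipe(peek_head, 2).query("species == 'setosa'")

print(len(truncated), len(inspected))
```

With .head(2) in the chain, the later query only ever sees two rows. With the pass-through helper, the query still sees the full dataset.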

Methods available

Here's what's in the doctor's bag.

Describe data

Standard Pandas methods:

New methods in Pandas Checks:

Export interim files

  • .check.write(): Export the current data, inferring file format from the name - DataFrame | Series

Time your code

  • .check.print_time_elapsed(start_time): Print the execution time since you called start_time = pdc.start_timer() - DataFrame | Series

💡 Tip: You can also use this stopwatch outside a method chain, anywhere in your Python code:

from pandas_checks import print_time_elapsed, start_timer

start_time = start_timer()
...
print_time_elapsed(start_time)
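For readers curious about the stopwatch pattern itself, here is a rough standard-library sketch. The functions start_stopwatch and format_elapsed are hypothetical stand-ins written for this example. They are not the Pandas Checks implementation, and the output format is an assumption.

```python
import time


def start_stopwatch() -> float:
    """Hypothetical stand-in for a start-timer call: return the current clock reading."""
    return time.perf_counter()


def format_elapsed(start_time: float, lead_in: str = "Time elapsed") -> str:
    """Hypothetical stand-in: describe how many seconds have passed since start_time."""
    elapsed = time.perf_counter() - start_time
    return f"{lead_in}: {elapsed:.4f} seconds"


start_time = start_stopwatch()
total = sum(range(100_000))  # Some work to time
message = format_elapsed(start_time)
print(message)
```

The start value is just a number, so it can be created once and passed to as many later timing calls as you like.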

Turn Pandas Checks on or off

These methods disable or re-enable any Pandas Checks methods that come after them, either temporarily for a single method chain or permanently, such as in a production environment.

  • .check.disable_checks(): Don't run checks. By default, still runs assertions. - DataFrame | Series
  • .check.enable_checks(): Run checks again. - DataFrame | Series

Validate data

Custom:

  • .check.assert_data(): Check that data passes an arbitrary condition - DataFrame | Series
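As an illustration of what "an arbitrary condition" can look like, here is a sketch in plain Pandas. The helper assert_condition is hypothetical. Its name, arguments, and exception type are assumptions made for this example and do not describe the signature of .check.assert_data().

```python
import pandas as pd


def assert_condition(df, condition, fail_message="Data did not pass the condition"):
    """Hypothetical helper: raise if condition(df) is falsy, else return df unchanged."""
    if not condition(df):
        raise ValueError(fail_message)
    return df


# Made-up data for illustration only
iris = pd.DataFrame({"species": ["setosa", "virginica"], "petal_length": [1.4, 6.0]})

# Passing condition: the chain continues with the same data
cleaned = iris.pipe(assert_condition, lambda df: (df["petal_length"] > 0).all())

# Failing condition: the chain stops with an error
try:
    iris.pipe(
        assert_condition,
        lambda df: df["species"].nunique() == 1,
        "Expected one species",
    )
    failed = False
except ValueError as error:
    failed = True
    print(error)
```

The condition is any callable that takes the data and returns something truthy or falsy, so one helper can cover many validation rules.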

Types:

Values:

Visualize data

Customizing a check

You can use Pandas Checks methods like the regular Pandas methods they mirror. They accept the same arguments. For example, you can write:

  • .check.head(7)
  • .check.value_counts(column="species", dropna=False, normalize=True)
  • .check.plot(kind="scatter", x="sepal_width", y="sepal_length")

Also, most Pandas Checks methods accept 3 additional arguments:

  1. check_name: text to display before the result of the check
  2. fn: a lambda function that modifies the data displayed by the check
  3. subset: limit a check to certain columns

(
    iris
    .check.value_counts(column="species", check_name="Varieties after data cleaning")

    # Do your regular Pandas data processing, like upper-casing the values in one column
    .assign(species=lambda df: df["species"].str.upper())

    # Modify the data that gets displayed in the check only
    .check.head(n=2, fn=lambda df: df["petal_width"]*2)

    # Only apply the check to certain columns
    .check.describe(subset=["sepal_width", "sepal_length"])
)
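The three extra arguments can be pictured with a small plain-Pandas sketch. The function describe_check below is hypothetical and written only for this example. In particular, the order in which it applies fn and subset is an assumption, not a statement about how Pandas Checks behaves.

```python
import pandas as pd


def describe_check(df, check_name=None, fn=None, subset=None):
    """Hypothetical check: show a summary of a view of df, then return df itself."""
    view = df if fn is None else fn(df)  # fn changes only what is displayed
    if subset is not None:
        view = view[subset]  # subset limits the displayed columns
    if check_name:
        print(check_name)  # check_name labels the output
    print(view.describe())
    return df  # The chain continues with the original data


# Made-up data for illustration only
iris = pd.DataFrame(
    {
        "sepal_width": [3.5, 3.0, 3.2],
        "sepal_length": [5.1, 4.9, 4.7],
        "species": ["setosa", "setosa", "setosa"],
    }
)

result = (
    iris
    .pipe(describe_check, check_name="Sepal summary", subset=["sepal_width", "sepal_length"])
    .pipe(describe_check, check_name="Doubled widths", fn=lambda df: df[["sepal_width"]] * 2)
)
```

Both calls change what is printed, but result still holds every original column and value.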



Power user output

Configuring Pandas Checks

Global configuration

You can change how Pandas Checks works everywhere. For example:

import pandas_checks as pdc

# Set output precision and turn off the cute emojis
pdc.set_format(precision=3, use_emojis=False)

# Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
pdc.disable_checks()

Run pdc.describe_options() to see the arguments you can pass to .set_format().

💡 Tip:
By default, disable_checks() and enable_checks() do not change whether Pandas Checks will run assertion methods (.check.assert_*).

To turn off assertions too, add the argument enable_asserts=False, such as: disable_checks(enable_asserts=False).
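The relationship between the two switches can be sketched with a toy model in pure Python. Everything here (ToySettings, toy_disable_checks, toy_check, toy_assert) is hypothetical and exists only to illustrate the behavior described in the tip. It is not how Pandas Checks stores its settings.

```python
from dataclasses import dataclass


@dataclass
class ToySettings:
    """Hypothetical stand-in for a global settings store."""

    checks_enabled: bool = True
    asserts_enabled: bool = True


settings = ToySettings()


def toy_disable_checks(enable_asserts: bool = True) -> None:
    """Switch descriptive checks off; keep assertions on unless told otherwise."""
    settings.checks_enabled = False
    settings.asserts_enabled = enable_asserts


def toy_check(message: str) -> bool:
    """A descriptive check: only prints while checks are enabled."""
    if settings.checks_enabled:
        print(message)
    return settings.checks_enabled


def toy_assert(condition: bool) -> bool:
    """An assertion: raises on failure while asserts are enabled, else is skipped."""
    if settings.asserts_enabled and not condition:
        raise ValueError("Assertion failed")
    return settings.asserts_enabled


toy_disable_checks()  # Default: checks off, assertions still on
default_state = (toy_check("row count: 150"), toy_assert(True))

toy_disable_checks(enable_asserts=False)  # Production-style: everything off
silent_state = (toy_check("row count: 150"), toy_assert(False))

print(default_state, silent_state)
```

In the first state the descriptive check is skipped while the assertion still runs. In the second state a failing assertion is skipped as well.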

Local configuration

You can also adjust settings within a method chain by bookending the chain, like this:

# Customize format during one method chain
(
    iris
    .check.set_format(precision=7, use_emojis=False)
    ... # Any .check methods in here will use the new format
    .check.reset_format() # Restore default format
)

# Turn off Pandas Checks during one method chain
(
    iris
    .check.disable_checks()
    ... # Any .check methods in here will not be run
    .check.enable_checks() # Turn it back on for any code that follows
)

💡 Tip: Hybrid EDA-Prod data processing

Exploratory data analysis (EDA) is traditionally thought of as the first step of data projects. But often when we're in production, we wish we could reuse parts of the EDA. Maybe we're debugging, editing prod code, or need to change the input data. Unfortunately, the original EDA code is often too stale to fire up again. The prod pipeline has changed too much.

If you used Pandas Checks during EDA, you can keep your .check methods in your first production code. In production, you can disable Pandas Checks by default and enable it only when you need it. This can make your prod pipeline more transparent and easier to inspect.

Giving feedback and contributing

If you run into trouble or have questions, I'd love to know. Please open an issue.

Contributions are appreciated! Please see more details.

License

Pandas Checks is licensed under the BSD-3 License.

🐼🩺

