Non-invasive health checks for Pandas method chains

🐼🩺 Pandas Checks

Introduction

Pandas Checks is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains.

It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.

So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.

As Fleetwood Mac says, you would never break the chain.

💡 Tip:
See the full documentation for all the details on the what, why, and how of Pandas Checks.

Installation

Pandas Checks supports Python versions 3.8-3.12.

To install:

pip install pandas-checks

Usage

After installing Pandas Checks, import it:

import pandas as pd
import pandas_checks

Now you can use .check on your Pandas DataFrames and Series. You don't need to access pandas_checks directly; just work with Pandas as you normally would. The new Pandas Checks methods are available when you work with Pandas in Jupyter, IPython, and terminal environments.

Here's a basic example of using Pandas Checks:

iris = pd.read_csv('iris.csv')

iris_new = (
    iris
    .check.assert_data(lambda df: (df['sepal_width'] > 0).all(), fail_message="Sepal width can't be negative")  # Validate your data
    # ... Do your data processing in here ...
    .check.hist(column='petal_length')  # Plot a distribution
    .check.head(3)  # Display the first few rows
)

The .check methods will display their results as the chain runs: here, a histogram of petal_length and the first three rows of the data.

ⓘ Note:
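These methods did not modify iris, and they did not change what flows down the chain. That's the difference between Pandas .head() and Pandas Checks .check.head(): .head() returns only the first rows, while .check.head() displays them and passes the full DataFrame along.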
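
To make that distinction concrete, here is a minimal sketch in plain Pandas, without Pandas Checks. The helper show_head and the sample values are hypothetical, written only for this illustration; this is not how the library is implemented. The helper prints a preview and hands the original DataFrame back to the chain, while a bare .head() truncates whatever flows downstream.

```python
import pandas as pd

# Small stand-in for the iris data (values are made up for illustration)
iris = pd.DataFrame(
    {
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    }
)


def show_head(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Hypothetical helper: print the first n rows, then return df unchanged."""
    print(df.head(n))
    return df


# Plain .head() truncates the data for every later step in the chain
truncated = iris.head(3).assign(
    ratio=lambda df: df["petal_length"] / df["sepal_width"]
)

# A pass-through check only displays; later steps still receive all rows
passed_through = iris.pipe(show_head, n=3).assign(
    ratio=lambda df: df["petal_length"] / df["sepal_width"]
)

print(len(truncated), len(passed_through))
```

The two row counts differ because only the first chain lost rows at the preview step.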

Methods available

Here's what's in the doctor's bag.

Describe
- Standard Pandas methods:
  - .check.columns()
  - .check.dtypes() (.check.dtype for Series)
  - .check.describe()
  - .check.head()
  - .check.info()
  - .check.memory_usage()
  - .check.nunique()
  - .check.shape()
  - .check.tail()
  - .check.unique()
  - .check.value_counts()
- New functions in Pandas Checks:
  - .check.function(): Apply an arbitrary lambda function to your data and see the result
  - .check.ncols()
  - .check.ndups()
  - .check.nnulls()
  - .check.print(): Print a string, a variable, or the current dataframe
Export interim files
- .check.write(): Export the current data, inferring file format from the name
Time your code
- .check.print_time_elapsed(start_time): Print the execution time since you called start_time = pdc.start_timer(), where pdc is the alias from import pandas_checks as pdc
- Tip: You can also use the stopwatch outside a method chain:
```
from pandas_checks import print_elapsed_time, start_timer

start_time = start_timer()
...
print_elapsed_time(start_time, units="seconds")
```
Turn off Pandas Checks
- .check.disable_checks(): Don't run checks in this method chain, for example in production mode
- .check.enable_checks(): Turn checks back on
Validate
- .check.assert_data(): Perform assertions on your data in the middle of a chain
Visualize
- .check.hist(): Histogram
- .check.plot(): An arbitrary plot
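
As a rough illustration of what an in-chain assertion amounts to, the sketch below uses only plain Pandas and .pipe(). The helper assert_that, the sample data, and the choice of ValueError are assumptions made for this example; they do not describe what .check.assert_data() does internally or which exception it raises.

```python
import pandas as pd


def assert_that(df, condition, fail_message="Data assertion failed"):
    """Hypothetical helper: raise if condition(df) is falsy, else pass df through."""
    if not condition(df):
        raise ValueError(fail_message)
    return df


# Made-up measurements, including one invalid value
measurements = pd.DataFrame({"sepal_width": [3.5, 3.0, -1.0]})

# Passing case: the invalid row is filtered out first, so the chain continues
cleaned = (
    measurements
    .query("sepal_width > 0")
    .pipe(
        assert_that,
        lambda df: (df["sepal_width"] > 0).all(),
        "Sepal width must be positive",
    )
)

# Failing case: the chain stops at the assertion
try:
    measurements.pipe(
        assert_that,
        lambda df: (df["sepal_width"] > 0).all(),
        "Sepal width must be positive",
    )
    failed = False
except ValueError as error:
    failed = True
    print(f"Assertion failed: {error}")
```

Because the helper returns the DataFrame it received, it can sit anywhere in a chain without changing the result.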

Customizing results

You can use Pandas Checks methods like the regular Pandas methods. They accept the same arguments. For example, you can pass:

- .check.head(7)
- .check.value_counts(column="species", dropna=False, normalize=True)
- .check.plot(kind="scatter", x="sepal_width", y="sepal_length")

Also, most Pandas Checks methods accept 3 additional arguments:

- check_name: text to display before the result of the check
- fn: a lambda function that modifies the data displayed by the check
- subset: limit a check to certain columns

iris_new = (
    iris
    .check.value_counts(column='species', check_name="Varieties after data cleaning")
    .assign(species=lambda df: df["species"].str.upper()) # Do your regular Pandas data processing, like upper-casing the species column
    .check.head(n=2, fn=lambda df: df["petal_width"]*2) # Modify the data that gets displayed in the check only
    .check.describe(subset=['sepal_width', 'sepal_length'])  # Only check certain columns
)
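
One way to picture these three arguments is the plain-Pandas sketch below. The function check_describe is hypothetical, and the order in which it applies fn and subset is an assumption of this sketch, not a statement about the library's internals. The point it illustrates is that fn and subset shape only what is displayed, while the chain keeps receiving the original data.

```python
import pandas as pd


def check_describe(df, subset=None, fn=None, check_name=None):
    """Hypothetical pass-through check illustrating check_name, fn, and subset."""
    view = df
    if fn is not None:
        view = fn(view)  # Derive the data to display; df itself is not reassigned
    if subset is not None:
        view = view[subset]  # Limit the displayed columns
    if check_name:
        print(check_name)
    print(view.describe())
    return df  # The chain always receives the original data


# Made-up sample rows standing in for the iris data
iris = pd.DataFrame(
    {
        "sepal_width": [3.5, 3.0, 3.2],
        "sepal_length": [5.1, 4.9, 4.7],
        "species": ["setosa", "setosa", "setosa"],
    }
)

result = (
    iris
    .pipe(
        check_describe,
        subset=["sepal_width"],
        fn=lambda df: df.assign(sepal_width=df["sepal_width"] * 2),
        check_name="Sepal widths, doubled for display only",
    )
    .assign(species=lambda df: df["species"].str.upper())
)
```

The doubled widths appear only in the printed summary; result still holds the original sepal_width values.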

Global configuration

You can customize Pandas Checks:

import pandas_checks as pdc

# Set output precision and turn off the cute emojis
pdc.set_format(precision=3, use_emojis=False)

# Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
pdc.disable_checks()

💡 Tip:
Run pdc.describe_options() to see the arguments you can pass to .set_format().

You can also adjust settings within a method chain. Note that this changes the global configuration, so if you want the new settings to apply only within that chain, reset them at the end.

# Customize format
iris_new = (
    iris
    .check.set_format(precision=7, use_emojis=False)
    # ... Any .check methods in here will use the new format
    .check.reset_format() # Restore default format
)

# Turn off Pandas Checks
iris_new = (
    iris
    .check.disable_checks()
    # ... Any .check methods in here will not be run
    .check.enable_checks() # Turn it back on for the next code
)

Giving feedback and contributing

If you run into trouble or have questions, I'd love to know. Please open an issue.

Contributions are appreciated! Please see the full documentation for more details.

License

Pandas Checks is licensed under the BSD-3 License.

🐼🩺

Project details

Release history

- 0.2.2: Jul 9, 2024
- 0.2.1: Jun 25, 2024
- 0.2.0: Jun 25, 2024
- 0.1.8: Jun 22, 2024
- 0.1.7: Jun 22, 2024
- 0.1.6: Jun 21, 2024
- 0.1.5: Jun 21, 2024 (this version)
- 0.1.4: Jun 21, 2024
- 0.1.3: Jun 21, 2024
- 0.1.1: Jun 21, 2024
- 0.1.0: Jun 21, 2024

Download files

Source Distribution

pandas_checks-0.1.5.tar.gz (22.1 kB), uploaded Jun 21, 2024, Source

Built Distribution

pandas_checks-0.1.5-py3-none-any.whl (24.5 kB), uploaded Jun 21, 2024, Python 3

Hashes for pandas_checks-0.1.5.tar.gz

Algorithm	Hash digest
SHA256	`9a83cff77be2a70dcb3f730e2898ef17c74fa06da19a45edbe59a2bab39af3b2`
MD5	`7b0ebc09c291d5d79dd5a6bf1e472c42`
BLAKE2b-256	`08cd6c67879effc834257e1a8a1e2afb0c5f16c7fb3b466652c14056985d8765`

Hashes for pandas_checks-0.1.5-py3-none-any.whl

Algorithm	Hash digest
SHA256	`6058bb067c312f7dfcc77594335301ae7b81e5bce41215b333c82e201ab1c306`
MD5	`ae6152895dfa6651063b313ce8773430`
BLAKE2b-256	`dc07c129101dd8a6438ba3185949a49fd7b02695b438e0d2554b8c35239fc520`