
A Python package for defensive data analysis.


Bulwark's Documentation


Bulwark is a package for convenient property-based testing of pandas dataframes.

Documentation: https://bulwark.readthedocs.io/en/latest/index.html

This project was heavily influenced by the no-longer-supported Engarde library by Tom Augspurger (thanks for the head start, Tom!), which itself was modeled after the R library assertr.

Why?

Data are messy, and pandas is one of the go-to libraries for analyzing tabular data. In the real world, data analysts and scientists often feel like they don't have the time or energy to think of and write tests for their data. Bulwark's goal is to let you check that your data meets your assumptions of what it should look like at any (and every) step in your code, without making you work too hard.

Installation

pip install bulwark

or

conda install -c conda-forge bulwark

Note that the latest version of Bulwark is only compatible with newer versions of Python, NumPy, and pandas. This policy encourages upgrades that can themselves help minimize bugs, lets Bulwark take advantage of the latest language and library features, reduces the technical debt of maintaining Bulwark, and is consistent with NumPy's community version support recommendation in NEP 29. See the table below for officially supported versions:

Bulwark    Python    Numpy     Pandas
0.6.0      >=3.6     >=1.15    >=0.23.0
<=0.5.3    >=3.5     >=1.8     >=0.16.2
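
To check an environment against the 0.6.0 row before upgrading, a comparison along these lines can help. This is only a sketch: `parse_version` and `meets_minimum` are hypothetical helpers, not part of Bulwark, and they only understand plain dotted numeric versions (anything after the leading digits of a component is ignored).

```python
import re

# Minimums for Bulwark 0.6.0, copied from the table above.
MINIMUMS_0_6_0 = {"python": "3.6", "numpy": "1.15", "pandas": "0.23.0"}


def parse_version(version):
    """Turn '1.15.4' into (1, 15, 4), keeping the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first part that has no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    installed_parts = parse_version(installed)
    minimum_parts = parse_version(minimum)
    # Pad to equal length so that '1.15' compares equal to '1.15.0'.
    width = max(len(installed_parts), len(minimum_parts))
    installed_parts += (0,) * (width - len(installed_parts))
    minimum_parts += (0,) * (width - len(minimum_parts))
    return installed_parts >= minimum_parts
```

For example, `meets_minimum(pd.__version__, MINIMUMS_0_6_0["pandas"])` would report whether the installed pandas is new enough. Comparing integer tuples rather than raw strings matters here, since "3.10" sorts before "3.6" as text.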

Usage

Bulwark comes with checks for many of the common assumptions you might want to validate for the functions that make up your ETL pipeline, and lets you toss those checks as decorators on the functions you're already writing:

    import bulwark.decorators as dc

    @dc.IsShape((-1, 10))
    @dc.IsMonotonic(strict=True)
    @dc.HasNoNans()
    def compute(df):
        # complex operations to determine result
        ...
        return result_df
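
For readers curious about what such a decorator does, here is a minimal sketch of the general pattern, assuming the check runs on the DataFrame the function returns. It is not Bulwark's implementation: `has_no_nans_sketch` and `checked` are hypothetical names used only for illustration.

```python
import functools

import pandas as pd


def has_no_nans_sketch(df):
    """Hypothetical stand-in for a check: raise if any value is missing."""
    if df.isna().any().any():
        raise AssertionError("DataFrame contains NaNs.")
    return df


def checked(check, *check_args, **check_kwargs):
    """Build a decorator that runs `check` on the wrapped function's result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            check(result, *check_args, **check_kwargs)  # raises on failure
            return result
        return wrapper
    return decorator


@checked(has_no_nans_sketch)
def fill_missing(df):
    return df.fillna(0)


@checked(has_no_nans_sketch)
def passthrough(df):
    return df


raw = pd.DataFrame({"a": [1.0, None, 3.0]})
cleaned = fill_missing(raw)  # passes: the result has no NaNs

try:
    passthrough(raw)  # fails: the result still contains a NaN
    failed = False
except AssertionError:
    failed = True
```

The function body stays free of validation code; the assumption about its output lives in one line above the definition.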

Still want to have more robust test files? Bulwark's got you covered there, too, with importable functions.

    import bulwark.checks as ck

    df.pipe(ck.has_no_nans)
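
Checks written in this style take the DataFrame as their first argument and return it, which is what lets them slot into a `DataFrame.pipe` chain alongside ordinary transformations. The sketch below shows the idea with two hypothetical checks; `is_not_empty` and `has_columns` are illustrative and are not Bulwark functions.

```python
import pandas as pd


def is_not_empty(df):
    """Hypothetical check: raise if the DataFrame has no rows."""
    if len(df) == 0:
        raise AssertionError("df is empty.")
    return df


def has_columns(df, columns):
    """Hypothetical check: raise if any expected column is missing."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise AssertionError(f"df is missing columns: {missing}")
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Each check hands the DataFrame on, so ordinary steps can follow it.
total = (
    df.pipe(is_not_empty)
      .pipe(has_columns, ["a", "b"])
      .assign(c=lambda frame: frame["a"] + frame["b"])
)
```

Note that `pipe` receives the function itself plus any extra arguments; it supplies the DataFrame as the first argument when it calls the function.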

Won't I have to go clean up all those decorators when I'm ready to go to production? Nope - just toggle the built-in "enabled" flag available for every decorator.

    @dc.IsShape((3, 2), enabled=False)
    def compute(df):
        # complex operations to determine result
        ...
        return result_df
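
One way to picture the flag is a decorator that always calls the wrapped function but only runs its check when `enabled` is true. The sketch below is an illustration under that assumption, not Bulwark's code: `LenLongerThan` is a hypothetical decorator, and the `DATA_CHECKS` environment variable is an invented convention for switching checks off in one place.

```python
import functools
import os


class LenLongerThan:
    """Hypothetical decorator: check that the result has more than `n` items."""

    def __init__(self, n, enabled=True):
        self.n = n
        self.enabled = enabled

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The check is skipped entirely when the flag is off.
            if self.enabled and len(result) <= self.n:
                raise AssertionError("result is not as long as expected.")
            return result
        return wrapper


# Assumed convention: checks stay on unless DATA_CHECKS is set to "0".
CHECKS_ENABLED = os.environ.get("DATA_CHECKS", "1") != "0"


@LenLongerThan(10, enabled=False)
def load_unchecked():
    return [1, 2, 3]


@LenLongerThan(10, enabled=True)
def load_checked():
    return [1, 2, 3]


@LenLongerThan(2, enabled=CHECKS_ENABLED)
def load_rows():
    return [1, 2, 3]
```

Here `load_unchecked()` returns its short list without complaint, while `load_checked()` raises an `AssertionError`. Passing a single module-level flag such as `CHECKS_ENABLED` to every decorator lets the checks be turned off together without editing each function.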

What if the test I want isn't part of the library? Use the built-in CustomCheck to use your own custom function!

    import bulwark.checks as ck
    import bulwark.decorators as dc
    import numpy as np
    import pandas as pd

    def len_longer_than(df, l):
        if len(df) <= l:
            raise AssertionError("df is not as long as expected.")
        return df

    @dc.CustomCheck(len_longer_than, 10, enabled=False)
    def append_a_df(df, df2):
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
        return pd.concat([df, df2], ignore_index=True)

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})

    append_a_df(df, df2)  # doesn't fail because the check is disabled
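
To see what the disabled check would have reported, the custom function can be called directly on the combined frame. This sketch needs only pandas and NumPy, and it uses `pd.concat` to combine the frames so that it also runs on recent pandas versions.

```python
import numpy as np
import pandas as pd


def len_longer_than(df, l):
    if len(df) <= l:
        raise AssertionError("df is not as long as expected.")
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})
combined = pd.concat([df, df2], ignore_index=True)

print(len(combined))  # 7

try:
    len_longer_than(combined, 10)
except AssertionError as err:
    print(err)  # df is not as long as expected.
```

The combined frame has 7 rows, which is not more than 10, so the check raises. A threshold of 6 would pass and hand the frame back unchanged.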

What if I want to run a lot of tests and see all the errors at once? You can use the built-in MultiCheck. It collects all of the errors and either displays a warning message or throws an exception, depending on the warn flag. You can even use custom functions with MultiCheck:

    def len_longer_than(df, l):
        if len(df) <= l:
            raise AssertionError("df is not as long as expected.")
        return df

    # `checks` takes a dict of function: dict of params for that function.
    # Note that those function params EXCLUDE df.
    # Also note that when you use MultiCheck, there's no need to use CustomCheck - just feed in the function.
    @dc.MultiCheck(checks={ck.has_no_nans: {"columns": None},
                           len_longer_than: {"l": 6}},
                   warn=False)
    def append_a_df(df, df2):
        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
        return pd.concat([df, df2], ignore_index=True)

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [4, 5, 6, 7]})

    append_a_df(df, df2)
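
The collect-then-report behavior can be sketched in a few lines. The code below is an illustration of the idea rather than Bulwark's implementation: `run_all_checks` and `has_no_none` are hypothetical, and plain lists stand in for DataFrames so the sketch runs without pandas.

```python
import warnings


def run_all_checks(data, checks, warn=False):
    """Run every check, collect the failures, then warn or raise once."""
    errors = []
    for check, params in checks.items():
        try:
            check(data, **params)
        except AssertionError as err:
            errors.append(f"{check.__name__}: {err}")
    if errors:
        message = "; ".join(errors)
        if warn:
            warnings.warn(message)
        else:
            raise AssertionError(message)
    return data


def has_no_none(rows):
    if any(value is None for value in rows):
        raise AssertionError("rows contain None.")
    return rows


def len_longer_than(rows, l):
    if len(rows) <= l:
        raise AssertionError("rows are not as long as expected.")
    return rows


# Same shape as the `checks` argument: function -> params, excluding the data.
checks = {has_no_none: {}, len_longer_than: {"l": 6}}
rows = [1, None, 3]

try:
    run_all_checks(rows, checks, warn=False)
except AssertionError as err:
    report = str(err)  # one message that names both failed checks
```

Because every check runs before anything is raised, a single run surfaces both problems instead of stopping at the first. With `warn=True` the same message is emitted through `warnings.warn` and the data is returned.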

See the examples for more advanced usage.

Contributing

Bulwark is always looking for new contributors! We work hard to make contributing as easy as possible, and previous open source experience is not required! Please see contributing.md for how to get started.

Thank you to all our past contributors, especially these folks:

