Monitor the stability of a pandas or spark dataset

Project description

popmon is a Python package for checking the stability of a dataset. It works with both pandas and Spark datasets.

popmon creates histograms of features binned in time slices and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional; for example, it can also track correlations between any two features. Using monitoring business rules, popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, and changing correlations.
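To make the idea concrete, here is a minimal standard-library sketch of time-sliced histograms compared against a reference. It is not popmon's implementation: the total variation distance stands in for popmon's statistical tests, and the function names and the yellow/red thresholds are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict
from datetime import date


def time_sliced_histograms(records, slice_key):
    """Group (day, category) records into one category histogram per time slice."""
    slices = defaultdict(Counter)
    for day, category in records:
        slices[slice_key(day)][category] += 1
    return dict(slices)


def total_variation(hist, reference):
    """Total variation distance between two normalised histograms (0 = identical)."""
    n_hist, n_ref = sum(hist.values()), sum(reference.values())
    labels = set(hist) | set(reference)
    # Counter returns 0 for labels that are missing from one of the histograms.
    return 0.5 * sum(abs(hist[k] / n_hist - reference[k] / n_ref) for k in labels)


def traffic_light(distance, yellow=0.2, red=0.4):
    """Hypothetical alerting rule: map a distance to a green/yellow/red flag."""
    if distance >= red:
        return "red"
    if distance >= yellow:
        return "yellow"
    return "green"


# Toy data: one gender label per day; the mix is balanced in January and
# February and shifts strongly in March.
records = (
    [(date(2020, 1, d), "m") for d in range(1, 11)]
    + [(date(2020, 1, d), "f") for d in range(1, 11)]
    + [(date(2020, 2, d), "m") for d in range(1, 11)]
    + [(date(2020, 2, d), "f") for d in range(1, 11)]
    + [(date(2020, 3, d), "m") for d in range(1, 20)]
    + [(date(2020, 3, d), "f") for d in range(1, 2)]
)

monthly = time_sliced_histograms(records, slice_key=lambda day: (day.year, day.month))
reference = monthly[(2020, 1)]  # assumption: the first slice serves as the reference

flags = {
    month: traffic_light(total_variation(hist, reference))
    for month, hist in sorted(monthly.items())
}
print(flags)  # {(2020, 1): 'green', (2020, 2): 'green', (2020, 3): 'red'}
```

In this toy setup only the March slice is flagged, because its label mix (19 to 1) differs from the balanced reference by a distance of 0.45.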

Documentation

The complete popmon documentation, including tutorials, is available on Read the Docs.

Examples

Notebooks

The following tutorial notebooks are available, each of which can also be opened in Colab:

  • Basic tutorial
  • Detailed example (featuring configuration, Apache Spark and more)
  • Incremental datasets (online analysis)

Check it out

The popmon library requires Python 3.6+ and can be installed with pip. To get started, run:

$ pip install popmon

or check out the code from our GitHub repository:

$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon

In this example the code is installed in editable mode (option -e), so changes to the checked-out source take effect without reinstalling.

You can now use the package in Python with:

import popmon

Congratulations, you are now ready to use the popmon library!
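If you want to confirm which version pip installed, one option is to query the package metadata from the standard library. This is a small sketch, not part of popmon; the helper name installed_version is hypothetical, and importlib.metadata is only available from Python 3.8 onwards.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed popmon version, or None when popmon is not installed.
print(installed_version("popmon"))
```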

Quick run

As a quick example, you can do:

import pandas as pd
import popmon
from popmon import resources

# open synthetic data
df = pd.read_csv(resources.data('test.csv.gz'), parse_dates=['date'])
df.head()

# generate stability report using automatic binning of the selected features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis='date', features=['date:age', 'date:gender'])

# to show the output of the report in a Jupyter notebook you can simply run:
report

# or save the report to file and open in a browser
report.to_file("monitoring_report.html")

To specify your own binning and the features you want to report on, use:

# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')

# histogram selections. Here 'date' is the first axis of each histogram.
features=[
    'date:isActive', 'date:age', 'date:eyeColor', 'date:gender',
    'date:latitude', 'date:longitude', 'date:isActive:age'
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs={
    'longitude': {'bin_width': 5.0, 'bin_offset': 0.0},
    'latitude': {'bin_width': 5.0, 'bin_offset': 0.0},
    'age': {'bin_width': 10.0, 'bin_offset': 0.0},
    'date': {'bin_width': pd.Timedelta('4w').value,
             'bin_offset': pd.Timestamp('2015-1-1').value}
}

# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)
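The bin_width and bin_offset settings above can be read as follows. This is a sketch that assumes the usual floor convention for open-ended bins, index = floor((value - bin_offset) / bin_width); the helper names bin_index and to_ns are hypothetical and not part of popmon. It also shows how the nanosecond values for the time axis come about, using only the standard library.

```python
import math
from datetime import datetime, timedelta, timezone

NS_PER_SECOND = 1_000_000_000


def bin_index(value, bin_width, bin_offset=0.0):
    """Index of the open-ended bin containing value (assumed floor convention)."""
    return math.floor((value - bin_offset) / bin_width)


def to_ns(moment):
    """Nanoseconds since the Unix epoch for a timezone-aware datetime."""
    delta = moment - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (delta.days * 86_400 + delta.seconds) * NS_PER_SECOND + delta.microseconds * 1_000


# Numeric features: bins extend in both directions, so negative indices are fine.
print(bin_index(37, bin_width=10.0))    # 3, i.e. the bin [30, 40)
print(bin_index(-2.5, bin_width=5.0))   # -1, i.e. the bin [-5, 0)

# Time axis: four-week bins starting at 2015-01-01, expressed in nanoseconds.
width_ns = int(timedelta(weeks=4).total_seconds()) * NS_PER_SECOND
offset_ns = to_ns(datetime(2015, 1, 1, tzinfo=timezone.utc))
print(width_ns)    # 2419200000000000
print(offset_ns)   # 1420070400000000000

# 2015-02-01 lies 31 days after the offset, so it falls in the second bin.
print(bin_index(to_ns(datetime(2015, 2, 1, tzinfo=timezone.utc)), width_ns, offset_ns))  # 1
```

Because the bins are defined by a width and an offset rather than a fixed range, no minimum or maximum has to be known in advance; only bins that actually receive data need to be stored.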

These examples also work with Spark dataframes. You can see the output of this example notebook code here. For all available examples, please see the tutorials at read-the-docs.

Project contributors

This package was authored by ING Wholesale Banking Advanced Analytics. Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.

Contact and support

Please note that ING WBAA provides support only on a best-effort basis.

License

Copyright ING WBAA. popmon is completely free, open-source and licensed under the MIT license.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

popmon-0.3.8.tar.gz (314.4 kB)

Uploaded Source

Built Distribution

popmon-0.3.8-py3-none-any.whl (371.7 kB)

Uploaded Python 3

File details

Details for the file popmon-0.3.8.tar.gz.

File metadata

  • Download URL: popmon-0.3.8.tar.gz
  • Upload date:
  • Size: 314.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for popmon-0.3.8.tar.gz

  • SHA256: ed134eafa9a6ff5f50ad47c006784183ae6cd4a7acb36dc42534017c79967cf0
  • MD5: 734f5727b86a2525c754fe717fecf34e
  • BLAKE2b-256: f1a05f1a35284f438fc217de988706b0d2e34fd96f56851c8b4219a96e425fc2

File details

Details for the file popmon-0.3.8-py3-none-any.whl.

File metadata

  • Download URL: popmon-0.3.8-py3-none-any.whl
  • Upload date:
  • Size: 371.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.3

File hashes

Hashes for popmon-0.3.8-py3-none-any.whl

  • SHA256: 689ed3f2f2d61d627442ee41e4d30d85d1d2c6e702d1ba5f591335fd2e158a79
  • MD5: 85ac3ce63eb9d73dcdbef160efd5fc8d
  • BLAKE2b-256: fef1896308d5a311eea5f01a490ca7c728f28f0955bc1c9a6cadfe672a3480e0
