Monitor the stability of a pandas or spark dataset
Project description
popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.
popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc, using monitoring business rules.
Announcements
Spark 3.0
With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar files:
spark = SparkSession.builder.config(
"spark.jars.packages",
"io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
).getOrCreate()
For Spark 2.X compiled against scala 2.11, in the string above simply replace 2.12 with 2.11.
Examples
Documentation
The entire popmon documentation including tutorials can be found at read-the-docs.
Notebooks
Tutorial |
Colab link |
---|---|
Detailed example (featuring configuration, Apache Spark and more) |
|
Check it out
The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:
$ pip install popmon
or check out the code from our GitHub repository:
$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon
where in this example the code is installed in edit mode (option -e).
You can now use the package in Python with:
import popmon
Congratulations, you are now ready to use the popmon library!
Quick run
As a quick example, you can do:
import pandas as pd
import popmon
from popmon import resources
# open synthetic data
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()
# generate stability report using automatic binning of all encountered features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])
# to show the output of the report in a Jupyter notebook you can simply run:
report
# or save the report to file
report.to_file("monitoring_report.html")
To specify your own binning specifications and features you want to report on, you do:
# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(
time_axis="date", time_width="1w", time_offset="2020-1-6"
)
# histogram selections. Here 'date' is the first axis of each histogram.
features = [
"date:isActive",
"date:age",
"date:eyeColor",
"date:gender",
"date:latitude",
"date:longitude",
"date:isActive:age",
]
# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs = {
"longitude": {"bin_width": 5.0, "bin_offset": 0.0},
"latitude": {"bin_width": 5.0, "bin_offset": 0.0},
"age": {"bin_width": 10.0, "bin_offset": 0.0},
"date": {
"bin_width": pd.Timedelta("4w").value,
"bin_offset": pd.Timestamp("2015-1-1").value,
},
}
# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)
These examples also work with spark dataframes. You can see the output of such example notebook code here. For all available examples, please see the tutorials at read-the-docs.
Pipelines for monitoring dataset shift
Advanced users can leverage popmon’s modular data pipeline to customize their workflow. Visualization of the pipeline can be useful when debugging, or for didactic purposes. There is a script included with the package that you can use. The plotting is configurable, and depending on the options you will obtain a result that can be used for understanding the data flow, the high-level components and the (re)use of datasets.
Example pipeline visualization (click to enlarge)
Reports and integrations
The data shift computations that popmon performs, are by default displayed in a self-contained HTML report. This format is favourable in many real-world environments, where access may be restricted. Moreover, reports can be easily shared with others.
Access to the datastore means that its possible to integrate popmon in almost any workflow. To give an example, one could store the histogram data in a PostgreSQL database and load that from Grafana and benefit from their visualisation and alert handling features (e.g. send an email or slack message upon alert). This may be interesting to teams that are already invested in particular choice of dashboarding tool.
Possible integrations are:
Grafana |
Kibana |
Resources on how to integrate popmon are available in the examples directory. Contributions of additional or improved integrations are welcome!
Resources
Presentations
Title |
Host |
Date |
Speaker |
Popmon - population monitoring made easy |
February 25, 2021 |
Simon Brugman |
|
Popmon - population monitoring made easy |
October 29, 2020 |
Max Baak, Simon Brugman |
|
Popmon - population monitoring made easy |
October 16, 2020 |
Max Baak |
|
July 8 2020 |
Tomas Sostak |
||
June 16, 2020 |
Tomas Sostak |
||
Popmon: Population Shift Monitoring Made Easy |
June 4, 2020 |
Max Baak |
Articles
Title |
Date |
Author |
April 16, 2022 |
Jeanine Schoonemann |
|
April 15, 2022 |
Jurriaan Nagelkerke and Jeanine Schoonemann |
|
November 9, 2022 |
Simon Brugman |
|
Population Shift Analysis: Monitoring Data Quality with Popmon |
May 21, 2021 |
Vito Gentile |
Popmon Open Source Package — Population Shift Monitoring Made Easy |
May 20, 2020 |
Nicole Mpozika |
Software
Kedro-popmon is a plugin to integrate popmon reporting with kedro. This plugin allows you to automate the process of popmon feature and output stability monitoring. Package created by Marian Dabrowski and Stephane Collot.
Project contributors
This package was authored by ING Wholesale Banking Advanced Analytics. Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.
Contact and support
Issues & Ideas & Support: https://github.com/ing-bank/popmon/issues
Please note that ING WBAA provides support only on a best-effort basis.
License
Copyright ING WBAA. popmon is completely free, open-source and licensed under the MIT license.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.