
Analyze Pandas dataframes, and other tabular data (csv), to find subgroups of data with properties that diverge from those of the overall dataset

DivExplorer

Machine learning models may perform differently on different data subgroups. We propose the notion of divergence over itemsets (i.e., conjunctions of simple predicates) as a measure of different classification behavior on data subgroups, and the use of frequent pattern mining techniques for their identification. We quantify the contribution of different attribute values to divergence with the notion of Shapley values to identify both critical and peculiar behaviors of attributes. See our paper and our project page for all the details.
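As a rough illustration of the idea, the divergence of a subgroup can be read as the value of an outcome statistic on the subgroup minus its value on the whole dataset. The sketch below computes this by hand with plain pandas on a made-up table; the `divergence` helper and the toy data are hypothetical and are not part of the DivExplorer API.

```python
import pandas as pd

# Toy data (hypothetical): 'fp' is a 0/1 outcome for each row.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M", "F"],
    "age": ["<25", ">45", "<25", "<25", ">45", "<25"],
    "fp":  [0, 0, 1, 1, 0, 1],
})

def divergence(data, itemset, outcome):
    """Return (support, divergence) of the subgroup selected by `itemset`.

    `itemset` is a conjunction of attribute == value predicates, given as a dict.
    Divergence is the outcome mean on the subgroup minus the mean on all rows.
    """
    mask = pd.Series(True, index=data.index)
    for attribute, value in itemset.items():
        mask &= data[attribute] == value
    subgroup = data[mask]
    support = len(subgroup) / len(data)
    return support, subgroup[outcome].mean() - data[outcome].mean()

support, div = divergence(df, {"sex": "M", "age": "<25"}, "fp")
print(f"support={support:.3f}, divergence={div:+.3f}")
```

DivExplorer avoids evaluating subgroups one at a time like this: it relies on frequent pattern mining to enumerate all itemsets above a minimum support.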

Installation

Install using pip with:

pip install divexplorer

Alternatively, download a wheel or source archive from PyPI.

Example Notebooks

This notebook gives an example of how to use DivExplorer to find divergent subgroups in datasets and in the predictions of a classifier.

Documentation

For the code details, see the documentation.

The original paper is:

Looking for Trouble: Analyzing Classifier Behavior via Pattern Divergence. Eliana Pastor, Luca de Alfaro, Elena Baralis. In Proceedings of the 2021 ACM SIGMOD Conference, 2021.

You can find more papers and information in the DivExplorer project page.

Quick Start

DivExplorer works on Pandas DataFrames. Here we load an example dataset and discretize one of its attributes, the age, into coarser ten-year ranges.

import pandas as pd

df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
df_census["AGE_RANGE"] = df_census.apply(lambda row : 10 * (row["A_AGE"] // 10), axis=1)
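To make the discretization step concrete, the sketch below applies the same floor-division rule to a small made-up table and compares the row-wise `apply` form with a column-wise form that should produce the same bins. The `toy` frame is hypothetical; only the `A_AGE` column name comes from the example above.

```python
import pandas as pd

# Hypothetical ages, using the same column name as the census example.
toy = pd.DataFrame({"A_AGE": [7, 19, 20, 34, 65]})

# Row-wise form, as in the example above.
row_wise = toy.apply(lambda row: 10 * (row["A_AGE"] // 10), axis=1)

# Column-wise form: the same arithmetic applied to the whole column at once.
vectorized = 10 * (toy["A_AGE"] // 10)

print(vectorized.tolist())
```

Each age is mapped to the lower bound of its decade, so 19 falls in the 10 bin and 20 in the 20 bin. On large tables the column-wise form is usually much faster than a row-wise `apply`.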

We can then find the data subgroups that have highest income divergence, using the DivergenceExplorer class as follows:

from divexplorer import DivergenceExplorer

fp_diver = DivergenceExplorer(df_census)
subgroups = fp_diver.get_pattern_divergence(
    min_support=0.001,
    attributes=["STATE", "SEX", "EDUCATION", "AGE_RANGE"], 
    quantitative_outcomes=["PTOTVAL"])
subgroups.sort_values(by="PTOTVAL_div", ascending=False).head(10)

You can also prune redundant subgroups by specifying:

  • a threshold, so that attributes that don't increase the divergence by at least the threshold value are not included in subgroups,
  • a minimum t-value, to select only significant subgroups.

from divexplorer import DivergencePatternProcessor

processor = DivergencePatternProcessor(subgroups, "PTOTVAL")
pruned_subgroups = pd.DataFrame(processor.redundancy_pruning(th_redundancy=10000))
pruned_subgroups = pruned_subgroups[pruned_subgroups["PTOTVAL_t"] > 2]
pruned_subgroups.sort_values(by="PTOTVAL_div", ascending=False, ignore_index=True)
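For intuition about the t-value filter, one common way to compare a subgroup mean against a reference mean is Welch's t statistic, which does not assume equal variances. The sketch below is a standalone illustration under that assumption; whether the library computes `PTOTVAL_t` in exactly this way is not confirmed here, and `welch_t` is a hypothetical helper.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Absolute Welch t statistic for the difference between two sample means."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return abs(mean_a - mean_b) / standard_error

# Hypothetical incomes for a subgroup and for a reference group.
subgroup_income = [52000, 61000, 58000, 64000]
reference_income = [30000, 42000, 35000, 28000, 39000, 33000]
print(welch_t(subgroup_income, reference_income))
```

A larger value means the difference in means is large relative to its estimated standard error, which is why thresholding on the t-value keeps only subgroups whose divergence is unlikely to be noise.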

Finding subgroups with divergent performance in classifiers

For classifiers, it is often of interest to find the subgroups with the highest (or lowest) divergence in performance measures such as the false-positive rate. Here is how to do it for the false-positive rate of a COMPAS-derived classifier.

compas_df = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/compas_discretized.csv')

We generate an fp column whose average will give the false-positive rate, like so:

from divexplorer.outcomes import get_false_positive_rate_outcome

y_trues = compas_df["class"]
y_preds = compas_df["predicted"]

compas_df['fp'] =  get_false_positive_rate_outcome(y_trues, y_preds)

The fp column has values:

  • 1, if the instance is a false positive (class is 0 and predicted is 1),
  • 0, if the instance is a true negative (class is 0 and predicted is 0),
  • NaN, if the class is positive (class is 1).
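The encoding in the list above can be reproduced by hand, which helps to check that the mean of such a column is indeed the false-positive rate. The sketch below uses NumPy and pandas on made-up labels; it illustrates the encoding only and is not the implementation of `get_false_positive_rate_outcome`.

```python
import numpy as np
import pandas as pd

# Hypothetical ground-truth labels and predictions.
y_true = pd.Series([0, 0, 0, 1, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 0, 1])

# 1.0 for false positives, 0.0 for true negatives, NaN for actual positives.
fp = pd.Series(np.where(y_true == 0, (y_pred == 1).astype(float), np.nan))

# pandas skips NaN when averaging, so only the actual negatives are counted.
false_positive_rate = fp.mean()
print(fp.tolist(), false_positive_rate)
```

Here four instances are actual negatives and two of them are predicted positive, so the column mean is 0.5.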

We use NaN for class 1 data to exclude those rows from the average, so that the column average is the false-positive rate. We can then find the most divergent groups as in the previous example. Note that here we use boolean_outcomes rather than quantitative_outcomes, because fp is boolean:

fp_diver = DivergenceExplorer(compas_df)

attributes = ['race', '#prior', 'sex', 'age']
FP_fm = fp_diver.get_pattern_divergence(min_support=0.1, attributes=attributes, 
                                        boolean_outcomes=['fp'])
FP_fm.sort_values(by="fp_div", ascending=False).head(10)

Note how the attributes argument specifies which attributes can be used to define subgroups. The following example, from the example notebook, shows how to use quantitative_outcomes for a quantitative outcome.

df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
explorer = DivergenceExplorer(df_census)
value_subgroups = explorer.get_pattern_divergence(
    min_support=0.001, quantitative_outcomes=["PTOTVAL"])

Analyzing subgroups via Shapley values

Returning to our COMPAS example, if we want to analyze what factors contribute to the divergence of a particular subgroup, we can do so via Shapley values:

fp_details = DivergencePatternProcessor(FP_fm, 'fp')

# Select one pattern to analyze (here, the one at row 37 of the patterns table).
pattern = fp_details.patterns['itemset'].iloc[37]
fp_details.shapley_value(pattern)
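To see what such a decomposition means, the sketch below computes standard Shapley values for the items of a two-item pattern, given the divergence of every subset of the pattern. The divergence numbers are made up and `shapley_contributions` is a hypothetical helper; it illustrates the general Shapley formula, not the library's internal code.

```python
from itertools import combinations
from math import factorial

# Hypothetical divergences for every subset of a two-item pattern.
divergence = {
    frozenset(): 0.0,
    frozenset({"sex=Male"}): 0.02,
    frozenset({"age=<25"}): 0.10,
    frozenset({"sex=Male", "age=<25"}): 0.20,
}

def shapley_contributions(pattern, divergence):
    """Average marginal contribution of each item over all subsets of the others."""
    items = list(pattern)
    n = len(items)
    contributions = {}
    for item in items:
        others = [other for other in items if other != item]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = frozenset(subset)
                total += weight * (divergence[without | {item}] - divergence[without])
        contributions[item] = total
    return contributions

contributions = shapley_contributions(frozenset({"sex=Male", "age=<25"}), divergence)
print(contributions)
```

The contributions sum to the divergence of the full pattern, so each value can be read as the share of the subgroup's divergence attributable to that item.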

Pruning redundant subgroups

If you get too many subgroups, you can prune redundant ones via redundancy pruning. A pattern $\beta$ is pruned if there is a pattern $\alpha \subset \beta$ whose divergence differs from that of $\beta$ by less than a threshold: in that case, the extra items in $\beta$ add little information beyond $\alpha$.

df_pruned = fp_details.redundancy_pruning(th_redundancy=0.01)
df_pruned.sort_values("fp_div", ascending=False).head(5)
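The pruning rule described above can be sketched in a few lines of plain Python. The divergence values below are made up and `prune_redundant` is a hypothetical helper that follows the textual description; the library's `redundancy_pruning` may differ in details such as which subsets it compares against.

```python
def prune_redundant(divergence, threshold):
    """Drop each pattern that has a proper subset with nearly the same divergence."""
    kept = {}
    for beta, div_beta in divergence.items():
        redundant = any(
            alpha < beta and abs(div_beta - div_alpha) < threshold  # alpha < beta: proper subset
            for alpha, div_alpha in divergence.items()
        )
        if not redundant:
            kept[beta] = div_beta
    return kept

# Hypothetical patterns and their divergences.
divergence = {
    frozenset({"age=<25"}): 0.100,
    frozenset({"sex=Male"}): 0.020,
    frozenset({"age=<25", "sex=Male"}): 0.105,
    frozenset({"age=<25", "#prior=>3"}): 0.250,
}

kept = prune_redundant(divergence, threshold=0.01)
for pattern, value in kept.items():
    print(sorted(pattern), value)
```

In this toy input, adding sex=Male to age=<25 changes the divergence by only 0.005, so that two-item pattern is pruned, while the pattern that adds #prior=>3 is kept.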

Code Contributors

Project lead:

Other contributors:
