Analyze Pandas dataframes, and other tabular data (csv), to find subgroups of data with properties that diverge from those of the overall dataset
DivExplorer
Machine learning models may perform differently on different data subgroups. We propose the notion of divergence over itemsets (i.e., conjunctions of simple predicates) as a measure of different classification behavior on data subgroups, and the use of frequent pattern mining techniques for their identification. We quantify the contribution of different attribute values to divergence with the notion of Shapley values to identify both critical and peculiar behaviors of attributes. See our paper and our project page for all the details.
Installation
Install using pip with:
pip install divexplorer
or, download a wheel or source archive from PyPI.
Example Notebooks
This notebook gives an example of how to use DivExplorer to find divergent subgroups in datasets and in the predictions of a classifier. You can also run the notebook directly on Colab.
Quick Start
DivExplorer works on Pandas DataFrames. Here we load an example dataset and discretize one of its attributes into coarser ranges.
import pandas as pd
# Load the example census dataset.
df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
# Bucket ages into decades (e.g. 37 -> 30) so that age can define subgroups.
df_census["AGE_RANGE"] = 10 * (df_census["A_AGE"] // 10)
We can then find the data subgroups that have the highest income divergence, using the DivergenceExplorer class as follows:
from divexplorer import DivergenceExplorer
fp_diver = DivergenceExplorer(df_census)
subgroups = fp_diver.get_pattern_divergence(min_support=0.001, quantitative_outcomes=["PTOTVAL"])
subgroups.sort_values(by="PTOTVAL_div", ascending=False).head(10)
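To make the notion of divergence concrete, the following is a minimal sketch that assumes divergence is the subgroup's mean outcome minus the mean over the whole dataset, and that support is the fraction of rows a subgroup covers. The toy rows, attribute names, and brute-force enumeration are illustrative assumptions; DivExplorer itself relies on frequent pattern mining rather than this loop.

```python
from itertools import combinations

# Toy rows: two categorical attributes and a numeric outcome (all hypothetical).
rows = [
    {"sex": "F", "age": 30, "income": 10.0},
    {"sex": "F", "age": 40, "income": 20.0},
    {"sex": "M", "age": 30, "income": 30.0},
    {"sex": "M", "age": 40, "income": 60.0},
]
attributes = ["sex", "age"]
outcome = "income"
min_support = 0.25


def mean(values):
    return sum(values) / len(values)


overall = mean([r[outcome] for r in rows])

# Every attribute=value pair seen in the data is a candidate item.
items = sorted({(a, r[a]) for r in rows for a in attributes}, key=str)

results = {}
for size in (1, 2):
    for itemset in combinations(items, size):
        # Skip itemsets that constrain the same attribute twice.
        if len({a for a, _ in itemset}) < size:
            continue
        # Outcomes of the rows that satisfy every predicate in the itemset.
        covered = [r[outcome] for r in rows if all(r[a] == v for a, v in itemset)]
        support = len(covered) / len(rows)
        if covered and support >= min_support:
            results[frozenset(itemset)] = (support, mean(covered) - overall)

# List subgroups from most positive to most negative divergence.
for itemset, (support, div) in sorted(results.items(), key=lambda kv: -kv[1][1]):
    print(sorted(itemset, key=str), support, div)
```

Raising min_support in this sketch drops the two-attribute subgroups first, which mirrors how the min_support argument limits the subgroups that are reported.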
Finding subgroups with divergent performance in classifiers
For classifiers, it may be of interest to find the subgroups with the highest (or lowest) divergence in performance measures such as the false-positive rate. Here is how to do it for the false-positive rate of a COMPAS-derived classifier.
compas_df = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/compas_discretized.csv')
We generate an fp column whose average will give the false-positive rate, like so:
from divexplorer.outcomes import get_false_positive_rate_outcome
y_trues = compas_df["class"]
y_preds = compas_df["predicted"]
compas_df['fp'] = get_false_positive_rate_outcome(y_trues, y_preds)
The fp column has values:
- 1, if the data is a false positive (class is 0 and predicted is 1).
- 0, if the data is a true negative (class is 0 and predicted is 0).
- NaN, if the class is positive (class is 1).
We use NaN for class 1 data, to exclude those data from the average, so that the column average is the false-positive rate.
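The encoding can be checked on a small example. The sketch below uses a hypothetical helper, false_positive_outcome, that follows the three rules listed above; it is not the library's get_false_positive_rate_outcome. The NaN-skipping average is written out by hand here; pandas' Series.mean() skips NaN by default, which is what makes the column average equal the false-positive rate.

```python
import math


def false_positive_outcome(y_true, y_pred):
    """Hypothetical re-implementation of the encoding described in the text."""
    out = []
    for t, p in zip(y_true, y_pred):
        if t == 1:
            # Positive class: excluded from the false-positive rate.
            out.append(math.nan)
        else:
            # Negative class: 1 for a false positive, 0 for a true negative.
            out.append(1.0 if p == 1 else 0.0)
    return out


def nan_mean(values):
    """Average that ignores NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0]

fp = false_positive_outcome(y_true, y_pred)
fpr = nan_mean(fp)

# Cross-check against the textbook definition FP / (FP + TN).
n_fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
n_tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
assert fpr == n_fp / (n_fp + n_tn)
```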
We can then find the most divergent groups as in the previous example, noting that here we use boolean_outcomes rather than quantitative_outcomes because fp is boolean:
fp_diver = DivergenceExplorer(compas_df)
attributes = ['race', '#prior', 'sex', 'age']
FP_fm = fp_diver.get_pattern_divergence(min_support=0.1, attributes=attributes,
boolean_outcomes=['fp'])
FP_fm.sort_values(by="fp_div", ascending=False).head(10)
Note how we specify the attributes that can be used to define subgroups.
Analyzing subgroups via Shapley values
If we want to analyze what factors contribute to the divergence of a particular subgroup, we can do so via Shapley values:
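# Assumes DivergencePatternProcessor is exported at the package top level, like DivergenceExplorer.
from divexplorer import DivergencePatternProcessor
fp_details = DivergencePatternProcessor(FP_fm, 'fp')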
pattern = fp_details.patterns['itemset'].iloc[37]
fp_details.shapley_value(pattern)
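As an illustration of what such an analysis computes, here is a minimal sketch of the standard Shapley value formula applied to the items of a pattern, with divergence playing the role of the value function. The function shapley_values, the item names, and the divergence numbers are all hypothetical, and this is not necessarily how the library implements shapley_value. The computation needs the divergence of every subset of the pattern, which frequent pattern mining provides because every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations
from math import factorial

# Hypothetical divergences for every subset of a two-item pattern.
divergence = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.25,
    frozenset({"B"}): 0.125,
    frozenset({"A", "B"}): 0.5,
}


def shapley_values(pattern, divergence):
    """Shapley contribution of each item to the divergence of `pattern`."""
    n = len(pattern)
    values = {}
    for item in pattern:
        rest = pattern - {item}
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain `item`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(sorted(rest), k):
                s = frozenset(subset)
                # Marginal change in divergence when `item` joins the coalition.
                total += weight * (divergence[s | {item}] - divergence[s])
        values[item] = total
    return values


pattern = frozenset({"A", "B"})
values = shapley_values(pattern, divergence)
print(values)
```

The contributions sum to the divergence of the full pattern minus that of the empty pattern, so they can be read as a split of the pattern's divergence among its items.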
Pruning redundant subgroups
If you get too many subgroups, you can prune redundant ones via redundancy pruning. This prunes a pattern $\beta$ if there is a pattern $\alpha$, a subset of $\beta$, whose divergence differs from that of $\beta$ by less than a threshold.
df_pruned = fp_details.redundancy_pruning(th_redundancy=0.01)
df_pruned.sort_values("fp_div", ascending=False).head(5)
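The pruning rule can be sketched on a handful of patterns. The code below is one reading of the rule under stated assumptions: the difference is taken in absolute value, the comparison is strict, and only non-empty proper subsets present in the mined table are considered. The function prune_redundant and the divergence numbers are hypothetical and do not reproduce the library's redundancy_pruning.

```python
from itertools import combinations

# Hypothetical mined patterns and their divergences.
divergence = {
    frozenset({"a"}): 0.25,
    frozenset({"b"}): 0.0625,
    frozenset({"c"}): 0.125,
    frozenset({"a", "b"}): 0.2578125,  # adding "b" barely changes the divergence of "a"
    frozenset({"a", "c"}): 0.5,        # adding "c" changes it substantially
}


def prune_redundant(divergence, threshold):
    """Keep only patterns that no proper subset already explains."""
    kept = {}
    for pattern, div in divergence.items():
        redundant = any(
            frozenset(sub) in divergence
            and abs(div - divergence[frozenset(sub)]) < threshold
            for k in range(1, len(pattern))
            for sub in combinations(pattern, k)
        )
        if not redundant:
            kept[pattern] = div
    return kept


pruned = prune_redundant(divergence, threshold=0.01)
for pattern, div in sorted(pruned.items(), key=lambda kv: -kv[1]):
    print(sorted(pattern), div)
```

With this data only the pattern containing "a" and "b" is dropped, since its divergence is within the threshold of the divergence of "a" alone.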
Papers
The original paper is:
Looking for Trouble: Analyzing Classifier Behavior via Pattern Divergence. Eliana Pastor, Luca de Alfaro, Elena Baralis. In Proceedings of the 2021 ACM SIGMOD Conference, 2021.
You can find more papers on the project page.
Code Contributors
Project lead:
Other contributors:
Refer to CONTRIBUTING.md for info on contributing and releases/pre-releases.
Download files
File details
Details for the file DivExplorer-0.2.1.tar.gz.
File metadata
- Download URL: DivExplorer-0.2.1.tar.gz
- Upload date:
- Size: 17.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | aeca4221c889ddc0708bcc3fe3946019cdf234a4ec4d404a5c789cb92c973c6f
MD5 | 9d2f525af029f8d922a06ca03e2d9ae9
BLAKE2b-256 | dc7addc0a5e3448846e718a7727b223a574e54769821a8f86d0ff53908765ecb
File details
Details for the file DivExplorer-0.2.1-py3-none-any.whl.
File metadata
- Download URL: DivExplorer-0.2.1-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | ab57c53ecbebffe5d1459a36e7334ce71e12a094953940c2313cc5b746825f6a
MD5 | b491489e813bd0e58a38b37bfcacd7d6
BLAKE2b-256 | 1d53f28b351a6fdb8113497302e9b99cc5151790d69e021c7305040663aa774f