Shapley value transformations for behavioral data analysis

shapley_behaviors

Shapley value transformations for explainable behavioral data analysis.

Traditional clustering asks "which samples are similar?" but not "why do they cluster together?". Shapley behavioral transformations answer the "why" by decomposing statistical properties (variance, skewness, kurtosis, entropy) into individual sample contributions. Samples that cluster in behavioral space share the same statistical role in the dataset, which gives the clusters a mechanistic, actionable interpretation.

Implementation of the methodology from:

Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. Machine Learning: Engineering, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
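
For orientation, the textbook Shapley value is written below with the samples as players N (with n = |N|) and a statistic v (variance, skewness, kurtosis, or entropy) as the value function. That the package applies this definition feature by feature is an assumption made here for illustration; the cited paper is the authority on the exact formulation.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
\qquad
\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset)
```

The second identity (efficiency) says the per-sample contributions add up to the statistic of the full dataset, which is what makes the decomposition a decomposition.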


Features

  • Decompose datasets into variance, skewness, kurtosis, and entropy behavioral spaces
  • Parallel computation via joblib for large datasets
  • Outlier detection directly in behavioral space
  • Bundled interactive explorer scripts for Jupyter-based analysis with PCA plots, clustering statistics, and region-of-interest annotation
  • Antithetic sampling for variance reduction in Monte Carlo estimation
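
To make the last bullet concrete, here is a minimal sketch of Monte Carlo Shapley estimation with antithetic permutations, for the variance of a single feature. It is not the package's implementation: the names `variance_value` and `shapley_variance_1d` are hypothetical, pairing each permutation with its reverse is just one common antithetic scheme, and treating coalitions of fewer than two samples as having zero variance is an assumption.

```python
import numpy as np


def variance_value(values):
    """Population variance of a coalition; 0.0 for fewer than 2 samples (assumption)."""
    return float(np.var(values)) if len(values) > 1 else 0.0


def shapley_variance_1d(x, n_permutations=100, seed=0):
    """Estimate each sample's Shapley contribution to the variance of x.

    Hypothetical helper: each random permutation is paired with its reverse
    (an antithetic pair), so a sample that enters late in one ordering
    enters early in the other.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    phi = np.zeros(n)
    n_pairs = max(1, n_permutations // 2)
    for _ in range(n_pairs):
        perm = rng.permutation(n)
        for order in (perm, perm[::-1]):
            prev = 0.0
            for k, i in enumerate(order, start=1):
                cur = variance_value(x[order[:k]])
                phi[i] += cur - prev  # marginal contribution of sample i
                prev = cur
    return phi / (2 * n_pairs)


x = np.array([0.0, 0.1, -0.1, 5.0])
phi = shapley_variance_1d(x, n_permutations=100, seed=42)
```

Because the marginal contributions along any ordering telescope, the estimates sum to the variance of the full sample regardless of how many permutations are drawn. This naive version recomputes the variance at every step and is only practical for small inputs.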

Installation

pip install shapley_behaviors

Quick Start

import numpy as np
from shapley_behaviors import ShapleyBehaviors

X = np.random.randn(500, 20)  # (n_samples, n_features)

sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)

# Transform to a single behavioral space
Phi_variance = sb.transform(X, value_function='variance')

# Or compute all four spaces at once
behavioral_spaces = sb.transform_multiple(X)
# keys: 'variance', 'skewness', 'kurtosis', 'entropy'

Outlier Detection

from shapley_behaviors import identify_outliers

outlier_indices, outlier_scores = identify_outliers(Phi_variance, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")
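
As a rough picture of what z-score thresholding in behavioral space can look like, the sketch below scores each sample by the Euclidean norm of its row and flags rows whose standardized norm exceeds the threshold. This is an illustration only: `zscore_outliers` is a hypothetical helper, and the scoring rule used by `identify_outliers` may differ.

```python
import numpy as np


def zscore_outliers(phi, threshold=3.0):
    """Flag rows whose norm deviates from the mean norm by more than
    `threshold` standard deviations (hypothetical scoring rule)."""
    norms = np.linalg.norm(phi, axis=1)
    std = norms.std()
    if std > 0:
        scores = (norms - norms.mean()) / std
    else:
        scores = np.zeros_like(norms)
    return np.flatnonzero(np.abs(scores) > threshold), scores


rng = np.random.default_rng(0)
phi_demo = rng.normal(size=(200, 5))
phi_demo[7] = 50.0  # plant one extreme sample
indices, scores = zscore_outliers(phi_demo, threshold=3.0)
```

Lowering the threshold flags more samples; which value is appropriate depends on the dataset.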

Understanding Behavioral Spaces

Each space answers a different question about the role of each sample in the dataset:

| Space | Positive values | Negative values | Use case |
|---|---|---|---|
| Variance | Stretchers: widen the distribution | Stabilizers: typical samples near the mean | Quality control, process instability |
| Skewness | Pull distribution above the mean | Pull distribution below the mean | Biased synthesis, directional drift |
| Kurtosis | Tail samples: rare extreme events | Core samples: predictable, well-behaved | Anomaly detection, reliability analysis |
| Entropy | High-information: rare, unique combinations | Low-information: common, redundant | Dataset curation, diversity quantification |
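
The statistics behind these spaces can be sketched as plain NumPy value functions. These are illustrative definitions, not the package's: the function names are hypothetical, the moment-based skewness and excess kurtosis are the standard population forms, and the histogram-based entropy with 10 bins is an assumption, since the entropy estimator is not specified here.

```python
import numpy as np


def skewness(values):
    """Third standardized moment; 0.0 when the spread is zero."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if values.size < 2 or std == 0:
        return 0.0
    return float(np.mean(((values - values.mean()) / std) ** 3))


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0.0 when the spread is zero."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if values.size < 2 or std == 0:
        return 0.0
    return float(np.mean(((values - values.mean()) / std) ** 4) - 3.0)


def histogram_entropy(values, bins=10):
    """Shannon entropy (in nats) of a histogram of the values (assumed estimator)."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())
```

Any of these can stand in for the value function v when estimating per-sample contributions: a sample's contribution is how much the statistic changes, on average, when that sample joins a subset.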

Hopkins Statistic

The Hopkins statistic H measures clustering tendency in behavioral space:

| H value | Interpretation |
|---|---|
| > 0.7 | Strong clustering: samples group by behavior |
| ≈ 0.5 | Random distribution: no natural grouping |
| < 0.3 | Regular/uniform distribution |
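
For readers who want to see how such a value arises, here is a small NumPy sketch of the Hopkins statistic under the convention used above (values near 1 indicate clustering). It is not the code used by the explorer script: `hopkins_statistic` is a hypothetical helper, the probe count defaults to 10% of the samples by assumption, and raw nearest-neighbour distances are used as a common simplification of the original formulation, which raises distances to the power of the dimension.

```python
import numpy as np


def hopkins_statistic(X, m=None, seed=0):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)) (hypothetical helper).

    u: distance from each uniform random probe to its nearest real sample.
    w: distance from each sampled real point to its nearest other real sample.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)

    # Uniform probes drawn inside the bounding box of the data
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(probes[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # Real points sampled without replacement; mask out self-distances
    idx = rng.choice(n, size=m, replace=False)
    dist = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), idx] = np.inf
    w = dist.min(axis=1)

    return float(u.sum() / (u.sum() + w.sum()))


rng = np.random.default_rng(1)
clustered = np.vstack([
    rng.normal(0.0, 0.05, size=(100, 2)),
    rng.normal(5.0, 0.05, size=(100, 2)),
])
H = hopkins_statistic(clustered)
```

On two tight, well-separated blobs like these, the uniform probes land far from any sample while real points sit close to their neighbours, so H comes out close to 1.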

Convenience Functions

from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy,
)

Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)

Explorer Scripts

The package bundles two standalone Jupyter-compatible scripts for comprehensive analysis. Copy them to your working directory:

from shapley_behaviors import copy_scripts

copy_scripts(".")                                         # all scripts
copy_scripts("./analysis", scripts=["behavioral_space_explorer"])  # one script

Behavioral Space Explorer

Full dataset exploration — PCA plots, Hopkins statistics, outlier detection:

SEED = 42
N_PERMUTATIONS = 1000      # 100 for quick tests, 1000 for publication
N_JOBS = -1

DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
SELECTED_FEATURES = ["feature1", "feature2"]  # optional highlight

%run -i behavioral_space_explorer.py

Outputs:

| File | Contents |
|---|---|
| {name}_behavioral_spaces.npy | All four behavioral transformations |
| {name}_behave_{space}_{label}.png | PCA plots colored by each label |
| {name}_hopkins_statistics.csv | Clustering tendency metrics |
| {name}_clustering_statistics.csv | Variance explained, pairwise distances |
| {name}_outliers_{space}.csv | Outlier samples per space |

Behavioral Region Explorer

Targeted analysis of specific PCA regions:

BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined"  # or "separate"

USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red",
    },
}

%run -i behavioral_region_explorer.py
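
Conceptually, a region is a rectangle in the first two principal components, and the samples of interest are those whose PCA scores fall inside it. The sketch below shows that selection with plain NumPy. The helpers `pca_2d` and `samples_in_region` are hypothetical, and whether the bundled script uses raw or rescaled PCA scores is not stated here, so the ranges are only meaningful relative to the coordinates you actually plot.

```python
import numpy as np


def pca_2d(phi):
    """Project rows of phi onto their first two principal components (via SVD)."""
    centered = phi - phi.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def samples_in_region(scores, pc1_range, pc2_range):
    """Indices of samples whose (PC1, PC2) scores lie inside the rectangle."""
    in_pc1 = (scores[:, 0] >= pc1_range[0]) & (scores[:, 0] <= pc1_range[1])
    in_pc2 = (scores[:, 1] >= pc2_range[0]) & (scores[:, 1] <= pc2_range[1])
    return np.flatnonzero(in_pc1 & in_pc2)


# Hand-written scores to show the selection rule (bounds are inclusive)
demo_scores = np.array([[0.4, 0.0], [0.7, 0.0], [0.5, 0.3], [0.3, -0.2]])
selected = samples_in_region(demo_scores, pc1_range=(0.3, 0.6), pc2_range=(-0.2, 0.2))

# Projection of a random stand-in for a behavioral space
projected = pca_2d(np.random.default_rng(0).normal(size=(50, 6)))
```

The returned indices can then be used to pull the matching rows from the original data for inspection.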

Parameters

| Parameter | Values | Notes |
|---|---|---|
| n_permutations | 50–100 (explore), 200–500 (standard), 1000+ (publication) | Higher = more accurate, slower |
| n_jobs | -1 (all cores), 1 (debug), N (N cores) | Parallelises over features |
| random_state | any int | Set for reproducibility; uses antithetic sampling |

Runtime Estimates

| Dataset size | n_permutations | Estimated time |
|---|---|---|
| 500 samples | 100 | 2–5 min |
| 500 samples | 1000 | 15–30 min |
| 4000 samples | 100 | 20–30 min |
| 4000 samples | 1000 | 2–3 hours |

API Reference

# Main class
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])

# Outlier detection
outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')

Troubleshooting

| Problem | Solution |
|---|---|
| ImportError | pip install shapley_behaviors |
| Long runtime | Reduce n_permutations to 100 for testing |
| Memory error | Reduce n_jobs or process data in batches |
| High additivity error warning | Increase n_permutations |
| H ≈ 0.5 (no clustering) | Data may lack natural behavioral groupings |

Citation

@article{liu2025shapley,
  author  = {Liu, Tommy and Barnard, Amanda S.},
  title   = {Understanding interpretable patterns of {Shapley} behaviours in materials data},
  journal = {Machine Learning: Engineering},
  volume  = {1},
  pages   = {015004},
  year    = {2025},
  doi     = {10.1088/3049-4761/adaaf6}
}

MIT License — Copyright © 2024 Amanda S. Barnard and Tommy Liu
