Shapley value transformations for behavioral data analysis
shapley_behaviors
Shapley value transformations for explainable behavioral data analysis.
Traditional clustering asks "which samples are similar?" but not "why do they cluster together?" Shapley behavioral transformations answer the "why" by decomposing statistical properties (variance, skewness, kurtosis, entropy) into individual sample contributions. Samples that cluster in behavioral space play the same statistical role in the dataset, so each cluster can be read mechanistically and acted on.
Implementation of the methodology from:
Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. Machine Learning: Engineering, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
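To make "decomposing a statistical property into individual sample contributions" concrete, here is a minimal sketch that computes exact Shapley values for the variance of a tiny one-dimensional dataset by enumerating every ordering of the samples. It is not the package's implementation: the value function (population variance of a subset, defined as zero for fewer than two samples) and the names `subset_variance` and `exact_shapley` are assumptions made for illustration, and exhaustive enumeration is only feasible for a handful of samples.

```python
from itertools import permutations
from statistics import pvariance


def subset_variance(values):
    """Assumed value function v(S): population variance, 0 for fewer than 2 samples."""
    return pvariance(values) if len(values) >= 2 else 0.0


def exact_shapley(data, value_fn=subset_variance):
    """Average each sample's marginal contribution over every ordering of the samples."""
    n = len(data)
    phi = [0.0] * n
    n_orderings = 0
    for order in permutations(range(n)):
        coalition = []
        previous = value_fn(coalition)
        for idx in order:
            coalition.append(data[idx])
            current = value_fn(coalition)
            phi[idx] += current - previous  # marginal contribution of sample idx
            previous = current
        n_orderings += 1
    return [p / n_orderings for p in phi]


# Toy data: three samples near 2.0, one moderately low value, one extreme value.
data = [1.0, 2.0, 2.1, 1.9, 6.0]
phi = exact_shapley(data)
for x, p in zip(data, phi):
    print(f"x = {x:4.1f}  contribution = {p:+.4f}")
```

The contributions sum to the variance of the full dataset (the Shapley efficiency property). In this toy example the extreme sample at 6.0 receives the largest positive value, while the three samples near 2.0 receive negative values.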
Features
- Decompose datasets into variance, skewness, kurtosis, and entropy behavioral spaces
- Parallel computation via joblib for large datasets
- Outlier detection directly in behavioral space
- Bundled interactive explorer scripts for Jupyter-based analysis with PCA plots, clustering statistics, and region-of-interest annotation
- Antithetic sampling for variance reduction in Monte Carlo estimation
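As an illustration of the last feature, the sketch below estimates Shapley values by Monte Carlo sampling of permutations and pairs every sampled permutation with its reverse, which is one common antithetic scheme. Whether the package pairs permutations in exactly this way is an assumption; the function names and the variance value function are hypothetical and chosen only to keep the example self-contained.

```python
import random
from statistics import pvariance


def subset_variance(values):
    """Assumed value function: population variance, 0 for fewer than 2 samples."""
    return pvariance(values) if len(values) >= 2 else 0.0


def marginal_contributions(data, order, value_fn):
    """Marginal contribution of each sample when samples are added in `order`."""
    contrib = [0.0] * len(data)
    coalition = []
    previous = value_fn(coalition)
    for idx in order:
        coalition.append(data[idx])
        current = value_fn(coalition)
        contrib[idx] = current - previous
        previous = current
    return contrib


def antithetic_shapley(data, n_pairs=50, seed=42, value_fn=subset_variance):
    """Monte Carlo Shapley estimate using each random permutation and its reverse."""
    rng = random.Random(seed)
    n = len(data)
    totals = [0.0] * n
    for _ in range(n_pairs):
        order = list(range(n))
        rng.shuffle(order)
        # A sample placed early in `order` is placed late in the reversed
        # ordering, so the two estimates tend to be negatively correlated.
        for perm in (order, order[::-1]):
            for i, c in enumerate(marginal_contributions(data, perm, value_fn)):
                totals[i] += c
    return [t / (2 * n_pairs) for t in totals]


data_rng = random.Random(0)
data = [data_rng.gauss(0.0, 1.0) for _ in range(30)]
estimates = antithetic_shapley(data, n_pairs=50, seed=42)
```

Because the marginal contributions along any single permutation telescope to the value of the full dataset, the estimates here always sum to the dataset variance; sampling error only affects how that total is split between samples.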
Installation
pip install shapley_behaviors
Quick Start
import numpy as np
from shapley_behaviors import ShapleyBehaviors
X = np.random.randn(500, 20) # (n_samples, n_features)
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
# Transform to a single behavioral space
Phi_variance = sb.transform(X, value_function='variance')
# Or compute all four spaces at once
behavioral_spaces = sb.transform_multiple(X)
# keys: 'variance', 'skewness', 'kurtosis', 'entropy'
Outlier Detection
from shapley_behaviors import identify_outliers
outlier_indices, outlier_scores = identify_outliers(Phi_variance, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")
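For intuition about what a z-score threshold does in behavioral space, the following sketch scores each sample by the Euclidean norm of its row of Shapley values and flags rows whose z-score exceeds the threshold. The scoring rule inside `identify_outliers` is not documented here, so the norm-based, one-sided score and the name `zscore_outliers` are assumptions rather than a description of the package.

```python
import math
from statistics import mean, pstdev


def zscore_outliers(phi_rows, threshold=2.5):
    """Flag rows whose norm lies more than `threshold` standard deviations above the mean."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in phi_rows]
    mu, sigma = mean(norms), pstdev(norms)
    if sigma == 0:
        # All rows have the same magnitude, so nothing stands out.
        return [], [0.0] * len(norms)
    scores = [(x - mu) / sigma for x in norms]
    indices = [i for i, z in enumerate(scores) if z > threshold]
    return indices, scores


# Toy behavioral space: 20 small rows plus one row with a large magnitude.
phi_rows = [[0.01 * ((i % 5) - 2), 0.02 * ((i % 3) - 1)] for i in range(20)]
phi_rows.append([1.5, -1.2])

indices, scores = zscore_outliers(phi_rows, threshold=2.5)
print(indices)
```

With only a few samples the largest achievable z-score is bounded by roughly the square root of the sample count, so a high threshold on a very small dataset can never flag anything.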
Understanding Behavioral Spaces
Each space answers a different question about the role of each sample in the dataset:
| Space | Positive values | Negative values | Use case |
|---|---|---|---|
| Variance | Stretchers — widen the distribution | Stabilizers — typical samples near the mean | Quality control, process instability |
| Skewness | Pull distribution above the mean | Pull distribution below the mean | Biased synthesis, directional drift |
| Kurtosis | Tail samples — rare extreme events | Core samples — predictable, well-behaved | Anomaly detection, reliability analysis |
| Entropy | High-information — rare, unique combinations | Low-information — common, redundant | Dataset curation, diversity quantification |
Hopkins Statistic
The Hopkins statistic H measures clustering tendency in behavioral space:
| H value | Interpretation |
|---|---|
| > 0.7 | Strong clustering — samples group by behavior |
| ≈ 0.5 | Random distribution — no natural grouping |
| < 0.3 | Regular/uniform distribution |
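A minimal sketch of one common formulation of the Hopkins statistic follows, using the convention that matches the table above (values near 1 indicate clustering). It compares nearest-neighbor distances from uniform random probes to the data against nearest-neighbor distances between real samples. The explorer scripts may differ in detail, for example by raising distances to the power of the dimension or by sampling differently, so the function `hopkins` and its parameters are assumptions.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbor in `others`."""
    return min(math.dist(point, o) for o in others)


def hopkins(points, n_probes=20, seed=0):
    """H = sum(u) / (sum(u) + sum(w)); near 1 clustered, near 0.5 random."""
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    # u: distance from a uniform probe in the bounding box to the nearest real sample.
    u_sum = 0.0
    for _ in range(n_probes):
        probe = tuple(rng.uniform(lows[d], highs[d]) for d in range(dims))
        u_sum += nearest_distance(probe, points)

    # w: distance from a sampled real point to its nearest other real sample.
    w_sum = 0.0
    for i in rng.sample(range(len(points)), n_probes):
        others = points[:i] + points[i + 1:]
        w_sum += nearest_distance(points[i], others)

    return u_sum / (u_sum + w_sum)


data_rng = random.Random(1)
clustered = [(data_rng.gauss(0.0, 0.1), data_rng.gauss(0.0, 0.1)) for _ in range(50)]
clustered += [(data_rng.gauss(10.0, 0.1), data_rng.gauss(10.0, 0.1)) for _ in range(50)]
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]

h_clustered = hopkins(clustered)
h_grid = hopkins(grid)
print(f"clustered: {h_clustered:.3f}, regular grid: {h_grid:.3f}")
```

Two tight, well-separated clusters push H close to 1, while the regular grid gives a value below 0.5 because every real sample has a neighbor at distance 1 and random probes are always closer than that to some grid point.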
Convenience Functions
from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy,
)
Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)
Explorer Scripts
The package bundles two standalone Jupyter-compatible scripts for comprehensive analysis. Copy them to your working directory:
from shapley_behaviors import copy_scripts
copy_scripts(".") # all scripts
copy_scripts("./analysis", scripts=["behavioral_space_explorer"]) # one script
Behavioral Space Explorer
Full dataset exploration — PCA plots, Hopkins statistics, outlier detection:
SEED = 42
N_PERMUTATIONS = 1000 # 100 for quick tests, 1000 for publication
N_JOBS = -1
DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
SELECTED_FEATURES = ["feature1", "feature2"] # optional highlight
%run -i behavioral_space_explorer.py
Outputs:
| File | Contents |
|---|---|
| {name}_behavioral_spaces.npy | All four behavioral transformations |
| {name}_behave_{space}_{label}.png | PCA plots colored by each label |
| {name}_hopkins_statistics.csv | Clustering tendency metrics |
| {name}_clustering_statistics.csv | Variance explained, pairwise distances |
| {name}_outliers_{space}.csv | Outlier samples per space |
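The PCA plots and the "variance explained" figures listed above can be understood through a short sketch of a two-component PCA projection using a singular value decomposition. This is a stand-in for whatever the explorer script does internally (it may scale features first or use a different library), and the matrix below is synthetic random data, not real package output; `pca_project` is a hypothetical helper.

```python
import numpy as np


def pca_project(phi, n_components=2):
    """Project rows of `phi` onto the leading principal components."""
    centered = phi - phi.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return coords, explained[:n_components]


# Synthetic stand-in for a behavioral space of shape (n_samples, n_features).
rng = np.random.default_rng(42)
phi = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

coords, explained = pca_project(phi)
print(coords.shape, explained)
```

The returned `coords` are what a PC1 versus PC2 scatter plot would display, and `explained` holds the fraction of total variance captured by each of the two components.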
Behavioral Region Explorer
Targeted analysis of specific PCA regions:
BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined" # or "separate"
USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red",
    },
}
%run -i behavioral_region_explorer.py
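A region defined by `pc1_range` and `pc2_range` amounts to a rectangular selection in the PCA plane. The sketch below shows that selection on a few hand-written coordinates. It assumes the ranges are inclusive bounds on the raw PCA coordinates, which the configuration above does not state, and `samples_in_region` is a hypothetical helper rather than part of the package.

```python
def samples_in_region(pca_coords, region):
    """Return indices of samples whose (PC1, PC2) fall inside the region's rectangle."""
    lo1, hi1 = region["pc1_range"]
    lo2, hi2 = region["pc2_range"]
    return [
        i
        for i, (pc1, pc2) in enumerate(pca_coords)
        if lo1 <= pc1 <= hi1 and lo2 <= pc2 <= hi2
    ]


region = {
    "space": "variance",
    "pc1_range": (0.3, 0.6),
    "pc2_range": (-0.2, 0.2),
}

# Illustrative coordinates only: inside, PC1 too low, PC2 too high, on the corner.
pca_coords = [(0.45, 0.0), (0.1, 0.1), (0.5, 0.3), (0.3, -0.2)]
selected = samples_in_region(pca_coords, region)
print(selected)
```

The selected indices can then be used to pull the matching rows from the original dataset and compare their feature values against the rest.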
Parameters
| Parameter | Values | Notes |
|---|---|---|
| n_permutations | 50–100 (explore), 200–500 (standard), 1000+ (publication) | Higher = more accurate, slower |
| n_jobs | -1 (all cores), 1 (debug), N (N cores) | Parallelises over features |
| random_state | any int | Set for reproducibility; uses antithetic sampling |
Runtime Estimates
| Dataset size | n_permutations | Estimated time |
|---|---|---|
| 500 samples | 100 | 2–5 min |
| 500 samples | 1000 | 15–30 min |
| 4000 samples | 100 | 20–30 min |
| 4000 samples | 1000 | 2–3 hours |
API Reference
# Main class
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])
# Outlier detection
outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')
Troubleshooting
| Problem | Solution |
|---|---|
| ImportError | pip install shapley_behaviors |
| Long runtime | Reduce n_permutations to 100 for testing |
| Memory error | Reduce n_jobs or process data in batches |
| High additivity error warning | Increase n_permutations |
| H ≈ 0.5 (no clustering) | Data may lack natural behavioral groupings |
Citation
@article{liu2025shapley,
author = {Liu, Tommy and Barnard, Amanda S.},
title = {Understanding interpretable patterns of {Shapley} behaviours in materials data},
journal = {Machine Learning: Engineering},
volume = {1},
pages = {015004},
year = {2025},
doi = {10.1088/3049-4761/adaaf6}
}
Links
MIT License — Copyright © 2024 Amanda S. Barnard and Tommy Liu