shapley_behaviors
Shapley value transformations for explainable behavioral data analysis.
Overview
Traditional clustering asks "which samples are similar?" but not "why do they cluster together?"
Shapley behavioral transformations answer the "why" by decomposing statistical properties (variance, skewness, kurtosis, entropy) into individual sample contributions. Samples that cluster in behavioral space share the same statistical role in the dataset, providing mechanistic and actionable insights.
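As a sketch of the underlying idea (an illustration only; the package's value functions and estimator follow the paper cited below), the standard permutation estimator of Shapley values can decompose the variance of a single feature into per-sample contributions:
import numpy as np

def shapley_variance_contributions(x, n_permutations=200, random_state=0):
    # Monte Carlo Shapley decomposition of the variance of a 1-D feature.
    # Each sample's value is its average marginal effect on the coalition variance.
    rng = np.random.default_rng(random_state)
    n = len(x)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        order = rng.permutation(n)
        prev = 0.0  # value of the empty coalition
        for k in range(n):
            coalition = x[order[:k + 1]]
            current = coalition.var() if k > 0 else 0.0
            phi[order[k]] += current - prev
            prev = current
    return phi / n_permutations

x = np.random.default_rng(1).normal(size=50)
phi = shapley_variance_contributions(x)
print(np.isclose(phi.sum(), x.var()))  # efficiency: contributions sum to the total variance
By construction the per-sample contributions sum to the statistic itself, which is what makes the decomposition interpretable sample by sample.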
This package implements the methodology from:
Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. Machine Learning: Engineering, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
Installation
pip install shapley_behaviors
Quick Start
import numpy as np
from shapley_behaviors import ShapleyBehaviors
# Load your data (n_samples, n_features)
X = np.random.randn(500, 20)
# Transform to behavioral spaces
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi_variance = sb.transform(X, value_function='variance')
Phi_skewness = sb.transform(X, value_function='skewness')
Phi_kurtosis = sb.transform(X, value_function='kurtosis')
Phi_entropy = sb.transform(X, value_function='entropy')
# Or compute all at once
behavioral_spaces = sb.transform_multiple(X)
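The transforms return per-sample arrays that can be fed straight into standard tools. For example, a quick PCA view of one behavioral space, assuming Phi_variance has shape (n_samples, n_features), using scikit-learn and matplotlib:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the variance behavioral space onto its first two principal components
pcs = PCA(n_components=2).fit_transform(Phi_variance)
plt.scatter(pcs[:, 0], pcs[:, 1], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Variance behavioral space")
plt.show()
This is the same kind of view the explorer scripts below produce automatically.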
Outlier Detection
from shapley_behaviors import identify_outliers
outlier_indices, outlier_scores = identify_outliers(Phi_kurtosis, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")
Getting the Explorer Scripts
The package includes standalone explorer scripts for comprehensive analysis with visualizations, statistics, and outlier detection. Copy them to your working directory:
from shapley_behaviors import copy_scripts
# Copy all scripts to current directory
copy_scripts(".")
# Or copy to a specific directory
copy_scripts("./analysis")
# Or copy only one script
copy_scripts(".", scripts=["behavioral_space_explorer"])
Behavioral Space Explorer
Configure and run in Jupyter:
# Configuration
SEED = 42
N_PERMUTATIONS = 1000 # 100 for quick tests, 1000 for publication
N_JOBS = -1 # -1 uses all CPU cores
DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
# Optional: select specific features to highlight
SELECTED_FEATURES = ["feature1", "feature2", "feature3"]
# Run the explorer
%run -i behavioral_space_explorer.py
The explorer generates:
- {name}_behavioral_spaces.npy - All behavioral transformations
- {name}_behave_{space}_{label}.png - PCA plots colored by each label
- {name}_hopkins_statistics.csv - Clustering tendency metrics
- {name}_clustering_statistics.csv - Variance explained, pairwise distances
- {name}_outliers_{space}.csv - Outlier samples for each space
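To reuse the saved transformations outside the explorer, something like the following may work; it assumes the .npy file stores a dictionary of arrays keyed by space name, which is an assumption about the script's output format rather than documented behaviour:
import numpy as np

# Hypothetical reload of the explorer's saved behavioral spaces
spaces = np.load("behavioral_exploration/mydata_behavioral_spaces.npy", allow_pickle=True)
if spaces.dtype == object:
    spaces = spaces.item()  # assumed: a dict keyed by space name, e.g. spaces["variance"]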
Behavioral Region Explorer
For targeted analysis of specific regions in behavioral space:
# Basic configuration
DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined" # or "separate"
# Define regions of interest in PCA space
USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red"
    },
    "entropy_outliers": {
        "space": "entropy",
        "pc1_range": (-0.5, -0.2),
        "pc2_range": (0.1, 0.4),
        "description": "Low entropy samples",
        "color": "blue"
    }
}
# Run the region explorer
%run -i behavioral_region_explorer.py
Understanding Behavioral Spaces
Variance Space: Decomposes how each sample contributes to feature spread. Negative values indicate stabilizers (typical samples near the mean). Positive values indicate stretchers (extreme samples that widen the distribution). Use case: quality control, identifying process instability.
Skewness Space: Decomposes how each sample contributes to distributional asymmetry. Negative values pull the distribution below the mean. Positive values pull it above the mean. Near-zero values maintain symmetry. Use case: detecting biased synthesis, directional process drift.
Kurtosis Space: Decomposes how each sample contributes to tail heaviness. Negative values indicate core samples (predictable, well-behaved). Positive values indicate tail samples (rare extreme events). Use case: risk assessment, anomaly detection, reliability analysis.
Entropy Space: Decomposes how each sample contributes to information content. Positive values indicate high-information samples (rare, unique feature combinations). Negative values indicate low-information samples (common, redundant). Use case: dataset curation, experimental design, diversity quantification.
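One simple way to read any of these spaces (a sketch built on the transform output, not part of the package API, and assuming the Phi arrays from the Quick Start have shape (n_samples, n_features)) is to average each sample's contributions across features and sort:
# Samples with the largest positive mean contribution in variance space ("stretchers")
mean_contrib = Phi_variance.mean(axis=1)
stretchers = np.argsort(mean_contrib)[::-1][:10]   # widen the feature distributions most
stabilizers = np.argsort(mean_contrib)[:10]        # sit near the bulk of the data
print("stretchers:", stretchers)
print("stabilizers:", stabilizers)
The same pattern applies to the skewness, kurtosis, and entropy spaces, with the interpretations given above.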
Hopkins Statistic
The Hopkins statistic H measures clustering tendency:
- H > 0.7: Strong clustering (samples group by behavior)
- H approximately 0.5: Random distribution (no natural grouping)
- H < 0.3: Regular/uniform distribution
Parameter Selection
n_permutations: Controls Monte Carlo estimation accuracy.
- 50-100: Quick exploration, debugging
- 200-500: Standard analysis
- 1000+: Publication, final results
n_jobs: Parallel processing for feature columns.
- -1: Use all available CPU cores
- 1: Single-threaded (for debugging)
- N: Use N cores
random_state: Set for reproducibility. The implementation uses antithetic sampling for variance reduction.
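One practical way to check whether n_permutations is high enough is to run the transform twice with different seeds and compare the estimates (an illustration using only the documented constructor arguments):
sb_a = ShapleyBehaviors(n_permutations=200, n_jobs=-1, random_state=0)
sb_b = ShapleyBehaviors(n_permutations=200, n_jobs=-1, random_state=1)
diff = np.abs(sb_a.transform(X, value_function='variance')
              - sb_b.transform(X, value_function='variance'))
print("max Monte Carlo discrepancy:", diff.max())  # shrinks as n_permutations grows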
API Reference
ShapleyBehaviors class:
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])
identify_outliers function:
outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')
Convenience functions:
from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy
)
Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)
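An end-to-end sketch combining the documented calls (parameter values are illustrative):
from shapley_behaviors import ShapleyBehaviors, identify_outliers

sb = ShapleyBehaviors(n_permutations=200, n_jobs=-1, random_state=42)
for vf in ['variance', 'skewness', 'kurtosis', 'entropy']:
    Phi = sb.transform(X, value_function=vf)
    idx, scores = identify_outliers(Phi, threshold=3.0, method='zscore')
    print(f"{vf}: {len(idx)} outliers")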
Runtime Estimates
- 500 samples, 100 permutations: 2-5 minutes
- 500 samples, 1000 permutations: 15-30 minutes
- 4000 samples, 100 permutations: 20-30 minutes
- 4000 samples, 1000 permutations: 2-3 hours
Troubleshooting
- ImportError: ensure the package is installed with pip install shapley_behaviors
- Long runtime: reduce N_PERMUTATIONS to 100 for testing
- Memory error: reduce N_JOBS or process the data in batches
- High additivity error warning: increase n_permutations
- No clustering detected (H approximately 0.5): the data may lack natural behavioral groupings
Citation
If you use this package, please cite:
Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. Machine Learning: Engineering, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
License
MIT License. See LICENSE file for details.