

shapley_behaviors

Shapley value transformations for explainable behavioral data analysis.

Overview

Traditional clustering asks "which samples are similar?" but not "why do they cluster together?"

Shapley behavioral transformations answer the "why" by decomposing statistical properties (variance, skewness, kurtosis, entropy) into individual sample contributions. Samples that cluster in behavioral space share the same statistical role in the dataset, providing mechanistic and actionable insights.
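
The decomposition itself is the classic permutation (Monte Carlo) estimate of Shapley values, with subsets of samples as coalitions and a statistic such as variance as the value function. A minimal numpy sketch for a 1-D variance decomposition (an illustration of the idea, not the package's implementation):

```python
import numpy as np

def shapley_variance(x, n_permutations=200, seed=0):
    """Monte Carlo Shapley decomposition of the variance of a 1-D sample.

    The value function is v(S) = variance of the subset S (0 when |S| < 2);
    each sample's Shapley value is its average marginal contribution to v
    over random orderings of the samples.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        order = rng.permutation(n)
        prev = 0.0
        for k in range(n):
            v = x[order[:k + 1]].var() if k + 1 >= 2 else 0.0
            phi[order[k]] += v - prev   # marginal contribution of this sample
            prev = v
    return phi / n_permutations

x = np.array([0.1, -0.2, 0.05, 3.0, -0.1])  # one obvious extreme sample at 3.0
phi = shapley_variance(x)
# Efficiency property: the per-sample contributions sum to the total variance,
# and the extreme sample receives the largest positive share.
```

Samples with large positive values widen the spread ("stretchers"), while negative values mark stabilizers, matching the interpretation of the variance space described under "Understanding Behavioral Spaces".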

This package implements the methodology from:

Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. Machine Learning: Engineering, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6

Installation

pip install shapley_behaviors

Quick Start

import numpy as np
from shapley_behaviors import ShapleyBehaviors

# Load your data (n_samples, n_features)
X = np.random.randn(500, 20)

# Transform to behavioral spaces
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)

Phi_variance = sb.transform(X, value_function='variance')
Phi_skewness = sb.transform(X, value_function='skewness')
Phi_kurtosis = sb.transform(X, value_function='kurtosis')
Phi_entropy = sb.transform(X, value_function='entropy')

# Or compute all at once
behavioral_spaces = sb.transform_multiple(X)

Outlier Detection

from shapley_behaviors import identify_outliers

outlier_indices, outlier_scores = identify_outliers(Phi_kurtosis, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")

Getting the Explorer Scripts

The package includes standalone explorer scripts for comprehensive analysis with visualizations, statistics, and outlier detection. Copy them to your working directory:

from shapley_behaviors import copy_scripts

# Copy all scripts to current directory
copy_scripts(".")

# Or copy to a specific directory
copy_scripts("./analysis")

# Or copy only one script
copy_scripts(".", scripts=["behavioral_space_explorer"])

Behavioral Space Explorer

Configure and run in Jupyter:

# Configuration
SEED = 42
N_PERMUTATIONS = 1000      # 100 for quick tests, 1000 for publication
N_JOBS = -1                # -1 uses all CPU cores

DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"

# Optional: select specific features to highlight
SELECTED_FEATURES = ["feature1", "feature2", "feature3"]

# Run the explorer
%run -i behavioral_space_explorer.py

The explorer generates:

  • {name}_behavioral_spaces.npy - All behavioral transformations
  • {name}_behave_{space}_{label}.png - PCA plots colored by each label
  • {name}_hopkins_statistics.csv - Clustering tendency metrics
  • {name}_clustering_statistics.csv - Variance explained, pairwise distances
  • {name}_outliers_{space}.csv - Outlier samples for each space

Behavioral Region Explorer

For targeted analysis of specific regions in behavioral space:

# Basic configuration
DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined"  # or "separate"

# Define regions of interest in PCA space
USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red"
    },
    "entropy_outliers": {
        "space": "entropy",
        "pc1_range": (-0.5, -0.2),
        "pc2_range": (0.1, 0.4),
        "description": "Low entropy samples",
        "color": "blue"
    }
}

# Run the region explorer
%run -i behavioral_region_explorer.py

Understanding Behavioral Spaces

Variance Space: Decomposes how each sample contributes to feature spread. Negative values indicate stabilizers (typical samples near the mean); positive values indicate stretchers (extreme samples that widen the distribution). Use case: quality control, identifying process instability.

Skewness Space: Decomposes how each sample contributes to distributional asymmetry. Negative values pull the distribution below the mean; positive values pull it above. Near-zero values maintain symmetry. Use case: detecting biased synthesis, directional process drift.

Kurtosis Space: Decomposes how each sample contributes to tail heaviness. Negative values indicate core samples (predictable, well-behaved). Positive values indicate tail samples (rare extreme events). Use case: risk assessment, anomaly detection, reliability analysis.

Entropy Space: Decomposes how each sample contributes to information content. Positive values indicate high-information samples (rare, unique feature combinations). Negative values indicate low-information samples (common, redundant). Use case: dataset curation, experimental design, diversity quantification.

Hopkins Statistic

The Hopkins statistic H measures clustering tendency:

  • H > 0.7: Strong clustering (samples group by behavior)
  • H ≈ 0.5: Random distribution (no natural grouping)
  • H < 0.3: Regular/uniform distribution
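
For reference, a common formulation of the Hopkins statistic can be sketched in a few lines of numpy (an illustrative implementation; the explorer scripts may compute it differently):

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic H for clustering tendency (one common formulation).

    Compares nearest-neighbour distances from m uniform random probes to the
    data (u_i) with distances from m real points to their nearest other data
    point (w_i): H = sum(u) / (sum(u) + sum(w)).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))        # uniform probes in bounding box
    idx = rng.choice(n, size=m, replace=False)  # real-sample probes

    def nn_dist(P, data, exclude_self=False):
        D = np.linalg.norm(P[:, None, :] - data[None, :, :], axis=2)
        if exclude_self:
            D[D == 0] = np.inf  # drop the zero self-distance
        return D.min(axis=1)

    u = nn_dist(U, X)
    w = nn_dist(X[idx], X, exclude_self=True)
    return u.sum() / (u.sum() + w.sum())

# Two tight, well-separated blobs should give H close to 1
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.05, (100, 2)),
                   rng.normal(5, 0.05, (100, 2))])
H = hopkins(blobs)
```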

Parameter Selection

n_permutations: Controls Monte Carlo estimation accuracy.

  • 50-100: Quick exploration, debugging
  • 200-500: Standard analysis
  • 1000+: Publication, final results

n_jobs: Parallel processing for feature columns.

  • -1: Use all available CPU cores
  • 1: Single-threaded (for debugging)
  • N: Use N cores

random_state: Set for reproducibility. The implementation uses antithetic sampling for variance reduction.
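
Antithetic sampling pairs each random ordering with its reverse; the two marginal-contribution estimates are negatively correlated, so their average is less noisy than two independent permutations. A sketch of the pairing (the package's internal scheme may differ):

```python
import numpy as np

def antithetic_permutations(n, n_pairs, seed=0):
    """Generate sample orderings in antithetic pairs: each random
    permutation is followed by its reverse, which negatively correlates
    the two Shapley estimates and reduces Monte Carlo variance."""
    rng = np.random.default_rng(seed)
    perms = []
    for _ in range(n_pairs):
        p = rng.permutation(n)
        perms.append(p)
        perms.append(p[::-1])  # the antithetic partner
    return perms

perms = antithetic_permutations(5, 3)  # 3 pairs -> 6 orderings
```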

API Reference

ShapleyBehaviors class:

sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])

identify_outliers function:

outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')
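
With method='zscore', the idea is to flag samples whose behavioral coordinates sit several standard deviations from the column means. An illustrative re-implementation (hypothetical function name; the package's own scoring may differ in detail):

```python
import numpy as np

def zscore_outliers(Phi, threshold=3.0):
    """Flag rows of a behavioral matrix whose largest absolute
    column-wise z-score exceeds `threshold`."""
    z = np.abs((Phi - Phi.mean(axis=0)) / Phi.std(axis=0))
    scores = z.max(axis=1)
    idx = np.where(scores > threshold)[0]
    return idx, scores[idx]

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 4))
Phi[7] += 10.0  # plant one unmistakable outlier
idx, scores = zscore_outliers(Phi, threshold=3.0)
```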

Convenience functions:

from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy
)

Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)

Runtime Estimates

  • 500 samples, 100 permutations: 2-5 minutes
  • 500 samples, 1000 permutations: 15-30 minutes
  • 4000 samples, 100 permutations: 20-30 minutes
  • 4000 samples, 1000 permutations: 2-3 hours

Troubleshooting

ImportError: Ensure the package is installed with pip install shapley_behaviors

Long runtime: Reduce N_PERMUTATIONS to 100 for testing

Memory error: Reduce N_JOBS or process data in batches

High additivity error warning: Increase n_permutations

No clustering detected (H ≈ 0.5): Data may lack natural behavioral groupings

Citation

If you use this package, please cite:

Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of
Shapley behaviours in materials data. Machine Learning: Engineering,
1, 015004. https://doi.org/10.1088/3049-4761/adaaf6

License

MIT License. See LICENSE file for details.

