Skip to main content

Symmetric Residual Decomposition — instance-level influence analysis via Shapley-style residual attribution

Project description

RSHAP — Symmetric Residual Decomposition

PyPI version Python License: MIT

RSHAP is a Python package for instance-level influence analysis of machine learning models. It uses a Shapley-style residual decomposition to measure how much each training instance affects the prediction errors of every other instance, producing a rich picture of inter-instance relationships that standard feature-importance methods cannot provide.

This package implements and extends the method introduced in:

Liu, T. & Barnard, A. S. (2023). Shapley Based Residual Decomposition for Instance Analysis. Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR 202, pp. 21375–21387. https://proceedings.mlr.press/v202/liu23b.html

The original research artefacts and paper notebooks are maintained by Dr Tommy Liu at github.com/uilymmot/residual-decomposition.


What RSHAP does

Where classical SHAP answers "which features matter for this prediction?", RSHAP answers "which training instances matter, and for whom?"

For a dataset of N instances, RSHAP produces an N×N composition matrix (phi). Each entry phi[k, j] captures the marginal change in instance j's prediction residual when instance k is added to the training coalition. Aggregating this matrix in different ways reveals:

  • Which instances are the most influential — driving up or suppressing errors across the whole dataset
  • Which instances are the most sensitive — whose predictions are most affected by others
  • Inter-group dynamics — whether instances from one class or category systematically help or hinder predictions for another

Installation

pip install rshap

How RSHAP works — and what you need to provide

In a typical machine learning workflow you train a model to make predictions. RSHAP is a separate analysis that runs alongside that — it is not an extension of your trained model and does not receive it as input.

You will usually have two independent steps:

# Step A — your own model, trained for prediction and evaluation as normal
model = XGBClassifier(**xgb_params)
model.fit(X_train, y_train)
predictions = model.predict(X_test)   # use this model for your results

# Step B — RSHAP analysis, run independently on the same data
#   Pass the model CLASS and its params — not the trained model instance above.
#   RSHAP creates its own fresh instances of the model internally.
rshap = ResidualDecompositionSymmetric()
rshap.fit(X_train, y_train,
          model_class=XGBClassifier,   # the class, not the fitted 'model' above
          model_params=xgb_params,
          iterations=100, regression=False)

model and rshap are completely independent. RSHAP never sees your trained model. It uses the class and parameters you provide to build and train hundreds of its own fresh model instances internally, each on a different random subset of the data, in order to measure how each training instance affects the prediction errors of every other.

What you provide to rshap.fit():

  • X, y — your raw data (same data you trained your model on)
  • model_class — the type of model to use, as a string ('xgb', 'ridge', 'svm', 'rf', 'mlp') or a class directly (XGBClassifier, Ridge, etc.)
  • model_paramsmust be the same hyperparameters you used to train your own model. RSHAP's results reflect the residuals of the model it trains internally — if the hyperparameters differ, the phi matrix will describe a different model's behaviour, not the one you are analysing.

What RSHAP does with these:

  • Generates many random orderings of your N instances
  • For each ordering, trains a fresh instance of the specified model on progressively larger subsets, recording how each instance's addition changes the prediction errors of all others
  • Averages the results into the N×N phi matrix

Quickstart

import numpy as np
import matplotlib.pyplot as plt
from RSHAP import ResidualDecompositionSymmetric, draw_heatmap

# Step 1 — prepare your data (no model training needed)
X = np.random.randn(100, 5)
y = X @ [1.5, -1.0, 0.5, 0.2, -0.8] + np.random.randn(100) * 0.3

# Step 2 — tell RSHAP which model type to use and run the analysis
#   model_class  : string name or class — RSHAP instantiates and trains this internally
#   model_params : hyperparameters for that model (optional, defaults to {})
#   iterations   : how many permutation pairs to average over
#   regression   : True for continuous targets, False for classification labels
#   n_jobs       : parallel workers (-1 = all cores, 1 = sequential / safer on Windows)
rshap = ResidualDecompositionSymmetric()
rshap.fit(X, y, model_class='ridge', model_params={'alpha': 1.0},
          iterations=50, regression=True, n_jobs=1)

# Step 3 — retrieve the N×N phi matrix and its signed variant
phi     = rshap.get_composition()   # raw phi matrix
contrib = rshap.get_contribution()  # sign-adjusted matrix

influence = phi.sum(axis=0)         # per-instance influence score (column sums)
print("Most influential instance:", influence.argmax())

# Step 4 — visualise
plt.figure(figsize=(7, 5))
sc = rshap.cc_plot(coloring=y)
plt.colorbar(sc, label="Target")
plt.title("CC Plot")
plt.show()

labels = np.array(["GroupA"] * 50 + ["GroupB"] * 50)
fig, ax = draw_heatmap(rshap, labels)
plt.show()

fit() must be called before retrieving results or producing visualisations. Once complete, the fitted object can be passed to draw_heatmap or used to call cc_plot as many times as needed without re-running the analysis.

Classification

The workflow is identical — set regression=False and name your preferred model. RSHAP automatically selects the classification variant (e.g. SVRSVC, RidgeLogisticRegression):

rshap = ResidualDecompositionSymmetric()
rshap.fit(X, y_binary, model_class='svm', model_params={'C': 1.0},
          iterations=50, regression=False, n_jobs=1)

Passing a model class directly

If you prefer not to use the string shorthand, pass the class and its parameters explicitly:

from sklearn.ensemble import RandomForestRegressor

rshap = ResidualDecompositionSymmetric()
rshap.fit(X, y, model_class=RandomForestRegressor,
          model_params={'n_estimators': 100, 'max_depth': 5},
          iterations=50, regression=True, n_jobs=1)

Supported models

Pass a model as a string name or as a class object. The string interface automatically selects the regression or classification variant based on the regression flag.

String Regression class Classification class
'ridge' Ridge LogisticRegression
'logistic' Ridge LogisticRegression
'svm' SVR SVC
'rf' RandomForestRegressor RandomForestClassifier
'mlp' MLPRegressor MLPClassifier
'xgb' XGBRegressor XGBClassifier

Custom class:

from sklearn.linear_model import Ridge
rshap.fit(X, y, model_class=Ridge, model_params={'alpha': 2.0}, iterations=50)

Custom hyperparameters:

rshap.fit(X, y, model_class='xgb',
          model_params={'n_estimators': 200, 'max_depth': 4},
          iterations=100, regression=True, n_jobs=-1)

API reference

ResidualDecompositionSymmetric

fit(X, y, model_class, model_params, iterations, regression, n_jobs)

Parameter Type Default Description
X array (N, F) Feature matrix
y array (N,) Target vector
model_class str or class None Model to use (see table above)
model_params dict None Keyword arguments passed to the model constructor
iterations int 100 Number of symmetric permutation pairs. Pass -1 for automatic convergence.
regression bool True True for regression, False for classification
n_jobs int -1 Parallel workers (-1 = all cores, 1 = sequential)

get_composition() → ndarray (N, N)

Returns the raw phi matrix.

get_contribution() → ndarray (N, N)

Returns a sign-adjusted version of phi. Each row i is scaled by −sign(column_sum_i), so that positive values consistently indicate a helpful influence on prediction accuracy.

cc_plot(coloring, fontsizes, axis_lines, cc_function, categorical_colouring, drawparams, legend_label) → scatter object

Plots the CC plot on the current matplotlib axes. Full parameter reference: VISUALISATION_GUIDE.md

draw_heatmap(rshap_object, labels, decimals, num_ticks, fontsizes, ax) → (fig, ax)

Plots an inter-group contribution heatmap. Full parameter reference: VISUALISATION_GUIDE.md


Parallelism and memory

RSHAP uses joblib to parallelise permutation evaluations. On Windows, spawning many subprocesses that each import heavy libraries (pandas, sklearn, xgboost) can exhaust the virtual memory paging file. If you encounter [WinError 1455] or BrokenProcessPool errors, set n_jobs=1:

rshap.fit(X, y, model_class='rf', iterations=50, n_jobs=1)

Convergence mode

Pass iterations=-1 to run until the phi matrix stabilises rather than for a fixed number of iterations:

rshap.fit(X, y, model_class='ridge', iterations=-1, n_jobs=1)

The algorithm checks convergence every 10 iteration blocks and stops when the mean absolute relative change across the phi matrix falls below 2.5%.


Documentation

Full documentation is on GitHub:

Document Contents
IMPLEMENTATION.md Algorithm internals — permutations, phi matrix, convergence
INTERPRETATION.md What the results mean and how to read them
VISUALISATION_GUIDE.md CC plots and heatmaps — all options explained
TEST_SUITE.md Guide to the test suite notebook
Example notebook End-to-end worked example on MXene materials data (regression + classification)

Citation

If you use RSHAP in your research, please cite the original paper:

@InProceedings{pmlr-v202-liu23b,
  title     = {Shapley Based Residual Decomposition for Instance Analysis},
  author    = {Liu, Tommy and Barnard, Amanda S},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {21375--21387},
  year      = {2023},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}

The original research artefacts accompanying the paper are maintained by Dr Tommy Liu at github.com/uilymmot/residual-decomposition.


Authors


License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rshap-0.1.1.tar.gz (16.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rshap-0.1.1-py3-none-any.whl (11.6 kB view details)

Uploaded Python 3

File details

Details for the file rshap-0.1.1.tar.gz.

File metadata

  • Download URL: rshap-0.1.1.tar.gz
  • Upload date:
  • Size: 16.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.10

File hashes

Hashes for rshap-0.1.1.tar.gz
Algorithm Hash digest
SHA256 cc2fc7cc401c30e88757994209bc8334f1faef4e4c061f3c48d01aa240fa442b
MD5 b3ed45960117adad55e23e2db1f190cb
BLAKE2b-256 bebebc6d8145d27adf7afc222483f6def86cc30a183d3acefddb5c7df1fcf408

See more details on using hashes here.

File details

Details for the file rshap-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: rshap-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 11.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.10

File hashes

Hashes for rshap-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2d41f91ba9211cc7548d416862d58618580723342d08cf973ec049a19e2c5eb3
MD5 fe0bf76876310ec05837ddc8a3acd19a
BLAKE2b-256 40d8d15cd3243fd079671563cbb61c96b1ba0c6f0ea29945928a907b940e3ffd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page