Skip to main content

SheShe: Smart High-dimensional Edge Segmentation & Hyperboundary Explorer

Project description

SheShe

Smart High-dimensional Edge Segmentation & Hyperboundary Explorer

Edge segmentation and hyperboundary exploration based on local maxima of the class probability (classification) or the predicted value (regression).


Installation

Requires Python >=3.9 and it is recommended to work inside a virtual environment.

pip install -e .

Base dependencies: numpy, pandas, scikit-learn>=1.1, matplotlib

For a development environment with tests:

pip install -e ".[dev]"
PYTHONPATH=src pytest -q

Quick API

from sheshe import ModalBoundaryClustering

# classification
clf = ModalBoundaryClustering(
    base_estimator=None,           # default LogisticRegression
    task="classification",         # "classification" | "regression"
    base_2d_rays=8,                # number of rays in 2D (≈45°)
    direction="center_out",        # "center_out" | "outside_in"
    scan_radius_factor=3.0,
    scan_steps=64,
    random_state=0
)

# regression (example)
reg = ModalBoundaryClustering(task="regression")

Methods

  • fit(X, y)
  • predict(X)
  • predict_proba(X) → classification: per-class probabilities; regression: normalized value [0,1]
  • interpretability_summary(feature_names=None) → DataFrame with:
    • Type: "centroid" | "inflection_point"
    • Distance: radius from the center to the inflection point
    • Category: class (or "NA" in regression)
    • slope: df/dt at the inflection point
    • real_value / norm_value
    • coord_0..coord_{d-1} or feature names
  • plot_pairs(X, y=None, max_pairs=None) → 2D plots for all pair combinations
  • save(filepath) → save the model using joblib
  • ModalBoundaryClustering.load(filepath) → load a saved instance

How does it work?

  1. Train/use a base model from sklearn (classification with predict_proba or regression with predict).
  2. Find local maxima via gradient ascent with barriers at the domain boundaries.
  3. From the maximum, trace rays (directions) on the hypersphere:
    • 2D: 8 rays by default
    • 3D: ~26 directions (coverage by spherical caps using Fibonacci sampling)
    • 3D: mixture of a few global directions + 2D/3D subspaces

  4. Along each ray, scan radially and compute the first inflection point according to direction:
    • center_out: from the center outward
    • outside_in: from the outside toward the center Also record the slope (df/dt) at that point.
  5. Connect the inflection points to form the boundary of the region with high probability/value.

Examples

Classification — Iris

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sheshe import ModalBoundaryClustering

iris = load_iris()
X, y = iris.data, iris.target

sh = ModalBoundaryClustering(
    base_estimator=LogisticRegression(max_iter=1000),
    task="classification",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)

print(sh.interpretability_summary(iris.feature_names).head())
sh.plot_pairs(X, y, max_pairs=3)   # generate the plots
plt.show()

Classification with pre-trained model

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sheshe import ModalBoundaryClustering

wine = load_wine()
X, y = wine.data, wine.target

# Train a model independently
base_model = RandomForestClassifier(n_estimators=200, random_state=0)
base_model.fit(X, y)

# Use SheShe with that pre-fitted model
sh = ModalBoundaryClustering(
    base_estimator=base_model,
    task="classification",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)

sh.plot_pairs(X, y, max_pairs=2)
plt.show()

Regression — Diabetes

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sheshe import ModalBoundaryClustering

diab = load_diabetes()
X, y = diab.data, diab.target

sh = ModalBoundaryClustering(
    base_estimator=GradientBoostingRegressor(random_state=0),
    task="regression",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)

print(sh.interpretability_summary(diab.feature_names).head())
sh.plot_pairs(X, max_pairs=3)
plt.show()

Saving figures

from pathlib import Path
import matplotlib.pyplot as plt

# after calling ``sh.plot_pairs(...)``
out_dir = Path("images")
out_dir.mkdir(exist_ok=True)
for i, fig_num in enumerate(plt.get_fignums()):
    plt.figure(fig_num)
    plt.savefig(out_dir / f"pair_{i}.png")
plt.close(fig_num)

Plotting with pandas DataFrames

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sheshe import ModalBoundaryClustering

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

sh = ModalBoundaryClustering().fit(df, iris.target)
sh.plot_pairs(df, iris.target, max_pairs=2)  # usa nombres de columnas en los ejes
plt.show()

Visualizing interpretability summary

import matplotlib.pyplot as plt

summary = sh.interpretability_summary(df.columns)
centroids = summary[summary["Type"] == "centroid"]
plt.scatter(centroids["coord_0"], centroids["coord_1"], c=centroids["Category"])
plt.xlabel("coord_0")
plt.ylabel("coord_1")
plt.show()

Save and load model

from pathlib import Path
from sklearn.datasets import load_iris
from sheshe import ModalBoundaryClustering

iris = load_iris()
X, y = iris.data, iris.target

sh = ModalBoundaryClustering().fit(X, y)
path = Path("sheshe_model.joblib")
sh.save(path)
sh2 = ModalBoundaryClustering.load(path)
print((sh.predict(X) == sh2.predict(X)).all())

For more complete examples, see the examples/ folder.

Experiments and benchmark

The experiments comparing against unsupervised algorithms are located in the experiments/ folder. The script compare_unsupervised.py evaluates five different datasets, explores parameters of SheShe, KMeans and DBSCAN, and stores four metrics (ARI, homogeneity, completeness, v_measure) along with the execution time (runtime_sec).

python experiments/compare_unsupervised.py --runs 5
cat benchmark/unsupervised_results_summary.csv | head

Results are generated inside benchmark/ (valores por repetición y medias en *_summary.csv).

For the manuscript we provide additional scripts in paper_experiments.py which perform supervised comparisons, ablation studies over base_2d_rays and direction, and sensitivity analyses w.r.t. dimensionality and Gaussian noise. Executing the script generates tables with todas las repeticiones y un resumen (*_summary.csv), además de figuras (*.png) bajo benchmark/:

python experiments/paper_experiments.py --runs 5

Key parameters

  • base_2d_rays → controls angular resolution in 2D (8 by default). 3D scales to ~26; d>3 uses subspaces.
  • direction → "center_out" | "outside_in" to locate the inflection point.
  • scan_radius_factor, scan_steps → size and resolution of the radial scan.
  • grad_* → hyperparameters of gradient ascent (rate, iterations, tolerances).
  • max_subspaces → max number of subspaces considered when d>3.

Limitations

  • Depends on the surface produced by the base model (can be rough in RF).
  • In high dimension, the boundary is an approximation (subspaces).
  • Finds local maxima (does not guarantee the global one), mitigated with multiple seeds.

Contribute

Improvements are welcome. To propose changes:

  1. Fork the repository and create a descriptive branch.

  2. Install development dependencies and run the tests:

    pip install -e ".[dev]"
    PYTHONPATH=src pytest -q
    
  3. Submit a pull request with a clear description of the change.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sheshe-0.1.0.tar.gz (22.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sheshe-0.1.0-py3-none-any.whl (15.8 kB view details)

Uploaded Python 3

File details

Details for the file sheshe-0.1.0.tar.gz.

File metadata

  • Download URL: sheshe-0.1.0.tar.gz
  • Upload date:
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for sheshe-0.1.0.tar.gz
Algorithm Hash digest
SHA256 44073929914366dc841153a208e70f872f332f61aece0ca62d7a973db97634f7
MD5 67d0b656b9810754f3c78984007a6df8
BLAKE2b-256 a61a16b194e7c964b651cec6828ae9ae40510ab6c36e4fde96a83b9a9f9c66bb

See more details on using hashes here.

File details

Details for the file sheshe-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sheshe-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for sheshe-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8e8c79fd292c341ae8eaa47e7c2a90a7c403016bbf2ce2fd6e22e4afbfabee81
MD5 eb638d8474def0b3b27138f1b8faacab
BLAKE2b-256 415e713b1663c4ec6db216340eeb9f2a96be9869eb1335b82902d6c10b988e3a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page