SheShe: Smart High-dimensional Edge Segmentation & Hyperboundary Explorer
Edge segmentation and hyperboundary exploration based on local maxima of the class probability (classification) or the predicted value (regression).
Installation
Requires Python >=3.9. Working inside a virtual environment is recommended. Install the latest release from PyPI:
pip install sheshe
Base dependencies: numpy, pandas, scikit-learn>=1.1, matplotlib
For a development environment with tests:
pip install -e ".[dev]"
PYTHONPATH=src pytest -q
Quick API
from sheshe import ModalBoundaryClustering
# classification
clf = ModalBoundaryClustering(
    base_estimator=None,        # default LogisticRegression
    task="classification",      # "classification" | "regression"
    base_2d_rays=24,
    direction="center_out",     # "center_out" | "outside_in"
    scan_radius_factor=3.0,
    scan_steps=24,
    random_state=0,
)
# regression (example)
reg = ModalBoundaryClustering(task="regression")
Methods
- fit(X, y)
- predict(X)
- predict_proba(X) → classification: per-class probabilities; regression: normalized value in [0, 1]
- interpretability_summary(feature_names=None) → DataFrame with:
  - Type: "centroid" | "inflection_point"
  - Distance: radius from the center to the inflection point
  - Category: class (or "NA" in regression)
  - slope: df/dt at the inflection point
  - real_value / norm_value
  - coord_0..coord_{d-1} or feature names
- plot_pairs(X, y=None, max_pairs=None) → 2D plots for all pair combinations
- save(filepath) → save the model using joblib
- ModalBoundaryClustering.load(filepath) → load a saved instance
How does it work?
- Train/use a base model from sklearn (classification with predict_proba or regression with predict).
- Find local maxima via gradient ascent with barriers at the domain boundaries.
- From the maximum, trace rays (directions) on the hypersphere:
  - 2D: base_2d_rays rays (24 by default)
  - 3D: ~26 directions (coverage by spherical caps using Fibonacci sampling)
  - >3D: mixture of a few global directions + 2D/3D subspaces
- Along each ray, scan radially and compute the first inflection point according to direction:
  - center_out: from the center outward
  - outside_in: from the outside toward the center
  Also record the slope (df/dt) at that point.
- Connect the inflection points to form the boundary of the region with high probability/value.
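The ray tracing and radial scan described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not SheShe's actual implementation: `ray_directions_2d`, `fibonacci_sphere`, `first_inflection` and the Gaussian test surface are hypothetical names, and the inflection point is approximated as a sign change of the discrete second derivative along the ray.

```python
import numpy as np


def ray_directions_2d(n_rays):
    """Unit vectors evenly spaced on the circle (hypothetical helper)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    return np.column_stack([np.cos(angles), np.sin(angles)])


def fibonacci_sphere(n_dirs):
    """Roughly uniform unit vectors on the 3D sphere via a Fibonacci spiral."""
    i = np.arange(n_dirs) + 0.5
    z = 1.0 - 2.0 * i / n_dirs                 # evenly spaced heights
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i     # golden-angle increments
    r = np.sqrt(1.0 - z * z)
    return np.column_stack([r * np.cos(phi), r * np.sin(phi), z])


def first_inflection(f, center, direction, radius, steps, mode="center_out"):
    """Scan f along one ray; return (t, slope) at the first inflection point.

    Returns (None, None) when the second derivative never changes sign.
    """
    t = np.linspace(0.0, radius, steps)
    values = np.array([f(center + ti * direction) for ti in t])
    d1 = np.gradient(values, t)                # df/dt
    d2 = np.gradient(d1, t)                    # d2f/dt2
    crossings = np.where(np.sign(d2[:-1]) * np.sign(d2[1:]) < 0)[0]
    if crossings.size == 0:
        return None, None
    # center_out meets the innermost crossing first, outside_in the outermost
    i = crossings[0] if mode == "center_out" else crossings[-1]
    # Linear interpolation of the zero crossing between t[i] and t[i + 1]
    w = d2[i] / (d2[i] - d2[i + 1])
    return t[i] + w * (t[i + 1] - t[i]), d1[i] + w * (d1[i + 1] - d1[i])


def gaussian_bump(x):
    """Toy surface with a single maximum at the origin."""
    return float(np.exp(-0.5 * np.dot(x, x)))


results_2d = [first_inflection(gaussian_bump, np.zeros(2), d, 3.0, 61)
              for d in ray_directions_2d(8)]
results_3d = [first_inflection(gaussian_bump, np.zeros(3), d, 3.0, 61,
                               mode="outside_in")
              for d in fibonacci_sphere(26)]
radii_2d = np.array([r for r, _ in results_2d])
radii_3d = np.array([r for r, _ in results_3d])
slopes_2d = np.array([s for _, s in results_2d])
```

For this unit Gaussian the profile along every ray is exp(-t²/2), whose inflection lies at t = 1, so the recovered boundary is approximately the unit circle (or sphere) and the slope at the boundary is negative.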
Examples
Classification — Iris
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sheshe import ModalBoundaryClustering
iris = load_iris()
X, y = iris.data, iris.target
sh = ModalBoundaryClustering(
    base_estimator=LogisticRegression(max_iter=1000),
    task="classification",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)
print(sh.interpretability_summary(iris.feature_names).head())
sh.plot_pairs(X, y, max_pairs=3) # generate the plots
plt.show()
Classification with pre-trained model
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sheshe import ModalBoundaryClustering
wine = load_wine()
X, y = wine.data, wine.target
# Train a model independently
base_model = RandomForestClassifier(n_estimators=200, random_state=0)
base_model.fit(X, y)
# Use SheShe with that pre-fitted model
sh = ModalBoundaryClustering(
    base_estimator=base_model,
    task="classification",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)
sh.plot_pairs(X, y, max_pairs=2)
plt.show()
Regression — Diabetes
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sheshe import ModalBoundaryClustering
diab = load_diabetes()
X, y = diab.data, diab.target
sh = ModalBoundaryClustering(
    base_estimator=GradientBoostingRegressor(random_state=0),
    task="regression",
    base_2d_rays=8,
    random_state=0,
).fit(X, y)
print(sh.interpretability_summary(diab.feature_names).head())
sh.plot_pairs(X, max_pairs=3)
plt.show()
Saving figures
from pathlib import Path
import matplotlib.pyplot as plt
# after calling ``sh.plot_pairs(...)``
out_dir = Path("images")
out_dir.mkdir(exist_ok=True)
for i, fig_num in enumerate(plt.get_fignums()):
    plt.figure(fig_num)
    plt.savefig(out_dir / f"pair_{i}.png")
    plt.close(fig_num)
Plotting with pandas DataFrames
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sheshe import ModalBoundaryClustering
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
sh = ModalBoundaryClustering().fit(df, iris.target)
sh.plot_pairs(df, iris.target, max_pairs=2)  # uses column names on the axes
plt.show()
Visualizing interpretability summary
import matplotlib.pyplot as plt
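# continues from the previous example (``sh`` and ``df`` are already defined)
summary = sh.interpretability_summary(df.columns)
centroids = summary[summary["Type"] == "centroid"]
# feature names were passed, so the coordinate columns carry those names
x_col, y_col = df.columns[0], df.columns[1]
plt.scatter(centroids[x_col], centroids[y_col], c=centroids["Category"])
plt.xlabel(x_col)
plt.ylabel(y_col)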
plt.show()
Save and load model
from pathlib import Path
from sklearn.datasets import load_iris
from sheshe import ModalBoundaryClustering
iris = load_iris()
X, y = iris.data, iris.target
sh = ModalBoundaryClustering().fit(X, y)
path = Path("sheshe_model.joblib")
sh.save(path)
sh2 = ModalBoundaryClustering.load(path)
print((sh.predict(X) == sh2.predict(X)).all())
For more complete examples, see the examples/ folder.
SubspaceScout
SubspaceScout helps discover informative feature subspaces (pairs, trios, ...)
before running SheShe. It can work purely with mutual information or leverage
optional models like LightGBM+SHAP or EBM to rank feature interactions.
from sheshe import SubspaceScout
scout = SubspaceScout(
    # model_method='lightgbm',  # default uses MI; LightGBM and SHAP are optional
    max_order=4,                # explore pairs, trios and quartets
    top_m=50,                   # limit to top 50 informative features
    base_pairs_limit=12,        # seed pairs for orders >=3
    beam_width=10,              # combos kept per layer
    extend_candidate_pool=16,   # random candidate features per parent
    branch_per_parent=4,        # extensions per parent
    marginal_gain_min=1e-3,     # minimum gain to accept
    max_eval_per_order=150,     # cap MI evaluations per order
    sample_size=4096,           # subsample size
    time_budget_s=None,         # e.g., 15.0 for 15 seconds
    task='classification',
    random_state=0,
)
subspaces = scout.fit(X, y)
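To give an intuition for ranking feature pairs by mutual information, here is a small NumPy sketch. It is an assumption-laden illustration, not SubspaceScout's internal scoring (which this README does not describe): `discrete_mi` and `rank_pairs` are hypothetical helpers that quantile-bin each feature, combine each pair into one joint code, and estimate its mutual information with the labels from counts.

```python
import itertools

import numpy as np


def discrete_mi(a, b):
    """Mutual information (in nats) between two integer-coded 1D arrays."""
    a_codes = np.unique(a, return_inverse=True)[1].ravel()
    b_codes = np.unique(b, return_inverse=True)[1].ravel()
    joint = np.zeros((a_codes.max() + 1, b_codes.max() + 1))
    np.add.at(joint, (a_codes, b_codes), 1.0)   # contingency table
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))


def rank_pairs(X, y, n_bins=4):
    """Score every feature pair by MI between its joint bin code and y."""
    inner_q = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    binned = np.column_stack(
        [np.digitize(col, np.quantile(col, inner_q)) for col in X.T]
    )
    scores = {}
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        joint_code = binned[:, i] * n_bins + binned[:, j]
        scores[(i, j)] = discrete_mi(joint_code, y)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# XOR-like toy data: the label depends on features 0 and 1 only jointly
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
y_toy = ((X_toy[:, 0] > 0) ^ (X_toy[:, 1] > 0)).astype(int)
ranked = rank_pairs(X_toy, y_toy)
```

On this toy data neither feature 0 nor feature 1 is informative alone, yet the pair (0, 1) ranks first, which is the kind of interaction a subspace search is meant to surface.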
Experiments and benchmark
The experiments comparing against unsupervised algorithms are located in
the experiments/ folder. The script
compare_unsupervised.py evaluates five
different datasets, explores parameters of SheShe, KMeans and
DBSCAN, and stores four metrics (ARI, homogeneity, completeness,
v_measure) along with the execution time (runtime_sec).
python experiments/compare_unsupervised.py --runs 5
cat benchmark/unsupervised_results_summary.csv | head
Results are generated inside benchmark/ (per-run values, with means in
*_summary.csv).
For the manuscript we provide additional scripts in
paper_experiments.py which perform
supervised comparisons, ablation studies over base_2d_rays and direction,
and sensitivity analyses w.r.t. dimensionality and Gaussian noise. Executing
the script generates tables with all runs and a summary (*_summary.csv),
as well as figures (*.png) under benchmark/:
python experiments/paper_experiments.py --runs 5
Key parameters
- base_2d_rays → controls angular resolution in 2D (24 by default). 3D scales to ~26; d>3 uses subspaces.
- direction → "center_out" | "outside_in" to locate the inflection point.
- scan_radius_factor, scan_steps → size and resolution of the radial scan.
- grad_* → hyperparameters of gradient ascent (rate, iterations, tolerances).
- max_subspaces → max number of subspaces considered when d>3.
- density_alpha / density_k → optional density penalty computed with an HNSW k-NN search (via hnswlib) to keep centers inside the data cloud. The normalized value is multiplied by (density(x))**density_alpha; set density_alpha=0 to disable.
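The effect of the density penalty can be illustrated with a short NumPy sketch. Two things here are assumptions: the README does not define density(x), so `knn_density` uses a hypothetical proxy (inverse mean distance to the k nearest training points), and the neighbor search is brute force rather than the HNSW index from hnswlib.

```python
import numpy as np


def knn_density(points, data, k=5):
    """Hypothetical density proxy in (0, 1]: higher inside the data cloud."""
    # Pairwise Euclidean distances, shape (n_points, n_data)
    dists = np.linalg.norm(points[:, None, :] - data[None, :, :], axis=2)
    nearest = np.sort(dists, axis=1)[:, :k]
    return 1.0 / (1.0 + nearest.mean(axis=1))


def penalized_value(norm_value, density, density_alpha):
    """Multiply the normalized value by density**density_alpha."""
    return norm_value * density ** density_alpha


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
# One candidate center inside the cloud, one far outside it
candidates = np.array([[0.0, 0.0], [6.0, 6.0]])
norm_value = np.array([0.8, 0.8])

density = knn_density(candidates, data, k=5)
penalized = penalized_value(norm_value, density, density_alpha=1.0)
unpenalized = penalized_value(norm_value, density, density_alpha=0.0)
```

Both candidates start with the same normalized value, but the one far from the data is scaled down much more strongly, while `density_alpha=0` leaves the values unchanged.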
Performance tips
- Defaults favour speed: base_2d_rays=24, scan_steps=24 and n_max_seeds=2.
- The heuristic auto_rays_by_dim=True (default) reduces rays for high dimensional datasets:
  - 25–64 features → base_2d_rays capped at 16.
  - 65+ features → base_2d_rays capped at 12.
  For 30D problems such as Breast Cancer this matches the recommended base_2d_rays=16.
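The caps listed above amount to a simple lookup. The function below is a hypothetical helper that mirrors the documented thresholds; it is not taken from the SheShe source.

```python
def capped_rays(base_2d_rays, n_features):
    """Apply the documented caps (hypothetical helper, not the library code)."""
    if n_features >= 65:
        return min(base_2d_rays, 12)
    if n_features >= 25:
        return min(base_2d_rays, 16)
    return base_2d_rays


# 30 features, as in the Breast Cancer example mentioned above
rays_30d = capped_rays(24, 30)
rays_100d = capped_rays(24, 100)
rays_4d = capped_rays(24, 4)
```

A value already below the cap is left alone, so `capped_rays(8, 30)` stays at 8.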
Limitations
- Depends on the surface produced by the base model (can be rough in RF).
- In high dimension, the boundary is an approximation (subspaces).
- Finds local maxima (does not guarantee the global one), mitigated with multiple seeds.
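The last limitation can be seen in a toy setting. The sketch below runs finite-difference gradient ascent from two seeds and keeps the best result. It makes several assumptions: the helper names are hypothetical, the "barrier" is implemented as simple clipping to a bounding box (the README does not say how barriers are realized), and the step size and tolerances are illustrative values.

```python
import numpy as np


def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


def ascend(f, seed, lower, upper, rate=0.1, iters=500, tol=1e-8):
    """Gradient ascent that clips every iterate to the box [lower, upper]."""
    x = np.clip(np.asarray(seed, dtype=float), lower, upper)
    for _ in range(iters):
        x_new = np.clip(x + rate * numerical_gradient(f, x), lower, upper)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x


def two_bumps(x):
    """Global peak near (2, 2) and a lower local peak near (-2, -2)."""
    high = np.exp(-0.5 * np.sum((x - 2.0) ** 2))
    low = 0.6 * np.exp(-0.5 * np.sum((x + 2.0) ** 2))
    return float(high + low)


lower, upper = np.full(2, -4.0), np.full(2, 4.0)
seeds = [np.array([-3.0, -1.0]), np.array([1.0, 3.0])]
peaks = [ascend(two_bumps, s, lower, upper) for s in seeds]
best = max(peaks, key=two_bumps)
```

The first seed converges to the lower peak near (-2, -2); only the second seed reaches the higher one near (2, 2), which is why several seeds reduce, but do not eliminate, the risk of missing the global maximum.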
Contribute
Improvements are welcome. To propose changes:
- Fork the repository and create a descriptive branch.
- Install development dependencies and run the tests:
  pip install -e ".[dev]"
  PYTHONPATH=src pytest -q
- Submit a pull request with a clear description of the change.
License
MIT