Sundar-Tibshirani Gap Statistic for evaluating any cluster solution
A generalized Gap Statistic implementation for evaluating any cluster solution, not just k-means.
Overview
The original Gap Statistic (Tibshirani, Walther, & Hastie, 2001) is a popular method for determining the optimal number of clusters. However, as commonly implemented it evaluates only the k-means solutions generated inside its own search over k, re-clustering every reference data set along the way.
The Sundar-Tibshirani Gap Statistic extends this framework to evaluate arbitrary cluster solutions from:
- Hierarchical clustering
- DBSCAN and density-based methods
- Gaussian Mixture Models
- Spectral clustering
- Expert-defined segments
- Any other clustering approach
Key Innovation
| Original Gap Statistic | Sundar-Tibshirani Gap Statistic |
|---|---|
| Clusters reference data with k-means | Applies user-provided labels to reference data |
| Algorithm-specific | Algorithm-agnostic |
| Evaluates k-means solutions only | Evaluates any cluster assignment |
Installation
pip install sundar-gap-stat
Or install from source:
git clone https://github.com/pvsundar/sundargap_statistic.git
cd sundargap_statistic
pip install -e .
If you also want test dependencies:
pip install -e ".[test]"
Quick Start
from sundar_gap_stat import SundarTibshiraniGapStatistic
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import numpy as np
# Your data
X = np.random.randn(200, 4)
X_scaled = StandardScaler().fit_transform(X)
# Any clustering algorithm
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_scaled)
# Evaluate with Sundar-Tibshirani Gap Statistic
gap_stat = SundarTibshiraniGapStatistic(
    pca_sampling=True,
    use_user_labels=True,  # Key parameter!
    return_params=True
)
gap_value, params = gap_stat.compute_gap_statistic(
    X=X_scaled,
    labels=labels,
    B=100  # Number of reference samples
)
print(f"Gap Statistic: {gap_value:.3f}")
print(f"Standard Error: {params['sim_sks']:.3f}")
Comparing Multiple Clustering Solutions
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
# Initialize Gap Statistic
gap_stat = SundarTibshiraniGapStatistic(use_user_labels=True)
# Compare different algorithms
algorithms = {
    'K-Means': KMeans(n_clusters=3, n_init=15, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
    'GMM': GaussianMixture(n_components=3, random_state=42)
}
for name, algo in algorithms.items():
    labels = algo.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    print(f"{name}: Gap = {gap:.3f}")
Finding Optimal k with Any Algorithm
# Use hierarchical clustering to find optimal k
gaps = []
for k in range(2, 8):
    agg = AgglomerativeClustering(n_clusters=k)
    labels = agg.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    gaps.append((k, gap))
    print(f"k={k}: Gap = {gap:.3f}")
# Select k using elbow method or standard Gap criterion
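The standard Gap criterion from Tibshirani et al. (2001) picks the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). The sketch below applies that rule to lists of gap values and standard errors. It is a standalone illustration, not part of the package: `select_k_gap_criterion` is a hypothetical helper, the numbers are made up, and it assumes you have collected the standard errors yourself (for example from `params['sim_sks']` with `return_params=True`).

```python
import numpy as np

def select_k_gap_criterion(ks, gaps, sks):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s(k+1).

    ks, gaps, sks are equal-length sequences ordered by increasing k.
    Falls back to the k with the largest gap if no k satisfies the rule.
    """
    ks = np.asarray(ks)
    gaps = np.asarray(gaps, dtype=float)
    sks = np.asarray(sks, dtype=float)
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return int(ks[i])
    return int(ks[np.argmax(gaps)])

# Illustrative numbers only (not output from the package)
ks = [2, 3, 4, 5]
gaps = [0.10, 0.45, 0.47, 0.40]
sks = [0.03, 0.03, 0.03, 0.03]
best_k = select_k_gap_criterion(ks, gaps, sks)
print(f"Selected k = {best_k}")
```

The fallback to the largest gap is a design choice for this sketch; when the gap keeps rising over the whole range, widening the range of k is usually the better response.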
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| distance_metric | str or callable | 'euclidean' | Distance metric for dispersion calculation |
| pca_sampling | bool | True | Use PCA-based reference distribution (recommended) |
| standardize_within_pca | bool | False | Standardize before PCA |
| use_user_labels | bool | True | True = Sundar-Tibshirani extension; False = original Tibshirani |
| return_params | bool | False | Return additional diagnostics |
| n_init | int | 12 | K-means initializations (only if use_user_labels=False) |
| random_state | int | 7142 | Random seed for reproducibility |
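To make the pca_sampling option more concrete, here is a minimal sketch of the PCA-aligned reference distribution described in Tibshirani et al. (2001): center the data, rotate it onto its principal axes, sample uniformly over the range of each rotated column, and rotate back. The function name `sample_pca_reference` is hypothetical, and this is an assumption about what the option does in general terms, not the package's actual code.

```python
import numpy as np

def sample_pca_reference(X, rng):
    """Draw one reference data set, uniform over the PCA-aligned bounding box of X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_rot = Xc @ Vt.T
    lo, hi = X_rot.min(axis=0), X_rot.max(axis=0)
    # Uniform sample over the range of each rotated column
    Z_rot = rng.uniform(lo, hi, size=X_rot.shape)
    # Rotate back and restore the original location
    return Z_rot @ Vt + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = sample_pca_reference(X, rng)
print(Z.shape)
```

Adding the mean back does not affect within-cluster dispersion; it only keeps the reference sample in the same region as the data.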
Returned Parameters
When return_params=True, the method returns a tuple (gap, params) where params contains:
- Wk: Observed within-cluster dispersion
- sim_Wks: Array of simulated Wk values from reference distributions
- sim_sks: Adjusted standard error (sqrt(1 + 1/B) * SD)
- gap: Gap statistic value
- sd_k: Standard deviation of log(sim_Wks)
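The returned quantities are related by simple arithmetic. The following sketch recomputes gap, sd_k, and sim_sks from a made-up Wk and sim_Wks. The numbers are hypothetical, and the use of the population standard deviation (ddof=0) follows the 2001 paper; whether the package uses the same convention is an assumption.

```python
import numpy as np

# Hypothetical values: observed dispersion and B = 5 simulated dispersions
Wk = 60.0
sim_Wks = np.array([120.0, 118.0, 125.0, 122.0, 119.0])
B = len(sim_Wks)

log_sim = np.log(sim_Wks)
gap = log_sim.mean() - np.log(Wk)        # mean log reference dispersion minus observed
sd_k = log_sim.std()                     # population SD (ddof=0), assumed convention
sim_sks = np.sqrt(1.0 + 1.0 / B) * sd_k  # adjustment for Monte Carlo error

print(f"gap={gap:.3f}, sd_k={sd_k:.4f}, sim_sks={sim_sks:.4f}")
```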
Mathematical Foundation
The Sundar-Tibshirani Gap Statistic is defined as:
$$\text{Gap}_{\text{ST}}(k) = E^*[\log(W_k^*(\mathbf{L}))] - \log(W_k(\mathbf{L}))$$
where:
- $\mathbf{L}$ = fixed cluster labels from any source
- $W_k^*(\mathbf{L})$ = within-cluster dispersion applying labels $\mathbf{L}$ to reference data
- $E^*$ = expectation over reference distributions
This differs from the original Gap Statistic, which re-clusters reference data with k-means.
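The definition above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the package's implementation: `within_dispersion` and `fixed_label_gap` are hypothetical names, dispersion is the sum of squared Euclidean distances to each cluster centroid, and the reference distribution is a plain uniform bounding box rather than the PCA-based one.

```python
import numpy as np

def within_dispersion(X, labels):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def fixed_label_gap(X, labels, B=50, seed=0):
    """Gap with the SAME labels applied row by row to each reference sample."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_ref = np.empty(B)
    for b in range(B):
        Z = rng.uniform(lo, hi, size=X.shape)              # uniform reference sample
        log_ref[b] = np.log(within_dispersion(Z, labels))  # no re-clustering
    gap = log_ref.mean() - np.log(within_dispersion(X, labels))
    s_k = np.sqrt(1.0 + 1.0 / B) * log_ref.std()
    return gap, s_k

# Two well-separated blobs with their true labels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels = np.repeat([0, 1], 50)
gap, s_k = fixed_label_gap(X, labels)
print(f"Gap = {gap:.3f}, s_k = {s_k:.3f}")
```

Because the labels carry no information about where reference points fall, the reference dispersion behaves like that of a random partition, so a labeling that tracks real structure in X yields a larger gap than a shuffled one.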
Simulation Study Results
Comprehensive simulations demonstrate:
- Correct k detection across well-separated, overlapping, and elongated cluster structures
- Algorithm invariance: Consistent evaluation across K-Means, Agglomerative, and GMM
- Noise robustness: Maintains stability under moderate noise better than Silhouette
- Sample requirements: Reliable estimates with n >= 100 observations
- Monte Carlo convergence: B = 100 provides excellent precision
Citation
If you use this package in your research, please cite the software and the original Gap Statistic paper:
@software{balakrishnan_sundar_gap_stat_2026,
  author = {Balakrishnan, P. V. Sundar},
  title = {Sundar-Tibshirani Gap Statistic (sundar-gap-stat)},
  year = {2026},
  url = {https://github.com/pvsundar/sundargap_statistic}
}
References
- Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.
License
MIT License; see LICENSE.
Author
P. V. Sundar Balakrishnan
University of Washington Bothell
School of Business
sundar@uw.edu