Sundar-Tibshirani Gap Statistic for evaluating any cluster solution

Sundar-Tibshirani Gap Statistic

Python 3.8+ · License: MIT · scikit-learn compatible

A generalized Gap Statistic implementation for evaluating any cluster solution, not just k-means.

Overview

The original Gap Statistic (Tibshirani, Walther, & Hastie, 2001) is a popular method for determining the optimal number of clusters. However, it was designed specifically to evaluate k-means clustering solutions generated during its own optimization process.

The Sundar-Tibshirani Gap Statistic extends this framework to evaluate arbitrary cluster solutions from:

  • Hierarchical clustering
  • DBSCAN and density-based methods
  • Gaussian Mixture Models
  • Spectral clustering
  • Expert-defined segments
  • Any other clustering approach

Key Innovation

| Original Gap Statistic | Sundar-Tibshirani Gap Statistic |
| --- | --- |
| Clusters reference data with k-means | Applies user-provided labels to reference data |
| Algorithm-specific | Algorithm-agnostic |
| Evaluates k-means solutions only | Evaluates any cluster assignment |

Installation

pip install sundar-gap-stat

Or install from source:

git clone https://github.com/pvsundar/sundargap_statistic.git
cd sundargap_statistic
pip install -e .

If you also want test dependencies:

pip install -e ".[test]"

Quick Start

from sundar_gap_stat import SundarTibshiraniGapStatistic
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import numpy as np

# Your data
X = np.random.randn(200, 4)
X_scaled = StandardScaler().fit_transform(X)

# Any clustering algorithm
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_scaled)

# Evaluate with Sundar-Tibshirani Gap Statistic
gap_stat = SundarTibshiraniGapStatistic(
    pca_sampling=True,
    use_user_labels=True,  # Key parameter!
    return_params=True
)

gap_value, params = gap_stat.compute_gap_statistic(
    X=X_scaled,
    labels=labels,
    B=100  # Number of reference samples
)

print(f"Gap Statistic: {gap_value:.3f}")
print(f"Standard Error: {params['sim_sks']:.3f}")

Comparing Multiple Clustering Solutions

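from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sundar_gap_stat import SundarTibshiraniGapStatistic

# Initialize Gap Statistic; return_params defaults to False, so only the gap value is returned
# X_scaled is the standardized data from the Quick Start
gap_stat = SundarTibshiraniGapStatistic(use_user_labels=True)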

# Compare different algorithms
algorithms = {
    'K-Means': KMeans(n_clusters=3, n_init=15, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
    'GMM': GaussianMixture(n_components=3, random_state=42)
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    print(f"{name}: Gap = {gap:.3f}")

Finding Optimal k with Any Algorithm

# Use hierarchical clustering to find optimal k
# (reuses gap_stat and X_scaled from the examples above)
gaps = []
for k in range(2, 8):
    agg = AgglomerativeClustering(n_clusters=k)
    labels = agg.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    gaps.append((k, gap))
    print(f"k={k}: Gap = {gap:.3f}")

# Select k using the elbow method or the standard Gap criterion
# (the standard criterion also needs the standard errors; see return_params=True)
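
The standard Gap criterion from Tibshirani et al. (2001) picks the smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}. The following is a minimal sketch of that rule, assuming the gap values and adjusted standard errors have already been collected for consecutive k (for example from the sim_sks entry when return_params=True). The helper name select_k_gap_criterion and the numbers are hypothetical and not part of the package.

```python
def select_k_gap_criterion(ks, gaps, sks):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.

    Hypothetical helper, not part of sundar-gap-stat. If no k satisfies
    the rule, fall back to the k with the largest gap.
    """
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - sks[i + 1]:
            return ks[i]
    return ks[max(range(len(ks)), key=lambda i: gaps[i])]


# Illustrative numbers only, not results produced by the package
ks = [2, 3, 4, 5]
gaps = [0.10, 0.35, 0.36, 0.30]
sks = [0.02, 0.02, 0.03, 0.02]
best_k = select_k_gap_criterion(ks, gaps, sks)
print(f"Selected k = {best_k}")
```

The one-standard-error slack favors the smaller k when a larger k improves the gap only marginally.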

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| distance_metric | str or callable | 'euclidean' | Distance metric for dispersion calculation |
| pca_sampling | bool | True | Use PCA-based reference distribution (recommended) |
| standardize_within_pca | bool | False | Standardize before PCA |
| use_user_labels | bool | True | True = Sundar-Tibshirani extension; False = original Tibshirani |
| return_params | bool | False | Return additional diagnostics |
| n_init | int | 12 | K-means initializations (only if use_user_labels=False) |
| random_state | int | 7142 | Random seed for reproducibility |
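
To illustrate what a PCA-based reference distribution can look like, here is a minimal NumPy sketch of the scheme described by Tibshirani et al. (2001): draw points uniformly over a box aligned with the principal axes of the data, then rotate them back. This is an assumption about what pca_sampling=True does conceptually; the package's actual implementation may differ, and the function name pca_reference_sample is hypothetical.

```python
import numpy as np


def pca_reference_sample(X, rng):
    """Draw one reference data set, uniform over the box aligned with
    the principal axes of X (hypothetical helper, not the package API)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                       # data expressed in the PCA basis
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    ref_scores = rng.uniform(lo, hi, size=scores.shape)
    return ref_scores @ Vt + mean            # rotate back to the original space


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X_ref = pca_reference_sample(X, rng)
print(X_ref.shape)
```

Aligning the box with the principal axes keeps the reference data from filling empty corners when features are correlated, which is the usual reason this variant is preferred over a plain bounding box.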

Returned Parameters

When return_params=True, the method returns a tuple (gap, params) where params contains:

  • Wk: Observed within-cluster dispersion
  • sim_Wks: Array of simulated Wk values from reference distributions
  • sim_sks: Adjusted standard error (sqrt(1 + 1/B) * SD)
  • gap: Gap statistic value
  • sd_k: Standard deviation of log(sim_Wks)
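
The relationship between these entries can be sketched in a few lines. This assumes sim_Wks holds raw (not log-transformed) dispersions and that the standard deviation is the population form used by Tibshirani et al. (2001); the package may compute them differently, and gap_diagnostics is a hypothetical helper.

```python
import math
import statistics


def gap_diagnostics(Wk, sim_Wks):
    """Recompute gap, sd_k and sim_sks from the observed and simulated
    dispersions (hypothetical helper, not part of sundar-gap-stat)."""
    B = len(sim_Wks)
    log_ref = [math.log(w) for w in sim_Wks]
    sd_k = statistics.pstdev(log_ref)             # SD of log(sim_Wks), assumed population SD
    sim_sks = math.sqrt(1.0 + 1.0 / B) * sd_k     # adjusted standard error
    gap = statistics.fmean(log_ref) - math.log(Wk)
    return {"gap": gap, "sd_k": sd_k, "sim_sks": sim_sks}


# Illustrative numbers only
diag = gap_diagnostics(Wk=50.0, sim_Wks=[100.0, 100.0, 100.0, 100.0])
print(diag)
```

The factor sqrt(1 + 1/B) accounts for the Monte Carlo error of estimating the reference expectation from B samples.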

Mathematical Foundation

The Sundar-Tibshirani Gap Statistic is defined as:

$$\text{Gap}_{\text{ST}}(k) = E^*[\log(W_k^*(\mathbf{L}))] - \log(W_k(\mathbf{L}))$$

where:

  • $\mathbf{L}$ = fixed cluster labels from any source
  • $W_k^*(\mathbf{L})$ = within-cluster dispersion applying labels $\mathbf{L}$ to reference data
  • $E^*$ = expectation over reference distributions

This differs from the original Gap Statistic, which re-clusters reference data with k-means.
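
The definition above can be sketched directly in NumPy. This is an illustration under simplifying assumptions, not the package's implementation: W_k is taken as the sum of squared Euclidean distances to each cluster centroid, the reference data are uniform over the bounding box rather than PCA-based, and the function names are hypothetical.

```python
import numpy as np


def within_dispersion(X, labels):
    """Sum over clusters of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def fixed_label_gap(X, labels, B=50, seed=0):
    """Gap value with the same labels applied to every reference sample."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_ref = np.empty(B)
    for b in range(B):
        X_ref = rng.uniform(lo, hi, size=X.shape)   # no re-clustering of X_ref
        log_ref[b] = np.log(within_dispersion(X_ref, labels))
    return log_ref.mean() - np.log(within_dispersion(X, labels))


# Two well-separated synthetic clusters with their true labels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
labels = np.repeat([0, 1], 50)
gap = fixed_label_gap(X, labels)
print(f"Gap: {gap:.3f}")
```

Because the labels carry no information about where the reference points fall, the reference dispersion stays large, so a labeling that matches real structure in the observed data yields a positive gap.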

Simulation Study Results

Comprehensive simulations demonstrate:

  1. Correct k detection across well-separated, overlapping, and elongated cluster structures
  2. Algorithm invariance: Consistent evaluation across K-Means, Agglomerative, and GMM
  3. Noise robustness: Maintains stability under moderate noise better than Silhouette
  4. Sample requirements: Reliable estimates with n >= 100 observations
  5. Monte Carlo convergence: B = 100 provides excellent precision

Citation

If you use this package in your research, please cite the software and the original Gap Statistic paper:

@software{balakrishnan_sundar_gap_stat_2026,
  author = {Balakrishnan, P. V. Sundar},
  title  = {Sundar-Tibshirani Gap Statistic (sundar-gap-stat)},
  year   = {2026},
  url    = {https://github.com/pvsundar/sundargap_statistic}
}

References

  • Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.

License

MIT License; see LICENSE.

Author

P. V. Sundar Balakrishnan
University of Washington Bothell
School of Business
sundar@uw.edu

