Sundar-Tibshirani Gap Statistic for evaluating any cluster solution

Sundar-Tibshirani Gap Statistic

A generalized Gap Statistic implementation for evaluating any cluster solution, not just k-means.

Overview

The original Gap Statistic (Tibshirani, Walther, & Hastie, 2001) is a popular method for determining the optimal number of clusters. However, in its usual form it re-clusters every reference data set (typically with k-means) as part of the procedure, so it cannot directly score a fixed set of labels obtained elsewhere.

The Sundar-Tibshirani Gap Statistic extends this framework to evaluate arbitrary cluster solutions from:

Hierarchical clustering
DBSCAN and density-based methods
Gaussian Mixture Models
Spectral clustering
Expert-defined segments
Any other clustering approach

Key Innovation

Original Gap Statistic	Sundar-Tibshirani Gap Statistic
Clusters reference data with k-means	Applies user-provided labels to reference data
Algorithm-specific	Algorithm-agnostic
Evaluates k-means solutions only	Evaluates any cluster assignment

Installation

pip install sundar-gap-stat

Or install from source:

git clone https://github.com/pvsundar/sundargap_statistic.git
cd sundargap_statistic
pip install -e .

If you also want test dependencies:

pip install -e ".[test]"

Quick Start

from sundar_gap_stat import SundarTibshiraniGapStatistic
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import numpy as np

# Your data
X = np.random.randn(200, 4)
X_scaled = StandardScaler().fit_transform(X)

# Any clustering algorithm
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_scaled)

# Evaluate with Sundar-Tibshirani Gap Statistic
gap_stat = SundarTibshiraniGapStatistic(
    pca_sampling=True,
    use_user_labels=True,  # Key parameter!
    return_params=True
)

gap_value, params = gap_stat.compute_gap_statistic(
    X=X_scaled,
    labels=labels,
    B=100  # Number of reference samples
)

print(f"Gap Statistic: {gap_value:.3f}")
print(f"Standard Error: {params['sim_sks']:.3f}")

Comparing Multiple Clustering Solutions

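from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sundar_gap_stat import SundarTibshiraniGapStatistic

# Initialize Gap Statistic (return_params defaults to False, so only the gap value is returned)
# X_scaled is the standardized data from Quick Start
gap_stat = SundarTibshiraniGapStatistic(use_user_labels=True)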

# Compare different algorithms
algorithms = {
    'K-Means': KMeans(n_clusters=3, n_init=15, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
    'GMM': GaussianMixture(n_components=3, random_state=42)
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    print(f"{name}: Gap = {gap:.3f}")

Finding Optimal k with Any Algorithm

# Use hierarchical clustering to find optimal k (reuses gap_stat and X_scaled from the examples above)
gaps = []
for k in range(2, 8):
    agg = AgglomerativeClustering(n_clusters=k)
    labels = agg.fit_predict(X_scaled)
    gap = gap_stat.compute_gap_statistic(X_scaled, labels, B=100)
    gaps.append((k, gap))
    print(f"k={k}: Gap = {gap:.3f}")

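# Select k from `gaps` using the elbow method or the standard Gap criterion
# (the standard criterion also needs the standard errors, available with return_params=True)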
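
The standard Gap criterion from Tibshirani et al. (2001) chooses the smallest k such that Gap(k) >= Gap(k+1) - s_(k+1). The following is a minimal sketch of that rule, assuming you have already collected gap values and adjusted standard errors for consecutive k (for example from `params['sim_sks']` with `return_params=True`). The helper `select_k_gap_criterion` and the numbers are hypothetical and not part of the package.

```python
def select_k_gap_criterion(ks, gaps, s_ks):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_(k+1).

    ks, gaps, s_ks are equal-length sequences ordered by increasing k.
    """
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - s_ks[i + 1]:
            return ks[i]
    # No k met the criterion: fall back to the largest gap
    # (a design choice of this sketch, not part of the original rule)
    best = max(range(len(ks)), key=lambda i: gaps[i])
    return ks[best]


# Illustrative numbers only, not results produced by the package
ks = [2, 3, 4, 5]
gaps = [0.30, 0.55, 0.52, 0.50]
s_ks = [0.02, 0.02, 0.02, 0.02]
chosen_k = select_k_gap_criterion(ks, gaps, s_ks)
print(f"Selected k = {chosen_k}")
```

The fallback to the maximum gap is only one reasonable choice; widening the range of k is another option when no value satisfies the rule.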

Parameters

Parameter	Type	Default	Description
`distance_metric`	str or callable	'euclidean'	Distance metric for dispersion calculation
`pca_sampling`	bool	True	Use PCA-based reference distribution (recommended)
`standardize_within_pca`	bool	False	Standardize before PCA
`use_user_labels`	bool	True	True = Sundar-Tibshirani extension; False = original Tibshirani
`return_params`	bool	False	Return additional diagnostics
`n_init`	int	12	K-means initializations (only if `use_user_labels=False`)
`random_state`	int	7142	Random seed for reproducibility
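
To illustrate what a PCA-based reference distribution means, here is a minimal NumPy sketch of the approach described by Tibshirani et al. (2001): rotate the centered data onto its principal axes, sample uniformly over the range of each rotated coordinate, and rotate back. The function `pca_reference_sample` is hypothetical; the package's internal sampling (including the effect of `standardize_within_pca`) may differ.

```python
import numpy as np


def pca_reference_sample(X, rng):
    """Draw one reference data set, uniform over the box aligned with X's principal axes."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_rot = Xc @ Vt.T
    # Sample each rotated coordinate uniformly over its observed range
    Z_rot = rng.uniform(X_rot.min(axis=0), X_rot.max(axis=0), size=X_rot.shape)
    # Rotate back to the original axes and restore the mean
    return Z_rot @ Vt + mean


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Z = pca_reference_sample(X, rng)
print(Z.shape)
```

Sampling in the rotated frame keeps the reference box aligned with the shape of the data, which is why it is usually preferred over a plain axis-aligned bounding box.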

Returned Parameters

When return_params=True, the method returns a tuple (gap, params) where params contains:

Wk: Observed within-cluster dispersion
sim_Wks: Array of simulated Wk values from reference distributions
sim_sks: Adjusted standard error, sqrt(1 + 1/B) * sd_k
gap: Gap statistic value
sd_k: Standard deviation of log(sim_Wks)
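
The relationships among these quantities can be sketched with NumPy. This is an illustration under assumptions, not the package's code: the helper `gap_diagnostics` is hypothetical, the input values are made up, and the use of the population standard deviation (ddof=0) follows Tibshirani et al. (2001) rather than anything confirmed about the package.

```python
import numpy as np


def gap_diagnostics(Wk, sim_Wks):
    """Compute gap, sd_k, and the adjusted standard error from dispersions."""
    sim_Wks = np.asarray(sim_Wks, dtype=float)
    B = sim_Wks.size
    log_sim = np.log(sim_Wks)
    # Gap = mean of log reference dispersions minus log observed dispersion
    gap = log_sim.mean() - np.log(Wk)
    # Population SD (ddof=0) of the log reference dispersions (assumed convention)
    sd_k = log_sim.std()
    # Adjustment for Monte Carlo error from using B reference samples
    s_k = np.sqrt(1.0 + 1.0 / B) * sd_k
    return {"gap": gap, "sd_k": sd_k, "sim_sks": s_k}


# Illustrative values only
out = gap_diagnostics(Wk=50.0, sim_Wks=[80.0, 90.0, 100.0, 110.0])
print(out)
```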

Mathematical Foundation

The Sundar-Tibshirani Gap Statistic is defined as:

$$\text{Gap}_{\text{ST}}(k) = E^*[\log(W_k^*(\mathbf{L}))] - \log(W_k(\mathbf{L}))$$

where:

$\mathbf{L}$ = fixed cluster labels from any source
$W_k(\mathbf{L})$ = within-cluster dispersion of the observed data under labels $\mathbf{L}$
$W_k^*(\mathbf{L})$ = within-cluster dispersion applying labels $\mathbf{L}$ to reference data
$E^*$ = expectation over reference distributions

This differs from the original Gap Statistic, which re-clusters reference data with k-means.
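
The definition above can be sketched in a few lines of NumPy. This is a simplified illustration, not the package's implementation: it assumes squared Euclidean dispersion around cluster centroids and a uniform reference distribution over the bounding box of the data (rather than PCA sampling), and the functions `within_dispersion` and `fixed_label_gap` are hypothetical names.

```python
import numpy as np


def within_dispersion(X, labels):
    """W_k: sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def fixed_label_gap(X, labels, B=100, seed=0):
    """Gap with labels held fixed: reference data are never re-clustered."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_Wk = np.log(within_dispersion(X, labels))
    log_ref = np.empty(B)
    for b in range(B):
        # Uniform reference data over the bounding box of X (simplifying assumption)
        Z = rng.uniform(lo, hi, size=X.shape)
        # Row i of Z inherits the label of row i of X
        log_ref[b] = np.log(within_dispersion(Z, labels))
    return log_ref.mean() - log_Wk


# Two well-separated synthetic blobs with their true labels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = np.repeat([0, 1], 50)
gap = fixed_label_gap(X, labels, B=50)
print(f"Gap_ST sketch: {gap:.3f}")
```

Because the labels carry no information about the reference points, the reference dispersion stays large, so labels that capture real structure in the observed data produce a large positive gap.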

Simulation Study Results

Comprehensive simulations demonstrate:

Correct k detection across well-separated, overlapping, and elongated cluster structures
Algorithm invariance: Consistent evaluation across K-Means, Agglomerative, and GMM
Noise robustness: Maintains stability under moderate noise better than Silhouette
Sample requirements: Reliable estimates with n >= 100 observations
Monte Carlo convergence: B = 100 provides excellent precision

Citation

If you use this package in your research, please cite the software and the original Gap Statistic paper:

@software{balakrishnan_sundar_gap_stat_2026,
  author = {Balakrishnan, P. V. Sundar},
  title  = {Sundar-Tibshirani Gap Statistic (sundar-gap-stat)},
  year   = {2026},
  url    = {https://github.com/pvsundar/sundargap_statistic}
}

References

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.

License

MIT License; see LICENSE.

Author

P. V. Sundar Balakrishnan
University of Washington Bothell
School of Business
sundar@uw.edu

Release history

2.0.0 (this version): Jan 11, 2026
1.2: Sep 15, 2024

Download files

Source Distribution

sundar_gap_stat-2.0.0.tar.gz (14.3 kB)

Uploaded Jan 11, 2026 Source

Built Distribution

sundar_gap_stat-2.0.0-py3-none-any.whl (11.0 kB)

Uploaded Jan 11, 2026 Python 3

File details

Details for the file sundar_gap_stat-2.0.0.tar.gz.

File metadata

Download URL: sundar_gap_stat-2.0.0.tar.gz
Upload date: Jan 11, 2026
Size: 14.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sundar_gap_stat-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`ab8e903228ce0d91b6f8aa3c34141d528030e5653e1d648acf1f50737588aee0`
MD5	`1358928b4754edc4a7d5d6cf09004a03`
BLAKE2b-256	`02c8a1da18409dfaf82f98af651e2f9141b74789c8e2593138db94c6e447d942`

File details

Details for the file sundar_gap_stat-2.0.0-py3-none-any.whl.

File metadata

Download URL: sundar_gap_stat-2.0.0-py3-none-any.whl
Upload date: Jan 11, 2026
Size: 11.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sundar_gap_stat-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`54714cce31ad697f7ef461357c99115fe9e535b0373c96b7d4026628d6afce12`
MD5	`7f710c0aaac6d9fdbce36d87bde25f5c`
BLAKE2b-256	`d866374104bfee7a7ff503079c3b9ec44394db2b13567c35fdf8df8c0e72cf7d`
