Skip to main content

An upper bound of the Average Silhouette Width

Project description

Tests

Silhouette Upper Bound

An upper bound of the Average Silhouette Width.

Silhouette Samples Figure 1: Kmeans clustering applied to a synthetic dataset. Code available here.

ASW vs K Figure 2: ASW for varying K. Code available here.

Overview

Evaluating clustering quality is a fundamental task in cluster analysis, and the Average Silhouette Width (ASW) is one of the most widely used metrics for this purpose. ASW scores range from $-1$ to $1$, where:

  • Values near 1 indicate well-separated, compact clusters

  • Values around 0 suggest overlapping or ambiguous cluster assignments

  • Values near -1 imply that many points may have been misassigned

Optimizing the Silhouette score is a common objective in clustering workflows. However, since we rarely know the true global ASW-maximum achievable for a dataset, it's difficult to assess how close a given clustering result is to being globally optimal. Simply comparing to the theoretical maximum of 1 is often misleading, as the structure of the dataset imposes inherent limits on what is achievable.

This project introduces a data-dependent upper bound on the ASW that hopefully can provide a more meaningful reference point than the fixed value of 1. The upper bound helps answer a key question: How close is my clustering result to the best possible outcome on this specific data?

To compute the upper bound, the method requires a dissimilarity matrix as input.

You can find more details in this arXiv preprint.

Use Cases

The proposed data-dependent upper bound on the Average Silhouette Width (ASW) opens up opportunities for both research and industry applications.

Research (Academic Endeavors)

Confirming global optimality: Certify that an empirically obtained clustering is within $\epsilon$ of the true ASW maximum.

Sharper comparisons across algorithms: Interpret algorithm performance relative to dataset-specific ceilings, rather than the generic $[-1,1]$ scale.

Industry (Practical Applications)

Early stopping in optimization loops: Halt ASW-based searches once solutions are provably close to optimal, saving time and resources.

Proxy for clusterability: A low bound indicates limited potential for meaningful clusters, guiding analysts before heavy computation.

Outlier detection: Pointwise upper bounds flag observations that cannot fit well into any cluster.

Constraint-aware clustering: Incorporate application constraints (e.g. minimal cluster size) via restricted bounds.

Extension to other clustering quality measures

In practical scenarios where clusters are imbalanced and small groups also matter, the so called macro-averaged silhouette (see this article) might be more favorable compared to the standard ASW. The macro-silhouette averages first at the cluster level and then across clusters.

In this project we implement an upper bound of this silhouette variant in a solution space that is constrained by fixed cluster sizes.

Installation

pip install silhouette-upper-bound

Examples

To help you get started, we provide example scripts demonstrating common use cases. You can find these in the demos/ folder.

Quickstart

import numpy as np
from silhouette_upper_bound import upper_bound

if __name__ == '__main__':

    np.random.seed(42)

    # dummy data
    A = np.random.rand(100, 100)
    D = (A + A.T) / 2
    np.fill_diagonal(D, 0)

    # ASW upper bound
    ub = upper_bound(D)

    print(f"There is no clustering of the data points of D that has a higher Silhouette score than {ub}.")

Experimental results

We evaluate the performance of the upper bound using synthetic datasets generated with scikit-learn’s make_blobs() function. Each dataset is identified by a label of the form n_samples-n_features-centers-cluster_std, which corresponds to the parameters used in the data generation.

The code that generates the results below can be found in experiments/.

Dataset KMeans ASW ASW upper bound Worst-case relative error
400-64-5-6 0.249 0.376 0.38
400-64-2-2 0.673 0.673 .00
400-128-7-3 0.522 0.566 0.08
1000-161-2-13 0.084 0.182 0.54

Note that the upper bound confirms global optimality for KMeans on dataset 400-64-2-2.

More comprehensive results on synthetic datasets are available in results/.

Contribution

Contributions are welcome! If you have suggestions for improvements, bug reports, or new features, feel free to open an issue or submit a pull request.

To contribute:

  1. Fork the repository.
  2. Create a new branch for your feature or fix.
  3. Submit a pull request.

Thank you for helping improve this project!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

silhouette_upper_bound-0.1.1.tar.gz (259.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

silhouette_upper_bound-0.1.1-py3-none-any.whl (6.8 kB view details)

Uploaded Python 3

File details

Details for the file silhouette_upper_bound-0.1.1.tar.gz.

File metadata

  • Download URL: silhouette_upper_bound-0.1.1.tar.gz
  • Upload date:
  • Size: 259.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.11

File hashes

Hashes for silhouette_upper_bound-0.1.1.tar.gz
Algorithm Hash digest
SHA256 a8e0a5bbbfbb6fcc3305c0b604fb89982a47cf82f713ef9c24d9617aba26fe5d
MD5 074bdcb142fffa25c6dce2ae3c1f0aad
BLAKE2b-256 e0694390892ee697fa0a5302ac3f9fb5c7e73eb953b40da793dbfc7158f91903

See more details on using hashes here.

File details

Details for the file silhouette_upper_bound-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for silhouette_upper_bound-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8b7e7b226d544215f8657bc0178771fae040695094ab18e979f4861289e8111f
MD5 fe25f4d0e1a3f0080928dd7a5ced25bb
BLAKE2b-256 cb2dad7b5dd84a5dc5bb5c8923c653eb8530c477433f9097d847ea63218a94c8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page