An upper bound of the Average Silhouette Width
Project description
Silhouette Upper Bound
An upper bound of the Average Silhouette Width.
Figure 1: Kmeans clustering applied to a synthetic dataset. Code available
here.
Overview
Evaluating clustering quality is a fundamental task in cluster analysis, and the Average Silhouette Width (ASW) is one of the most widely used metrics for this purpose. ASW scores range from $-1$ to $1$, where:
-
Values near 1 indicate well-separated, compact clusters
-
Values around 0 suggest overlapping or ambiguous cluster assignments
-
Values near -1 imply that many points may have been misassigned
Optimizing the Silhouette score is a common objective in clustering workflows. However, since we rarely know the true global ASW-maximum achievable for a dataset, it's difficult to assess how close a given clustering result is to being globally optimal. Simply comparing to the theoretical maximum of 1 is often misleading, as the structure of the dataset imposes inherent limits on what is achievable.
This project introduces a data-dependent upper bound on the ASW that hopefully can provide a more meaningful reference point than the fixed value of 1. The upper bound helps answer a key question: How close is my clustering result to the best possible outcome on this specific data?
To compute the upper bound, the method requires a dissimilarity matrix as input.
You can find more details in this arXiv preprint.
Installation
pip install silhouette-upper-bound
Examples
To help you get started, we provide example scripts demonstrating common use cases.
You can find these in the demos/ folder.
Quickstart
import numpy as np
from silhouette_upper_bound import upper_bound
if __name__ == '__main__':
np.random.seed(42)
# dummy data
A = np.random.rand(100, 100)
D = (A + A.T) / 2
np.fill_diagonal(D, 0)
# ASW upper bound
ub = upper_bound(D)
print(f"There is no clustering of the data points of D that has a higher Silhouette score than {ub}.")
Experimental results
We evaluate the performance of the upper bound using synthetic datasets generated with scikit-learn’s make_blobs() function. Each dataset is identified by a label of the form n_samples-n_features-centers-cluster_std, which corresponds to the parameters used in the data generation.
The code that generates the results below can be found in
experiments/.
| Dataset | KMeans ASW | ASW upper bound | Worst-case relative error |
|---|---|---|---|
| 400-64-5-6 | 0.249 | 0.376 | 0.38 |
| 400-64-2-2 | 0.673 | 0.673 | .00 |
| 400-128-7-3 | 0.522 | 0.566 | 0.08 |
| 10000-32-20-2 | 0.626 | 0.774 | 0.19 |
Note that the upper bound confirms global optimality for KMeans on dataset 400-64-2-2.
More comprehensive results on synthetic datasets are available in results/.
Contribution
Contributions are welcome! If you have suggestions for improvements, bug reports, or new features, feel free to open an issue or submit a pull request.
To contribute:
- Fork the repository.
- Create a new branch for your feature or fix.
- Submit a pull request.
Thank you for helping improve this project!
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file silhouette_upper_bound-0.1.2.tar.gz.
File metadata
- Download URL: silhouette_upper_bound-0.1.2.tar.gz
- Upload date:
- Size: 721.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f2fecd99c2b38b07e9c176adbb2a60a80419842edccb3f10243f1c75daa6ee06
|
|
| MD5 |
a16542f7ddcd2e5e80cca33ef5bc8ecd
|
|
| BLAKE2b-256 |
2e9081b5a423ae69b7bedd4c5d6906c91e037a69405cf42a61ccc20b87222a85
|
File details
Details for the file silhouette_upper_bound-0.1.2-py3-none-any.whl.
File metadata
- Download URL: silhouette_upper_bound-0.1.2-py3-none-any.whl
- Upload date:
- Size: 6.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e01c3315cb900a8e1f202b4a57054ef6704440330381b21c13fc7224dd4c8474
|
|
| MD5 |
eb55de18abe306ffd128f4396868d456
|
|
| BLAKE2b-256 |
3b74361f07195c652e801de816b4c1e816334b5d6286372e6b9b19385d6afbe7
|