validclust
Validate clustering results
Motivation
Clustering algorithms often require that the analyst specify the number of clusters that exist in the data, a parameter commonly known as k. One approach to determining an appropriate value for k is to cluster the data using a range of values for k, then evaluate the quality of the resulting clusterings using a cluster validity index (CVI). The value of k that results in the best partitioning of the data according to the CVI is then chosen. validclust handles this process for the analyst, making it very easy to quickly determine an optimal value for k.
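To make the procedure concrete, here is a minimal sketch of the manual version of this loop, written directly against scikit-learn rather than validclust. The choice of k-means as the algorithm, the silhouette score as the CVI, and the synthetic data are assumptions made for illustration only; this is not necessarily how validclust works internally.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, used here only to have something to cluster
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

# Cluster the data once per candidate k and record one CVI per clustering
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

# Pick the k whose clustering scored best under this CVI
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Repeating this by hand for several algorithms and several CVIs quickly becomes tedious, which is the bookkeeping the package is meant to take over.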
Installation
You can get the stable version from PyPI:
pip install validclust
Or the development version from GitHub:
pip install git+https://github.com/crew102/validclust.git
Basic usage
1. Load libraries.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from validclust import ValidClust
2. Create some synthetic data. The data will be clustered around 4 centers.
data, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)
3. Use ValidClust to determine the optimal number of clusters. The code below will partition the data into 2-7 clusters using two different clustering algorithms, then calculate various CVIs across the results.
vclust = ValidClust(
    k=list(range(2, 8)),
    methods=['hierarchical', 'kmeans']
)
cvi_vals = vclust.fit_predict(data)
print(cvi_vals)
#>                                    2            3            4            5  \
#> method       index
#> hierarchical silhouette     0.645563     0.633970     0.747064     0.583724
#>              calinski    1007.397799  1399.552836  3611.526187  2832.925655
#>              davies         0.446861     0.567859     0.361996     1.025296
#>              dunn           0.727255     0.475745     0.711415     0.109312
#> kmeans       silhouette     0.645563     0.633970     0.747064     0.602562
#>              calinski    1007.397799  1399.552836  3611.526187  2845.143428
#>              davies         0.446861     0.567859     0.361996     0.988223
#>              dunn           0.727255     0.475745     0.711415     0.115113
#>
#>                                    6            7
#> method       index
#> hierarchical silhouette     0.435456     0.289567
#>              calinski    2371.222506  2055.323553
#>              davies         1.509404     1.902413
#>              dunn           0.109312     0.116557
#> kmeans       silhouette     0.468945     0.334379
#>              calinski    2389.531071  2096.945591
#>              davies         1.431102     1.722117
#>              dunn           0.098636     0.072423
It's hard to see what the optimal value of k is from the raw CVI values shown above. Not all of the CVIs are on a 0-1 scale, and lower scores are actually associated with better clusterings for some of the indices. ValidClust's plot() method solves this problem by first normalizing the CVIs and then displaying the results in a heatmap.
vclust.plot()
For each row in the above grid (i.e., for each clustering method/CVI pair), darker cells are associated with higher-quality clusterings. From this plot we can see that each method/index pair seems to point to 4 as an optimal value for k.
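The idea behind the normalization step can be sketched with pandas. The example below is an illustration under stated assumptions, not validclust's actual implementation: it assumes row-wise min-max scaling, and it assumes that the Davies-Bouldin index ("davies") is the one where lower scores are better, so that row is negated before scaling. The raw values are copied from the kmeans rows of the output above for k = 2 to 5.

```python
import pandas as pd

# Raw CVI values for kmeans, k = 2..5, copied from the printed output
raw = pd.DataFrame(
    {
        2: [0.645563, 1007.397799, 0.446861, 0.727255],
        3: [0.633970, 1399.552836, 0.567859, 0.475745],
        4: [0.747064, 3611.526187, 0.361996, 0.711415],
        5: [0.602562, 2845.143428, 0.988223, 0.115113],
    },
    index=["silhouette", "calinski", "davies", "dunn"],
)

# Assumption: lower Davies-Bouldin scores mean better clusterings,
# so negate that row to make "higher is better" hold for every index
oriented = raw.copy()
oriented.loc["davies"] = -oriented.loc["davies"]

# Min-max scale each row to [0, 1]; within a row, 1 marks the best k
row_min = oriented.min(axis=1)
row_max = oriented.max(axis=1)
normalized = oriented.sub(row_min, axis=0).div(row_max - row_min, axis=0)

print(normalized.round(3))
```

After this rescaling every row shares the same 0-1 range and the same direction, so the rows can be compared side by side in a single heatmap.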