validclust

Validate clustering results

Motivation

Clustering algorithms often require the analyst to specify the number of clusters in the data, a parameter commonly known as k. One way to choose k is to cluster the data over a range of candidate values, then score each resulting partition with a cluster validity index (CVI). The value of k that produces the best partition according to the CVI is then chosen. validclust automates this process, so a suitable value of k can be found in a few lines of code.
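
To make the procedure concrete, here is a minimal sketch of the same loop written by hand with plain scikit-learn. It is not validclust's own code: it assumes k-means as the only clustering method and the silhouette score as the only CVI, and the variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn around 4 centers (illustrative only)
data, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

scores = {}
for k in range(2, 8):
    # Cluster the data into k groups
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    scores[k] = silhouette_score(data, labels)

# Keep the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Doing this by hand for several clustering methods and several indices quickly becomes repetitive, which is the bookkeeping validclust takes over.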

Installation

You can get the stable version from PyPI:

pip install validclust

Or the development version from GitHub:

pip install git+https://github.com/crew102/validclust.git

Basic usage

1. Load libraries.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from validclust.validclust import ValidClust

2. Create some synthetic data. The data will be clustered around 4 centers.

data, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

3. Use ValidClust to determine the optimal number of clusters. The code below will partition the data into 2-7 clusters using two different clustering algorithms, then calculate various CVIs across the results.

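vclust = ValidClust(
    k=list(range(2, 8)),  # candidate cluster counts: 2 through 7
    methods=['hierarchical', 'kmeans']
)
# Rows are (method, index) pairs; columns are the values of k
cvi_vals = vclust.fit_predict(data)
print(cvi_vals)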
#>                                    2            3            4            5  \
#> method       index                                                            
#> hierarchical silhouette     0.645563     0.633970     0.747064     0.583724   
#>              calinski    1007.397799  1399.552836  3611.526187  2832.925655   
#>              davies         0.446861     0.567859     0.361996     1.025296   
#>              dunn           0.727255     0.475745     0.711415     0.109312   
#> kmeans       silhouette     0.645563     0.633970     0.747064     0.602562   
#>              calinski    1007.397799  1399.552836  3611.526187  2845.143428   
#>              davies         0.446861     0.567859     0.361996     0.988223   
#>              dunn           0.727255     0.475745     0.711415     0.115113   
#> 
#>                                    6            7  
#> method       index                                 
#> hierarchical silhouette     0.435456     0.289567  
#>              calinski    2371.222506  2055.323553  
#>              davies         1.509404     1.902413  
#>              dunn           0.109312     0.116557  
#> kmeans       silhouette     0.468945     0.334379  
#>              calinski    2389.531071  2096.945591  
#>              davies         1.431102     1.722117  
#>              dunn           0.098636     0.072423  

It's hard to read the optimal value of k off the raw CVI values shown above. The indices are on different scales (the Calinski-Harabasz values run into the thousands), and for some indices, such as Davies-Bouldin, lower scores indicate better clusterings. ValidClust's plot() method addresses this by first normalizing the CVIs and then displaying the results in a heatmap.

vclust.plot()

For each row in the above grid (i.e., for each clustering method/CVI pair), darker cells are associated with higher-quality clusterings. From this plot we can see that most method/index pairs point to 4 as the optimal value for k. Going by the raw values, the Dunn index scores k = 2 marginally higher (0.727 vs. 0.711), with k = 4 a close second.
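
The internals of plot() are not shown here, so the following is only a rough sketch of the idea, not validclust's actual normalization. It assumes min-max scaling of each row to the 0-1 range and assumes Davies-Bouldin is the only index where lower is better. The values are the kmeans rows copied from the output above; the names normalize_row and LOWER_IS_BETTER are hypothetical.

```python
# kmeans CVI values copied from the fit_predict output above (k = 2..7)
ks = [2, 3, 4, 5, 6, 7]
kmeans_cvis = {
    "silhouette": [0.645563, 0.633970, 0.747064, 0.602562, 0.468945, 0.334379],
    "calinski": [1007.397799, 1399.552836, 3611.526187,
                 2845.143428, 2389.531071, 2096.945591],
    "davies": [0.446861, 0.567859, 0.361996, 0.988223, 1.431102, 1.722117],
    "dunn": [0.727255, 0.475745, 0.711415, 0.115113, 0.098636, 0.072423],
}

# Assumption: Davies-Bouldin is the only index here where lower is better
LOWER_IS_BETTER = {"davies"}


def normalize_row(name, values):
    """Min-max scale one row to [0, 1], flipping it if lower is better."""
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    if name in LOWER_IS_BETTER:
        scaled = [1.0 - s for s in scaled]
    return scaled


normalized = {name: normalize_row(name, vals) for name, vals in kmeans_cvis.items()}

# For each index, report the k whose normalized score is highest
best_k = {name: ks[row.index(max(row))] for name, row in normalized.items()}
print(best_k)
# → {'silhouette': 4, 'calinski': 4, 'davies': 4, 'dunn': 2}
```

On these values, silhouette, calinski, and davies all select k = 4, while dunn marginally prefers k = 2 with k = 4 close behind.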
