Validate clustering results

validclust

Motivation

Clustering algorithms often require the analyst to specify the number of clusters in the data, a parameter commonly known as k. One way to choose an appropriate value is to cluster the data over a range of values for k, then evaluate the quality of each resulting clustering with a cluster validity index (CVI). The value of k that yields the best partitioning according to the CVI is then chosen. validclust automates this process, so the analyst can quickly identify a good candidate for k.
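The manual version of this process can be sketched with scikit-learn alone. This is only an illustration of the idea, not validclust's implementation: the choice of k-means, the silhouette score as the CVI, and the synthetic data are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative synthetic data (assumed values, not from the validclust docs).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Cluster the data once per candidate k and score each partitioning.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

# Pick the k with the best (highest) silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Repeating this loop for several clustering algorithms and several CVIs is the bookkeeping that validclust takes care of.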

Installation

You can get the stable version from PyPI:

pip install validclust

Or the development version from GitHub:

pip install git+https://github.com/crew102/validclust.git

Basic usage

1. Load libraries.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from validclust import ValidClust

2. Create some synthetic data. The data will be clustered around 4 centers.

data, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)

3. Use ValidClust to determine the optimal number of clusters. The code below partitions the data into 2 through 7 clusters using two different clustering algorithms, then calculates four CVIs (silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn) for each result.

vclust = ValidClust(
    k=list(range(2, 8)), 
    methods=['hierarchical', 'kmeans']
)
cvi_vals = vclust.fit_predict(data)
print(cvi_vals)
#>                                    2            3            4            5  \
#> method       index                                                            
#> hierarchical silhouette     0.645563     0.633970     0.747064     0.583724   
#>              calinski    1007.397799  1399.552836  3611.526187  2832.925655   
#>              davies         0.446861     0.567859     0.361996     1.025296   
#>              dunn           0.727255     0.475745     0.711415     0.109312   
#> kmeans       silhouette     0.645563     0.633970     0.747064     0.602562   
#>              calinski    1007.397799  1399.552836  3611.526187  2845.143428   
#>              davies         0.446861     0.567859     0.361996     0.988223   
#>              dunn           0.727255     0.475745     0.711415     0.115113   
#> 
#>                                    6            7  
#> method       index                                 
#> hierarchical silhouette     0.435456     0.289567  
#>              calinski    2371.222506  2055.323553  
#>              davies         1.509404     1.902413  
#>              dunn           0.109312     0.116557  
#> kmeans       silhouette     0.468945     0.334379  
#>              calinski    2389.531071  2096.945591  
#>              davies         1.431102     1.722117  
#>              dunn           0.098636     0.072423

It's hard to see what the optimal value of k is from the raw CVI values shown above. Not all of the CVIs are on a 0-1 scale, and for some indices (such as Davies-Bouldin) lower scores indicate better clusterings. ValidClust's plot() method addresses this by first normalizing the CVIs and then displaying the results in a heatmap.

vclust.plot()

For each row in the heatmap (i.e., for each clustering method/CVI pair), darker cells are associated with higher-quality clusterings. From this plot we can see that nearly every method/index pair points to 4 as the optimal value for k. The one exception is the Dunn index, which scores k = 2 marginally higher than k = 4 (0.727 vs. 0.711).
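To make the normalization idea concrete, here is a minimal sketch in plain Python using the hierarchical CVI values printed earlier. It assumes per-row min-max scaling, with the Davies-Bouldin row flipped so that higher always means better; the scaling that plot() applies internally may differ, and the helper name `normalize` is hypothetical.

```python
# Hierarchical CVI values for k = 2..7, copied from the printed table.
ks = [2, 3, 4, 5, 6, 7]
cvi = {
    "silhouette": [0.645563, 0.633970, 0.747064, 0.583724, 0.435456, 0.289567],
    "calinski": [1007.397799, 1399.552836, 3611.526187,
                 2832.925655, 2371.222506, 2055.323553],
    "davies": [0.446861, 0.567859, 0.361996, 1.025296, 1.509404, 1.902413],
    "dunn": [0.727255, 0.475745, 0.711415, 0.109312, 0.109312, 0.116557],
}

# Assumption for this sketch: only Davies-Bouldin is "lower is better".
LOWER_IS_BETTER = {"davies"}


def normalize(values, lower_is_better=False):
    """Min-max scale one row to [0, 1]; flip it if lower scores are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if lower_is_better else scaled


normalized = {
    name: normalize(vals, name in LOWER_IS_BETTER) for name, vals in cvi.items()
}

# After normalization, the best k for each index is where the row equals 1.0.
best_k = {name: ks[vals.index(max(vals))] for name, vals in normalized.items()}
print(best_k)
```

On these values, the silhouette, calinski, and davies rows peak at k = 4, while the dunn row scores k = 2 and k = 4 almost identically.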

Project details

Release history

0.1.1 (this version): May 27, 2021
0.1.0: Feb 1, 2019

Download files

Source Distribution

validclust-0.1.1.tar.gz (8.1 kB)

Uploaded May 27, 2021 (Source)

Built Distribution

validclust-0.1.1-py2.py3-none-any.whl (8.1 kB)

Uploaded May 27, 2021 (Python 2, Python 3)

File details

Details for the file validclust-0.1.1.tar.gz.

File metadata

Download URL: validclust-0.1.1.tar.gz
Upload date: May 27, 2021
Size: 8.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.2.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.8.2

File hashes

Hashes for validclust-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`71aef56caf2a8eecb15aff1c299b756ff1a1acca405672fbe8ee346e68d55e86`
MD5	`c5eb7f62b88aeac6dfb42d3ee5eef152`
BLAKE2b-256	`15633e5db7bdd159dbfee03745f385a53e6bca2b3691cf648059abe8efbe0cf1`

File details

Details for the file validclust-0.1.1-py2.py3-none-any.whl.

File metadata

Download URL: validclust-0.1.1-py2.py3-none-any.whl
Upload date: May 27, 2021
Size: 8.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/4.2.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.8.2

File hashes

Hashes for validclust-0.1.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`f4926952d289334f6f4ceaccec231f1d91e7c95073c15a631c0de620cff88b02`
MD5	`8e11c8b33336b7d59ca4627176edf897`
BLAKE2b-256	`e993fabe0ee375a3293935f489180eda9baaa22be75bf924f266e094cfafa2b2`
