clusterval
For validating clustering results
Motivation
This package was made to facilitate the process of clustering a dataset. The user only needs to specify the dataset, or the pairwise distances, in the form of a list-like structure; clustering is then performed and the results are evaluated through CVIs (Clustering Validation Indices). clusterval outputs the best partition of the data, a dendrogram, and k, the number of clusters. The available clustering algorithms are: 'single', 'complete', 'ward', 'centroid', 'average' and 'kmeans'.
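As a quick illustration of the pairwise-distance input mentioned above, the sketch below computes condensed Euclidean distances with SciPy and hands them to evaluate(). Passing the distances straight to evaluate() is an assumption based on the description, not a documented call, so check the project documentation for the exact usage.
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris
from clusterval import Clusterval

data = load_iris()['data']

# Condensed pairwise Euclidean distances (a flat, list-like structure).
distances = pdist(data, metric='euclidean')

# Assumption: evaluate() accepts the precomputed distances directly, as the
# description above suggests; check the project documentation to confirm.
c = Clusterval()
c.evaluate(distances)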
Installation
You can get the stable version from PyPI:
pip install clusterval
Or the development version from GitHub:
pip install git+https://github.com/Nuno09/clusterval.git
Basic usage
1. Load libraries.
from clusterval import Clusterval
from sklearn.datasets import load_iris, make_blobs
2. Let's use the iris dataset.
data = load_iris()['data']
3. Use clusterval to determine the optimal number of clusters. The code below creates a Clusterval object that partitions the input dataset into 2-8 clusters using hierarchical agglomerative clustering with the Ward criterion, then calculates various CVIs across the results.
c = Clusterval()
c.evaluate(data)
Output:
Clusterval(min_k=2, max_k=8, algorithm=ward, bootstrap_samples=250, index=['all'])
final_k = 2
4. If the user wants more information on the resulting clustering, just type the command below.
print(c.long_info)
Long output:
* Minimum number of clusters to test: 2
* Maximum number of clusters to test: 8
* Number of bootstrap samples generated: 250
* Clustering algorithm used: ward
* Validation Indices calculated: ['AR', 'FM', 'J', 'AW', 'VD', 'H', 'F', 'VI', 'K', 'Phi', 'RT', 'SS', 'CVNN', 'XB', 'SDbw', 'DB', 'S', 'SD', 'PBM', 'Dunn']
* Among all indices:
* According to the majority rule, the best number of clusters is 2
* 15 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 1 proposed 8 as the best number of clusters
***** Conclusion *****
AR FM J AW VD H F VI K Phi RT SS CVNN XB SDbw DB S SD PBM \
2 0.999426 0.526738 0.356955 0.000331 0.385982 0.000283 0.526045 1.312383 0.527433 2.974318e-06 0.333611 0.217291 1.000000 0.261539 0.752460 0.191376 0.722234 10.714331 20.485220
3 1.000172 0.379244 0.233104 -0.000791 0.525643 -0.000891 0.378001 1.972355 0.380492 -9.809481e-06 0.356957 0.131954 0.974122 0.414386 0.791627 0.388819 0.560392 10.764615 25.473608
4 1.000080 0.285500 0.166032 -0.000607 0.586732 -0.000677 0.284666 2.511133 0.286337 -8.672063e-06 0.418338 0.090562 1.233481 0.610347 0.798949 0.448981 0.458409 12.500304 19.111932
5 1.000070 0.263834 0.150461 -0.000815 0.591750 -0.000972 0.261470 2.676828 0.266223 -1.296032e-05 0.434558 0.081374 1.552875 0.578900 0.699964 0.527533 0.436093 15.188085 18.932200
6 1.000042 0.192872 0.106553 -0.000079 0.675911 -0.000065 0.192512 3.096166 0.193233 -8.572647e-07 0.524317 0.056289 1.518029 0.767859 0.695073 0.580975 0.367109 18.297956 16.657177
7 1.000036 0.170423 0.092730 -0.001502 0.687232 -0.001626 0.169647 3.246090 0.171204 -2.903045e-05 0.554676 0.048633 1.479109 0.884422 0.743185 0.737307 0.335528 21.942174 13.619411
8 1.000032 0.155010 0.083478 -0.000446 0.694161 -0.000496 0.154014 3.349153 0.156014 -9.595088e-06 0.581493 0.043571 1.455532 0.884422 0.701268 0.681691 0.370146 22.229196 11.636157
Dunn
2 0.338909
3 0.112795
4 0.123508
5 0.123508
6 0.131081
7 0.131081
8 0.150756
* The best partition is:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1]
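If the selected number of clusters is needed programmatically rather than through the printed report, the summary above suggests it is stored on the object. The attribute name below is inferred from that output ("final_k = 2"), not from documented API.
# Attribute name inferred from the printed summary above; verify against the docs.
print(c.final_k)    # 2 for the iris run above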
5. It's also possible to visualize the hierarchical clustering. Note: on Linux systems, installing the "python3-tk" library might be needed.
c.plot()
6. The user can also change some execution parameters, for example the clustering algorithm, the range of k to test, the number of bootstrap samples, and the CVI to use.
data, _ = make_blobs(n_samples=700, centers=10, n_features=5, random_state=0)
c = Clusterval(min_k=5, max_k=15, algorithm='kmeans', bootstrap_samples=200, index='CVNN')
c.evaluate(data)
Output:
Clusterval(min_k=5, max_k=15, algorithm=kmeans, bootstrap_samples=200, index=['CVNN'])
final_k = 10
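The repr above shows index stored as a list (index=['CVNN']), which suggests a subset of CVIs can be requested at once. This is an assumption about the constructor, not something confirmed on this page; the index names themselves ('CVNN', 'S', 'DB') are taken from the list of indices printed earlier.
# Assumption: `index` also accepts a list of CVI names, as the repr above suggests.
c = Clusterval(min_k=5, max_k=15, algorithm='kmeans', bootstrap_samples=200, index=['CVNN', 'S', 'DB'])
c.evaluate(data)
print(c.long_info)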