Clustering C Index
Python native implementation of C Index described by Hubert & Levin
Calculates Hubert & Levin C index - internal cluster quality index
The C Index was described by Hubert & Levin (1976) in “A General Statistical Framework for Assessing Categorical Clustering in Free Recall.” Psychological Bulletin, 83(6), 1072–1080. Its purpose is to compare the dispersion of clusters of data relative to the total dispersion in a dataset. Ideally, the value of the number of clusters that minimizes the C Index will also be the optimal number of clusters to partition a dataset.
The C Index is calculated as
cindex = (Sw - Smin) / (Smax - Smin)
Sw is the sum of within-cluster distance measurements (only point-wise combinations of data are summed within each cluster - not between clusters) Smin is the sum of the Nw smallest point-wise distances between points within the entire dataset Smax is the sum of the nw largest point-wise distances between points within the entire dataset Nw is the total number of pairs of observations belonging to the same cluster. It is the same as total combinations of points within clusters taken two at a time
pip install c_index
from c_index import (calc_c_index, calc_cindex_clusterSim_implementation, calc_cindex_nbclust_implementation, pdist_array) xs = np.array([[1,2,1.5,1.75,1.33,0.88], [5,5.5,4.88,6.33,5.01,4.95]]) # Cluster 1, 2 x values ys = np.array([[8.9,8.5,7.89,8.25,8.85,8.29], [1.25,1.14,1.85,0.85,0.79,0.96]]) # Cluster 1,2 y values X = np.stack((xs.ravel(),ys.ravel()), axis=1) cluster_labels = np.array([0,0,0,0,0,0,1,1,1,1,1,1]) cindex = calc_c_index(X, cluster_labels) # nbclust implementation takes an array of pointwise differences distances_array = pdist_array(X) # NbClust C Index python equivalent cindex = calc_cindex_nbclust_implementation(distances_array, cluster_labels) # clusterSim C Index python equivalent cindex = calc_cindex_clusterSim_implementation(distances_array, cluster_labels)
This packages implementation differs from these packages in a couple ways.
Note these two lines (R Code) :
Dmin=sum(sort(ddist)[1:r]) Dmax=sum(sort(ddist,decreasing = T)[1:r])
THey include the whole distance array, which includes all permutations of distances between points (instead of combinations). This means the high end and low end are double counted. I don't think that is the correct way to calculate C Index, but maybe they have a specific reason for doing so.
Note these lines (R Code) :
Dmin = min(v_min) Dmax = max(v_max) result <- (DU - r * Dmin)/(Dmax * r - Dmin * r)
Instead of sorting and calculating the Nw minimum and maximum distance measurements between points, they simply multiply 'r' by the minimum and maximum values of distances. I'm not sure what their motivation was for doing that, but sometimes it does give a better estimate of the number of clusters than the described method above.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size c_index-0.0.2-py3-none-any.whl (13.0 kB)||File type Wheel||Python version py3||Upload date||Hashes View|
|Filename, size c_index-0.0.2.tar.gz (10.5 kB)||File type Source||Python version None||Upload date||Hashes View|