Clustering C Index

Native Python implementation of the C Index described by Hubert & Levin

Description

Calculates the Hubert & Levin C Index, an internal cluster quality index.

The C Index was described by Hubert & Levin (1976) in “A General Statistical Framework for Assessing Categorical Clustering in Free Recall,” Psychological Bulletin, 83(6), 1072–1080. It compares the dispersion within clusters to the total dispersion in the dataset. Ideally, the number of clusters that minimizes the C Index is also the optimal number of clusters for partitioning the dataset.

More Description

The C Index is calculated as cindex = (Sw - Smin) / (Smax - Smin)

Sw is the sum of within-cluster distances: only pairwise distances between points in the same cluster are summed, never distances between clusters.

Smin is the sum of the Nw smallest pairwise distances in the entire dataset.

Smax is the sum of the Nw largest pairwise distances in the entire dataset.

Nw is the total number of pairs of observations that belong to the same cluster, i.e. the number of combinations of points within each cluster taken two at a time, summed over all clusters.
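The definitions above can be sketched directly in NumPy. This is a minimal illustration, not the package's actual code: the function name `c_index_sketch` is hypothetical, and Euclidean distance is an assumption, since the text does not fix a metric.

```python
import numpy as np


def c_index_sketch(X, labels):
    """C Index from the definitions above (illustrative, not the package's code)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Each unordered pair (i, j) with i < j appears exactly once.
    i, j = np.triu_indices(len(X), k=1)
    d = np.linalg.norm(X[i] - X[j], axis=1)  # Euclidean distance (assumption)

    # Nw: number of pairs whose two points share a cluster label.
    same_cluster = labels[i] == labels[j]
    n_w = int(same_cluster.sum())
    if n_w == 0:
        raise ValueError("no two points share a cluster")

    s_w = d[same_cluster].sum()            # Sw: within-cluster distances
    d_sorted = np.sort(d)
    s_min = d_sorted[:n_w].sum()           # Smin: Nw smallest distances
    s_max = d_sorted[d.size - n_w:].sum()  # Smax: Nw largest distances
    return (s_w - s_min) / (s_max - s_min)


# Four points on a line; pairwise distances are 1, 2, 3, 4, 6, 7.
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 7.0]])
print(c_index_sketch(X, [0, 0, 1, 1]))  # (5 - 3) / (13 - 3) = 0.2
```

The sketch does not guard against Smax equal to Smin (for example, when every pair of points is in the same cluster), in which case the ratio is undefined.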

Installation

pip install c_index

Example usage

import numpy as np

from c_index import (calc_c_index, calc_cindex_clusterSim_implementation,
                     calc_cindex_nbclust_implementation,
                     pdist_array)

xs = np.array([[1,2,1.5,1.75,1.33,0.88],
                [5,5.5,4.88,6.33,5.01,4.95]]) # Cluster 1, 2 x values
ys = np.array([[8.9,8.5,7.89,8.25,8.85,8.29],
                [1.25,1.14,1.85,0.85,0.79,0.96]]) # Cluster 1,2 y values
X = np.stack((xs.ravel(),ys.ravel()), axis=1)
cluster_labels = np.array([0,0,0,0,0,0,1,1,1,1,1,1])

cindex = calc_c_index(X, cluster_labels)

# The NbClust and clusterSim variants take an array of pairwise distances
distances_array = pdist_array(X)

# NbClust C Index python equivalent
cindex = calc_cindex_nbclust_implementation(distances_array, cluster_labels)

# clusterSim C Index python equivalent
cindex = calc_cindex_clusterSim_implementation(distances_array, cluster_labels)

The C Index is already implemented in a number of R packages, including clusterSim, clusterCrit, and NbClust.

This package's implementation differs from those packages in a couple of ways.

clusterSim differences

Note these two lines (R code):

	Dmin=sum(sort(ddist)[1:r])
	Dmax=sum(sort(ddist,decreasing = T)[1:r])

They use the whole distance array, which contains every permutation of point pairs rather than every combination, so each distance appears twice. As a result, the smallest and largest distances are double counted. I don't think that is the correct way to calculate the C Index, but there may be a specific reason for doing so.
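The double counting described above can be seen on a tiny dataset. The sketch below is an illustration in NumPy, not a port of clusterSim: it assumes the full array is the symmetric distance matrix with the diagonal dropped, and `r` stands in for the number of within-cluster pairs.

```python
import numpy as np

# Four points on a line; the six unique pairwise distances are 1, 2, 3, 4, 6, 7.
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 7.0]])
n = len(X)

# Full symmetric n x n distance matrix: every distance is stored twice.
full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

combos = full[np.triu_indices(n, k=1)]  # 6 values, one per unordered pair
perms = full[~np.eye(n, dtype=bool)]    # 12 values, one per ordered pair

r = 2  # number of within-cluster pairs (illustrative value)

dmin_combos = np.sort(combos)[:r].sum()        # 1 + 2 = 3
dmin_perms = np.sort(perms)[:r].sum()          # 1 + 1 = 2
dmax_combos = np.sort(combos)[::-1][:r].sum()  # 7 + 6 = 13
dmax_perms = np.sort(perms)[::-1][:r].sum()    # 7 + 7 = 14

print(dmin_combos, dmin_perms, dmax_combos, dmax_perms)
```

With ordered pairs, the same extreme distance fills two of the `r` slots, so the minimum sum shrinks and the maximum sum grows relative to the combination-based sums.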

NbClust differences

Note these lines (R code):

    Dmin = min(v_min)
    Dmax = max(v_max)
    result <- (DU - r * Dmin)/(Dmax * r - Dmin * r)

Instead of sorting the distances and summing the Nw smallest and Nw largest, they simply multiply 'r' by the minimum and by the maximum distance. I'm not sure what the motivation was, but this variant sometimes gives a better estimate of the number of clusters than the method described above.
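For comparison, the min/max variant described above might look like the following in NumPy. This is a sketch based only on the three R lines shown, not a verified port of NbClust: it assumes `DU` is the within-cluster distance sum, `r` is the number of within-cluster pairs, and `Dmin` and `Dmax` are the smallest and largest pairwise distances in the dataset. The function name is hypothetical.

```python
import numpy as np


def c_index_minmax_sketch(X, labels):
    """Min/max variant of the C Index (illustrative; assumptions in the text above)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Unique pairwise Euclidean distances (i < j).
    i, j = np.triu_indices(len(X), k=1)
    d = np.linalg.norm(X[i] - X[j], axis=1)

    same_cluster = labels[i] == labels[j]
    r = int(same_cluster.sum())  # number of within-cluster pairs
    du = d[same_cluster].sum()   # within-cluster distance sum

    d_min, d_max = d.min(), d.max()
    # r * d_min and r * d_max replace the sorted sums Smin and Smax.
    return (du - r * d_min) / (d_max * r - d_min * r)


X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 7.0]])
print(c_index_minmax_sketch(X, [0, 0, 1, 1]))  # (5 - 2) / (14 - 2) = 0.25
```

Because `r * d_min` is never larger than the sum of the `r` smallest distances, and `r * d_max` is never smaller than the sum of the `r` largest, this variant uses a wider normalizing range than the sorted definition.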
