Skip to main content

Calculate various internal clustering validation or quality criteria.

Project description


Cluster Criteria

Build PyPI version fury.io

Description

The project is a python extension for the cluster crit R package. The project also contains the ability to select the optimal results when running Cluster Criteria algorithms on any number of clusters.

External Dependencies

The R programming language is a dependency of this project, and it must be installed prior to installing this project. Please visit the R Downloads Page.

Internal Criteria

The function intCriteria calculates internal clustering indices. The list of all internal criteria can be found in criteria.py.

External Criteria

The function extCriteria calculates external clustering indices in order to compare two partitions. The list of all external criteria can be found in criteria.py.

Best Criterion

Given a vector of several clustering quality index values computed with a given criterion, the function bestCriterion returns the index of the "best" one in the sense of the specified criterion. Typically, a set of data has been clusterized several times (using different algorithms or specifying a different number of clusters) and a clustering index has been calculated each time. The bestCriterion function determines which value is considered the best according to the given clustering index. For instance, if one uses the Calinski_Harabasz index, the best value is the largest one. A list of all the supported criteria can be obtained with the getCriteriaNames function. The criterion name (crit argument) is case insensitive and can be abbreviated.

Get Criteria Names

Get a list of Criteria Names.

  • The user can return Internal vs External Criteria Names by setting the internal variable to True vs False respectively.
  • When retrieving Internal Criteria, the user can set includeGCI to False to skip returning any criteria with GDI-XXX as the name.
  • The user can also control the return type by setting returnEnumerations to True (return Enumerations) or False to return the string representations of the criteria.

Examples

The following sections are a set of brief/simple examples of this library. To setup/initialize these tests, you can use the following steps:

  1. Install All External Dependencies (see external dependencies above).
  2. Install kmeans1d: python -m pip install kmeans1d
  3. Create the original set of data (this is a sample taken from a large data set).
original = [
    -0.018, -0.03, 0.025, -0.073, -0.007, 0.052, -0.042, -0.025, -0.056, 0.005,
    0.131, 0.059, 0.15, 0.157, 0.036, 0.096, -0.027, -0.002, 0.069, 0.099,
    0.067, 0.101, 0.105, 0.115, 0.108, -0.036, -0.109, -0.133, -0.061, -0.045,
    -0.058, 0.017, 0.007, -0.093, 0.077, 0.085, 0.1, -0.005, 0.009, 0.16
]

Note: It is advised that you convert this data when it is a 1-D data set like above.

import numpy as np

original = np.asarray([
    -0.018, -0.03, 0.025, -0.073, -0.007, 0.052, -0.042, -0.025, -0.056, 0.005,
    0.131, 0.059, 0.15, 0.157, 0.036, 0.096, -0.027, -0.002, 0.069, 0.099,
    0.067, 0.101, 0.105, 0.115, 0.108, -0.036, -0.109, -0.133, -0.061, -0.045,
    -0.058, 0.017, 0.007, -0.093, 0.077, 0.085, 0.1, -0.005, 0.009, 0.16
])
original = original.reshape(-1, 1)
  1. Create a wrapper for kmeans so that we can generate the clusters for the above data set.
def k_means_wrapper(data_set, k):
    matching_clusters, centroids = kmeans1dc(data_set, k)
    # R uses values 1-N not 0-N-1, so let's update here
    matching_clusters = [x+1 for x in matching_clusters]
    return matching_clusters
  1. Cluster the data using values of K = 2,3,4,5,6 ...
clusters = [
    k_means_wrapper(original, 2),
    k_means_wrapper(original, 3),
    k_means_wrapper(original, 4),
    k_means_wrapper(original, 5),
    k_means_wrapper(original, 6)
]

You should now have values similar to the following:

clusters = [
    [1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2],
    [2, 2, 2, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 3, 2, 2, 3],
    [2, 2, 2, 1, 2, 3, 1, 2, 1, 2, 4, 3, 4, 4, 2, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 3, 2, 2, 4],
    [2, 2, 3, 2, 3, 4, 2, 2, 2, 3, 5, 4, 5, 5, 3, 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 2, 1, 1, 2, 2, 2, 3, 3, 1, 4, 4, 4, 3, 3, 5],
    [3, 2, 3, 2, 3, 4, 2, 2, 2, 3, 6, 4, 6, 6, 4, 5, 2, 3, 4, 5, 4, 5, 5, 5, 5, 2, 1, 1, 2, 2, 2, 3, 3, 1, 4, 5, 5, 3, 3, 6],
]

These clusters can be used as parameters to IntCriteria. Follow similar steps to produce data for ExtCriteria.

Internal Criteria

The following will receive the results of the clusters with the Dunn Criteria on cluster numbers two through six.

criteria = CriteriaInternal.Dunn

values = []
for cluster in clusters:
    output = intCriteria(original, cluster, [criteria])
    values.append(output[criteria.name])

External Criteria

from random import randint

# generate two artificial partitions
part1 = [randint(1,3) for _ in range(150)]
part2 = [randint(1,5) for _ in range(150)]

output = extCriteria(part1, part2, [CriteriaExternal.Czekanowski_Dice])

Best Criterion

Continuing with the IntCriteria example above, the following will print the index of the best cluster size given the outputs of the Internal Crtieria evaluation.

crit = np.asarray(values)
print(bestCriterion(crit, criteria.name))

Get Criteria Names

The example will get all InternalCriteria, excluding GDI-XXX criteria, and the values will be returned as enumerations.

criteria = getCriteriaNames(True, False, True)

The provided parameters are defaults, and they do not need to be specified.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cluster-crit-1.0.1.tar.gz (7.3 kB view hashes)

Uploaded Source

Built Distribution

cluster_crit-1.0.1-py3-none-any.whl (8.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page