Calculate various internal clustering validation or quality criteria.

Cluster Criteria


Description

This project is a Python extension for the clusterCrit R package. It also provides the ability to select the optimal result when running clustering criteria over any number of clusters.

External Dependencies

The R programming language is a dependency of this project and must be installed before installing this package. Please visit the R Downloads Page.
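
Assuming the package is published on PyPI under the name cluster-crit (an assumption based on this project's name), it can then be installed with pip:

python -m pip install cluster-crit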

Internal Criteria

The function intCriteria calculates internal clustering indices. The list of all internal criteria can be found in criteria.py.
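
A minimal sketch of a call; the import path and enumeration names here are assumptions, and a full walk-through appears in the Examples section below:

import numpy as np
from cluster_crit import intCriteria, CriteriaInternal  # import path assumed

# a small 2-D data set and a partition assigning each row to cluster 1 or 2
data = np.asarray([0.0, 0.1, 0.9, 1.0]).reshape(-1, 1)
part = [1, 1, 2, 2]

# compute two internal indices at once; results are keyed by criterion name
output = intCriteria(data, part, [CriteriaInternal.Dunn, CriteriaInternal.Calinski_Harabasz])
print(output)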

External Criteria

The function extCriteria calculates external clustering indices in order to compare two partitions. The list of all external criteria can be found in criteria.py.

Best Criterion

Given a vector of several clustering quality index values computed with a given criterion, the bestCriterion function returns the index of the "best" one in the sense of that criterion. Typically, a data set has been clustered several times (using different algorithms or a different number of clusters) and a clustering index has been calculated each time; bestCriterion determines which value is considered the best according to the given index. For instance, with the Calinski_Harabasz index, the best value is the largest one. A list of all supported criteria can be obtained with the getCriteriaNames function. The criterion name (the crit argument) is case insensitive and may be abbreviated.
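
A minimal sketch, using illustrative (made-up) index values and an assumed import path:

import numpy as np
from cluster_crit import bestCriterion  # import path assumed

# hypothetical Calinski-Harabasz values from three different clusterings
scores = np.asarray([158.2, 201.7, 190.3])

# for Calinski_Harabasz the largest value is best, so the index of 201.7 is returned
print(bestCriterion(scores, "Calinski_Harabasz"))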

Get Criteria Names

The getCriteriaNames function returns a list of criteria names.

  • Set the internal argument to True to return internal criteria names, or to False to return external criteria names.
  • When retrieving internal criteria, set includeGCI to False to skip any criteria named GDI-XXX.
  • Control the return type with returnEnumerations: True returns enumerations, False returns the string representations of the criteria (see the sketch below).
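
A brief sketch of these options, assuming the same three positional arguments used in the Examples section below and an assumed import path:

from cluster_crit import getCriteriaNames  # import path assumed

# internal criteria, skipping the GDI-XXX entries, returned as enumerations
internal_enums = getCriteriaNames(True, False, True)

# external criteria, returned as plain strings
external_names = getCriteriaNames(False, True, False)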

Examples

The following sections provide a set of brief examples of this library. To set up these examples, use the following steps:

  1. Install All External Dependencies (see external dependencies above).
  2. Install kmeans1d: python -m pip install kmeans1d
  3. Create the original set of data (this is a sample taken from a large data set).
original = [
    -0.018, -0.03, 0.025, -0.073, -0.007, 0.052, -0.042, -0.025, -0.056, 0.005,
    0.131, 0.059, 0.15, 0.157, 0.036, 0.096, -0.027, -0.002, 0.069, 0.099,
    0.067, 0.101, 0.105, 0.115, 0.108, -0.036, -0.109, -0.133, -0.061, -0.045,
    -0.058, 0.017, 0.007, -0.093, 0.077, 0.085, 0.1, -0.005, 0.009, 0.16
]

Note: when the data is a 1-D data set like the one above, it is advised that you convert it to a 2-D array, as shown below.

import numpy as np

original = np.asarray([
    -0.018, -0.03, 0.025, -0.073, -0.007, 0.052, -0.042, -0.025, -0.056, 0.005,
    0.131, 0.059, 0.15, 0.157, 0.036, 0.096, -0.027, -0.002, 0.069, 0.099,
    0.067, 0.101, 0.105, 0.115, 0.108, -0.036, -0.109, -0.133, -0.061, -0.045,
    -0.058, 0.017, 0.007, -0.093, 0.077, 0.085, 0.1, -0.005, 0.009, 0.16
])
original = original.reshape(-1, 1)
  4. Create a wrapper for kmeans so that we can generate the clusters for the above data set.
from kmeans1d import cluster as kmeans1dc

def k_means_wrapper(data_set, k):
    # kmeans1d returns the cluster assignments and the centroids
    matching_clusters, centroids = kmeans1dc(data_set, k)
    # R uses cluster labels 1..N rather than 0..N-1, so shift the labels here
    matching_clusters = [x + 1 for x in matching_clusters]
    return matching_clusters
  5. Cluster the data using values of K = 2, 3, 4, 5, 6.
clusters = [
    k_means_wrapper(original, 2),
    k_means_wrapper(original, 3),
    k_means_wrapper(original, 4),
    k_means_wrapper(original, 5),
    k_means_wrapper(original, 6)
]

You should now have values similar to the following:

clusters = [
    [1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2],
    [2, 2, 2, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 3, 2, 2, 3],
    [2, 2, 2, 1, 2, 3, 1, 2, 1, 2, 4, 3, 4, 4, 2, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 3, 2, 2, 4],
    [2, 2, 3, 2, 3, 4, 2, 2, 2, 3, 5, 4, 5, 5, 3, 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 2, 1, 1, 2, 2, 2, 3, 3, 1, 4, 4, 4, 3, 3, 5],
    [3, 2, 3, 2, 3, 4, 2, 2, 2, 3, 6, 4, 6, 6, 4, 5, 2, 3, 4, 5, 4, 5, 5, 5, 5, 2, 1, 1, 2, 2, 2, 3, 3, 1, 4, 5, 5, 3, 3, 6],
]

These clusters can be used as parameters to intCriteria. Follow similar steps to produce data for extCriteria.
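
For instance, two of the clusterings produced above can be compared directly with an external criterion (a sketch; the import path and the Rand enumeration name are assumptions):

from cluster_crit import extCriteria, CriteriaExternal  # import path assumed

# compare the K=2 and K=3 partitions of the same data set
output = extCriteria(clusters[0], clusters[1], [CriteriaExternal.Rand])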

Internal Criteria

The following computes the Dunn criterion for each of the clusterings above (cluster counts two through six).

from cluster_crit import intCriteria, CriteriaInternal  # module name assumed from this package

criteria = CriteriaInternal.Dunn

values = []
for cluster in clusters:
    output = intCriteria(original, cluster, [criteria])
    values.append(output[criteria.name])

External Criteria

from random import randint

from cluster_crit import extCriteria, CriteriaExternal  # module name assumed from this package

# generate two artificial partitions
part1 = [randint(1, 3) for _ in range(150)]
part2 = [randint(1, 5) for _ in range(150)]

output = extCriteria(part1, part2, [CriteriaExternal.Czekanowski_Dice])

Best Criterion

Continuing with the intCriteria example above, the following prints the index of the best cluster size given the outputs of the internal criteria evaluation.

from cluster_crit import bestCriterion  # module name assumed from this package

crit = np.asarray(values)
print(bestCriterion(crit, criteria.name))

Get Criteria Names

The following example gets all internal criteria, excluding the GDI-XXX criteria, and returns the values as enumerations.

from cluster_crit import getCriteriaNames  # module name assumed from this package
criteria = getCriteriaNames(True, False, True)

The parameters shown are the defaults and do not need to be specified.
