Estimate the true number of clusters for k-means clustering using the Cluster Consistency Criterion (CCC).

Project description

The algorithm rests on the rationale that true cluster centres should be similar across random split-halves of the data. If too many clusters are specified, the cluster centres become driven by random sampling error and no longer agree between the halves.

The CCC implements this as follows. For each candidate number of clusters, the data are split into random halves a given number of times (e.g., 20). For each split, a k-means cluster analysis is run on each half separately. The distances between most-similar cluster centres in the two halves are summed. The similarity score for the split is e^(-distance_sum). The mean similarity score over the random splits is the score for the given number of clusters.
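The procedure above can be sketched in plain Python. This is a minimal illustration and not the package's actual code: the function names (kmeans, split_half_similarity, ccc_score), the simple Lloyd-style k-means, and the matching rule (each centre in one half is paired with its nearest centre in the other half) are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, rng, n_iter=50):
    """Basic Lloyd's k-means (illustrative stand-in); returns k centres."""
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centres[j]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        new_centres = [
            [sum(col) / len(g) for col in zip(*g)] if g else centres[j]
            for j, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres


def split_half_similarity(points, k, rng):
    """Similarity e^(-distance_sum) for one random split into halves."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    centres_a = kmeans(shuffled[:half], k, rng)
    centres_b = kmeans(shuffled[half:], k, rng)
    # Assumed matching rule: nearest centre in the other half.
    distance_sum = sum(
        min(math.dist(a, b) for b in centres_b) for a in centres_a
    )
    return math.exp(-distance_sum)


def ccc_score(points, k, n_splits=20, seed=0):
    """Mean similarity over n_splits random split-halves."""
    rng = random.Random(seed)
    total = sum(split_half_similarity(points, k, rng) for _ in range(n_splits))
    return total / n_splits


# Synthetic example data: two well-separated 2D blobs of 100 points each.
data_rng = random.Random(42)
points = [
    [data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)]
    for cx, cy in [(0.0, 0.0), (5.0, 5.0)]
    for _ in range(100)
]
for k in range(1, 6):
    print(k, round(ccc_score(points, k), 3))
```

Because the score is e^(-distance_sum), it always lies between 0 and 1, with values near 1 indicating that the two halves produced nearly identical centres.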

The best estimate of the true number of clusters is determined by where the score drops off, which occurs when the number of clusters becomes higher than the true number of clusters.
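The description does not state the exact rule used to locate the drop-off. One plausible reading, shown here purely as an assumption, is to pick the number of clusters just before the largest decrease between consecutive scores. The function name best_k_by_largest_drop and the example scores are hypothetical; the package may use a different criterion.

```python
def best_k_by_largest_drop(n_vec, scores_per_n):
    """Return the number of clusters just before the largest score decrease.

    Assumed heuristic for illustration only, not necessarily the rule
    implemented in teg_CCC.
    """
    if len(scores_per_n) < 2 or len(n_vec) != len(scores_per_n):
        raise ValueError("need at least two scores, one per entry of n_vec")
    # drops[i] is the decrease when going from n_vec[i] to n_vec[i + 1].
    drops = [
        scores_per_n[i] - scores_per_n[i + 1]
        for i in range(len(scores_per_n) - 1)
    ]
    i_best = max(range(len(drops)), key=drops.__getitem__)
    return n_vec[i_best]


# Made-up scores: stable up to 3 clusters, then a sharp drop at 4.
example_n = [1, 2, 3, 4, 5]
example_scores = [0.9, 0.85, 0.8, 0.3, 0.25]
print(best_k_by_largest_drop(example_n, example_scores))
```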

The file test.py gives an example and simulation script. Usage is:

O = teg_CCC.get_best_k_CCC(X)

where X is a 2D array of shape N_Observations x N_Variables. The optional argument max_n_clusters is set to 10 by default. The output is a dictionary with the estimate of the true number of clusters (best_n), the similarity score per number of clusters (scores_per_n), and the associated numbers of clusters (n_vec).
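As a small sketch of how the returned dictionary might be inspected, the helper below formats the three documented keys. The helper name describe_ccc_output is hypothetical, and the dictionary used here is a hand-written placeholder with made-up values so that the example runs without the package installed; with the package available, O would instead come from teg_CCC.get_best_k_CCC(X).

```python
def describe_ccc_output(O):
    """Format the documented keys of the output dictionary as text lines."""
    lines = [f"estimated number of clusters: {O['best_n']}"]
    for n, score in zip(O["n_vec"], O["scores_per_n"]):
        lines.append(f"n = {n}: score = {score:.3f}")
    return lines


# With the package installed, the dictionary would be obtained as:
#   import teg_CCC
#   O = teg_CCC.get_best_k_CCC(X)
# Placeholder with made-up values, used only to show the structure:
O_placeholder = {
    "best_n": 3,
    "n_vec": [2, 3, 4],
    "scores_per_n": [0.8, 0.75, 0.2],
}
for line in describe_ccc_output(O_placeholder):
    print(line)
```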

Source Distribution

teg_CCC-0.0.1.tar.gz (3.3 kB)

Uploaded Source

Built Distribution

teg_CCC-0.0.1-py3-none-any.whl (3.5 kB)

Uploaded Python 3

File details

Details for the file teg_CCC-0.0.1.tar.gz.

File metadata

  • Download URL: teg_CCC-0.0.1.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for teg_CCC-0.0.1.tar.gz
Algorithm Hash digest
SHA256 cbd598c9e50ce3d226b6319d77f23a6a94436051d718f1229a6f29c19ea5733e
MD5 36408b56de4425195f766ecaab680039
BLAKE2b-256 e29e064852f5f2ce3ee7d1073ee09458aea57f91965ab7a811edb848a5189ff4

File details

Details for the file teg_CCC-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: teg_CCC-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 3.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.5

File hashes

Hashes for teg_CCC-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 272a9b4f02499c78d018b66bb1d0a6e47cf8c05c7ec3f54e75e64238bf60cc2b
MD5 db9216e6b479e905e55ea418e699a5db
BLAKE2b-256 b3ea7756bb90b14a1748ea9c887cf7691fe6359d32ece2b1196124d6f43d7fa1
