A consensus K-Means implementation.

License: MIT

pyckmeans

pyckmeans is a Python package for Consensus K-Means and Weighted Ensemble Consensus of Random (WECR) K-Means clustering, especially in the context of DNA sequence data. To evaluate the quality of clusterings, pyckmeans implements several internal validation metrics.

In addition to the clustering functionality, it provides tools for working with DNA sequence data, such as reading and writing DNA alignment files, calculating genetic distances, and Principal Coordinate Analysis (PCoA) for dimensionality reduction.
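As an illustration of what a genetic distance is, the sketch below computes the Kimura 2-parameter (K2P) distance, the distance type used in the sequence examples further down, for one pair of aligned sequences. It is not pyckmeans' implementation: the function name `k2p_distance` and the choice to skip gaps and ambiguity codes site by site are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    # Undefined (math domain error) for very divergent sequences
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")  # one transition in ten sites
```

The K2P distance is slightly larger than the raw proportion of differing sites because it corrects for multiple substitutions at the same site.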

Consensus K-Means

Consensus K-Means is an unsupervised ensemble clustering algorithm that combines multiple K-Means clusterings, where each K-Means is trained on a subset of the data (random subset) and a subset of the features (random subspace). The predicted cluster memberships of the single clusterings are combined into a consensus (or co-association) matrix, which records the number of times each pair of samples was clustered together over all clusterings. This matrix can be interpreted as a similarity matrix and can be used to resolve the final consensus clustering by subjecting it to a final clustering step, e.g. hierarchical or spectral clustering.
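To make the co-association idea concrete, here is a minimal pure-Python sketch. It is not pyckmeans' implementation: the function name `consensus_matrix`, the use of `None` for samples that were not drawn in a run, and the normalization by the number of runs in which both samples were drawn are assumptions for illustration.

```python
def consensus_matrix(label_runs):
    """Build a co-association matrix from several clusterings.

    label_runs: list of runs; each run holds one cluster label per sample,
    or None if the sample was not drawn in that run.
    """
    n = len(label_runs[0])
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both drawn
    for labels in label_runs:
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Normalize counts to [0, 1]; pairs never drawn together get 0
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


runs = [
    [0, 0, 1, 1],
    [1, 1, 0, None],  # sample 3 was not drawn in this run
    [0, 1, 1, 1],
]
cmat = consensus_matrix(runs)
```

Note that cluster labels are only compared within a run, so it does not matter that label `0` in one run and label `0` in another refer to different clusters.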

WECR K-Means

Weighted Ensemble Consensus of Random (WECR) K-Means is a semi-supervised ensemble clustering algorithm. Similar to Consensus K-Means, it is based on a collection of K-Means clusterings, each trained on a random subset of the data and a random subspace of the features. In addition, the number of clusters k is randomized for each single clustering. This library of clusterings is subjected to a weighting function that integrates user-supplied must-link and must-not-link constraints, as well as an internal cluster validation criterion. The constraints represent the semi-supervised component of WECR K-Means: the user can provide prior knowledge concerning the composition of the clusters. Must-link and must-not-link constraints imply that a pair of samples (observations, data points) is expected to be found in the same or in different clusters, respectively. Based on the clusterings and the calculated weights, a weighted consensus (co-association) matrix is constructed, which is subjected to the Cluster-based Similarity Partitioning Algorithm (CSPA; e.g. hierarchical clustering) or spectral clustering to resolve the consensus clustering.
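The following sketch illustrates how constraints could turn into weights and how weights enter the consensus matrix. It is a simplified stand-in, not the weighting function used by pyckmeans: the names `constraint_score` and `weighted_consensus` are hypothetical, the weight is simply the fraction of satisfied constraints, and the internal validation criterion is left out.

```python
def constraint_score(labels, must_link, must_not_link):
    """Fraction of pairwise constraints that a single clustering satisfies."""
    satisfied = 0
    total = 0
    for i, j in must_link:
        total += 1
        satisfied += labels[i] == labels[j]
    for i, j in must_not_link:
        total += 1
        satisfied += labels[i] != labels[j]
    return satisfied / total if total else 1.0


def weighted_consensus(label_runs, weights):
    """Weighted co-association matrix: each run votes with its weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    n = len(label_runs[0])
    cmat = [[0.0] * n for _ in range(n)]
    for labels, w in zip(label_runs, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    cmat[i][j] += w / total
    return cmat


runs = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 0, 1]]
must_link = [(0, 1)]      # samples 0 and 1 are expected in the same cluster
must_not_link = [(0, 3)]  # samples 0 and 3 are expected in different clusters
weights = [constraint_score(r, must_link, must_not_link) for r in runs]
wcmat = weighted_consensus(runs, weights)
```

Here the first run satisfies both constraints and therefore contributes twice as much to the consensus matrix as each of the other two runs.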

Documentation

See the pyckmeans documentation on Read the Docs (RTD) for details.

Installation

pyckmeans can be installed using pip, Conda, or from source.

pip

pip install pyckmeans

Conda

conda install pyckmeans -c TankredO

From Source

Installation from source requires git and a C++ compiler.

git clone https://github.com/TankredO/pyckmeans
cd pyckmeans
pip install .
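Whichever installation route is used, a quick way to confirm that the package is visible to the current Python environment is to query its installed version. The helper name `installed_version` is made up for this sketch, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyckmeans")
print("pyckmeans version:", version if version else "not installed")
```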

Usage

Examples using the Python API:

Consensus K-Means: Clustering a Data Matrix (Single K)

from pyckmeans import CKmeans

# simulate dataset
# 50 samples, 2 features, 3 true clusters
import sklearn.datasets
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=2, centers=3, random_state=75)

# apply Consensus K-Means
# 3 clusters, 100 K-Means runs,
# draw 80% of samples and 50% of features for each single K-Means
ckm = CKmeans(k=3, n_rep=100, p_samp=0.8, p_feat=0.5)
ckm.fit(x)
ckm_res = ckm.predict(x)

# plot consensus matrix and consensus clustering
fig = ckm_res.plot(figsize=(7,7))

# consensus matrix
ckm_res.cmatrix

# clustering metrics
print('Bayesian Information Criterion:', ckm_res.bic)
print('Davies-Bouldin Index:', ckm_res.db)
print('Silhouette Score:', ckm_res.sil)
print('Calinski-Harabasz Index:', ckm_res.ch)

# consensus clusters
print('Cluster Membership:', ckm_res.cl)

Output:

Bayesian Information Criterion: 50.21824821939818
Davies-Bouldin Index: 0.2893792767901513
Silhouette Score: 0.7827738719266039
Calinski-Harabasz Index: 630.8235586596012
Cluster Membership: [0 2 1 0 2 2 1 0 2 1 0 0 2 0 2 2 1 1 1 1 0 1 2 2 2 2 1 0 2 2 1 0 1 1 0 0 0
 1 0 1 2 1 2 2 1 0 0 0 0 1]

Consensus K-Means: Clustering a Data Matrix (Multi K)

The MultiCKMeans class allows training multiple CKmeans objects at once. This is useful, for example, for exploring clusterings for different values of k.

from pyckmeans import MultiCKMeans
import sklearn.datasets

# simulate dataset
# 50 samples, 10 features, 3 true clusters
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=10, centers=3, random_state=44)

# apply multiple Consensus K-Means for
# k = 2, ..., 5
# 100 K-Means runs per Consensus K-Means
# draw 80% of the samples for each single K-Means
# draw 50% of the features for each single K-Means
mckm = MultiCKMeans(k=[2, 3, 4, 5], n_rep=100, p_samp=0.8, p_feat=0.5)
mckm.fit(x)
mckm_res = mckm.predict(x)

# clustering metrics
print('Metrics:')
print(mckm_res.metrics)

# plot clustering metrics against k
# BIC, DB: lower is better
# SIL, CH: higher is better
mckm_res.plot_metrics(figsize=(10,5))


# get a single CKmeansResult by its position in k=[2, 3, 4, 5]
ckm_res_k3 = mckm_res.ckmeans_results[1]  # index 1 -> k=3
# ...
# see "Clustering a Data Matrix (Single K)"

Output:

Metrics:
   k       sil         bic        db          ch
0  2  0.574369  225.092100  0.646401   59.733498
1  3  0.788207  126.358519  0.302979  387.409107
2  4  0.563343  126.979355  1.214520  271.019424
3  5  0.339466  128.061382  1.698652  211.080143
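One simple way to condense the four metrics into a single choice of k is to rank each k per metric and sum the ranks. This is only a heuristic sketch, not a feature of pyckmeans; the function name `rank_sum` is an assumption, and the numbers are copied from the metrics table above.

```python
# Metrics copied from the table above: k -> (sil, bic, db, ch)
metrics = {
    2: (0.574369, 225.092100, 0.646401, 59.733498),
    3: (0.788207, 126.358519, 0.302979, 387.409107),
    4: (0.563343, 126.979355, 1.214520, 271.019424),
    5: (0.339466, 128.061382, 1.698652, 211.080143),
}
# True where higher is better (SIL, CH), False where lower is better (BIC, DB)
higher_is_better = (True, False, False, True)


def rank_sum(metrics, higher_is_better):
    """Sum each k's rank over all metrics (rank 0 = best)."""
    totals = {k: 0 for k in metrics}
    for col, higher in enumerate(higher_is_better):
        ordered = sorted(metrics, key=lambda k: metrics[k][col], reverse=higher)
        for rank, k in enumerate(ordered):
            totals[k] += rank
    return totals


totals = rank_sum(metrics, higher_is_better)
best_k = min(totals, key=totals.get)
print("rank sums:", totals)
print("best k:", best_k)
```

For this table all four metrics agree on k=3, which matches the three true clusters of the simulated dataset. When the metrics disagree, the rank sum is only a starting point and the metric plots should still be inspected.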

Consensus K-Means: Clustering Sequence Data

from pyckmeans import MultiCKMeans, NucleotideAlignment, pcoa
from IPython.display import display
# Set random seed for demonstration
import numpy
numpy.random.seed(0)

# Load nucleotide alignment
# Note: the file is available from
# "https://github.com/TankredO/pyckmeans/tree/main/docs/datasets/rhodanthemum_ct85_msl68.snps.phy"
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')
print('Nucleotide alignment:', aln)

# Calculate Kimura 2-parameter distances
dst = aln.distance(distance_type='k2p')

# Apply PCoA, including negative eigenvalue correction
pcoa_res = pcoa(dst, correction='lingoes')
# display Eigenvalues
print('Eigenvalues:')
display(pcoa_res.values)

# Get Eigenvectors until the cumulative corrected Eigenvalues are >= 0.8
vectors = pcoa_res.get_vectors(
    filter_by='eigvals_rel_corrected_cum',
    filter_th=0.8,
    out_format='pandas'
)

# Apply Multi-K Consensus K-Means
mckm = MultiCKMeans(
    k=range(2, 20),
    n_rep=50,
    p_samp=0.8,
    p_feat=0.8
)
mckm.fit(vectors)
mckm_res = mckm.predict(vectors)
mckm_res.plot_metrics(figsize=(12, 7))

# Select a 'good' K
# At k values around 7, BIC, DB, and SIL have a (local) optimum
# k=range(2, 20) starts at 2, so index 5 corresponds to k=7
ckm_res_k7 = mckm_res.ckmeans_results[5]
fig = ckm_res_k7.plot(figsize=(14,14))

Output:

Nucleotide alignment: <NucleotideAlignment; #samples: 108, #sites: 6752>
Eigenvalues:
       eigvals  eigvals_rel  eigvals_rel_cum  eigvals_rel_corrected  eigvals_rel_corrected_cum
0     0.115972     0.471458         0.233986               0.233986                   0.233986
1     0.039585     0.160924         0.317016               0.083030                   0.317016
2     0.035079     0.142604         0.391140               0.074125                   0.391140
3     0.017383     0.070665         0.430295               0.039154                   0.430295
4     0.009831     0.039965         0.454525               0.024230                   0.454525
...        ...          ...              ...                    ...                        ...
103  -0.001325    -0.005388         0.998575               0.001457                   0.998575
104  -0.001693    -0.006881         0.999654               0.001079                   0.999654
105  -0.001884    -0.007660         1.000000               0.000346                   1.000000
106  -0.002255    -0.009168         1.000000               0.000000                   1.000000
107  -0.002430    -0.009880         1.000000               0.000000                   1.000000

108 rows × 5 columns
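The eigenvector filter used in the example above keeps leading components until their cumulative relative eigenvalue reaches a threshold. The sketch below shows that selection rule on the first rows of the table; it is an illustration only. The function name `n_components_for_threshold` is hypothetical, and including the component that crosses the threshold is an assumption about the behavior.

```python
from itertools import accumulate


def n_components_for_threshold(rel_eigvals, threshold):
    """Number of leading components whose cumulative share reaches threshold."""
    for n, cum in enumerate(accumulate(rel_eigvals), start=1):
        if cum >= threshold:
            return n
    return len(rel_eigvals)  # threshold never reached: keep everything


# First five corrected relative eigenvalues from the table above
rel_corrected = [0.233986, 0.083030, 0.074125, 0.039154, 0.024230]
n_keep = n_components_for_threshold(rel_corrected, 0.4)
print("components needed for a cumulative share of 0.4:", n_keep)
```

A threshold of 0.4 is used here only because the visible rows of the table do not reach 0.8.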

WECR K-Means: Clustering Sequence Data

from pyckmeans import WECR, NucleotideAlignment, pcoa

# Load nucleotide alignment
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')

# Calculate Kimura 2-parameter distances
dst = aln.distance(distance_type='k2p')

# Apply PCoA, including negative eigenvalue correction
pcoa_res = pcoa(dst, correction='lingoes')

# Get Eigenvectors until the cumulative corrected Eigenvalues are >= 0.8
vectors = pcoa_res.get_vectors(
    filter_by='eigvals_rel_corrected_cum',
    filter_th=0.8,
    out_format='pandas'
)

# Apply WECR K-Means
wecr = WECR(
    k=range(2, 20),
    n_rep=1000,
    p_samp=0.6,
    p_feat=0.6,
)
wecr.fit(vectors)
wecr_res = wecr.predict(vectors)

# Plot clustering metrics for each k
wecr_res.plot_metrics(figsize=(12, 7))

# Select a 'good' K (e.g., 6, 7, 8) for the consensus clustering
wecr_res.plot(k=6, figsize=(14,14))

cluster_membership = wecr_res.get_cl(k=6, with_names=True)
print('cluster_membership:')
print(cluster_membership)

Output:

cluster_membership:
PP-R002-01         0
PP-R002-01-dupl    0
PP-R017-04         4
PP-R017-04-dupl    4
PP-R019-01         5
                  ..
R044-02            3
R044-12            3
R045-02            0
R045-06            0
R045-25            0
Length: 108, dtype: int32
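A quick sanity check on such a result is to verify that duplicated samples end up in the same cluster as their originals. The sketch below uses only the rows visible in the output above and assumes that the `-dupl` suffix marks a replicate of the same specimen; the helper name `split_duplicates` is made up. If the result is a pandas Series, as the `dtype` line suggests, `cluster_membership.to_dict()` should give a mapping of this shape.

```python
# Rows visible in the output above (sample name -> cluster label)
membership = {
    "PP-R002-01": 0,
    "PP-R002-01-dupl": 0,
    "PP-R017-04": 4,
    "PP-R017-04-dupl": 4,
    "PP-R019-01": 5,
}


def split_duplicates(membership, suffix="-dupl"):
    """Return duplicates assigned to a different cluster than their original."""
    mismatched = []
    for name, label in membership.items():
        if not name.endswith(suffix):
            continue
        original = name[: -len(suffix)]
        if original in membership and membership[original] != label:
            mismatched.append(name)
    return mismatched


mismatched = split_duplicates(membership)
print("duplicates split across clusters:", mismatched)
```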
