A consensus K-Means implementation.
pyckmeans
pyckmeans is a Python package for Consensus K-Means and Weighted Ensemble Consensus of Random (WECR) K-Means clustering, especially in the context of DNA sequence data. To evaluate the quality of clusterings, pyckmeans implements several internal validation metrics.
In addition to the clustering functionality, it provides tools for working with DNA sequence data, such as reading and writing DNA alignment files, calculating genetic distances, and Principal Coordinate Analysis (PCoA) for dimensionality reduction.
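To make the distance and ordination steps more concrete, here is a minimal sketch of a Kimura 2-parameter distance and a classical PCoA (double centering followed by an eigendecomposition). It is not pyckmeans' implementation: the function names `k2p_distance` and `pcoa_classical`, the toy sequences, and the choice to skip gaps and ambiguous sites are assumptions made for illustration only.

```python
import math

import numpy as np

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences (hypothetical helper)."""
    n = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous sites
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def pcoa_classical(dist):
    """Classical PCoA without negative eigenvalue correction (hypothetical helper)."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-12                        # only positive axes yield real coordinates
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, coords


# Toy alignment, invented for this example.
seqs = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAT",
    "s3": "ACGAACGTAT",
    "s4": "TCGAACGGAT",
}
names = list(seqs)
dist = np.zeros((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            dist[i, j] = dist[j, i] = k2p_distance(seqs[a], seqs[b])

eigvals, coords = pcoa_classical(dist)
print(np.round(dist, 4))
print(coords.shape)
```

Genetic distances are not guaranteed to be Euclidean, so some eigenvalues can be negative; this sketch simply drops those axes instead of applying a correction.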
Consensus K-Means
Consensus K-Means is an unsupervised ensemble clustering algorithm that combines multiple K-Means clusterings, where each K-Means is trained on a subset of the data (random subset) and a subset of the features (random subspace). The predicted cluster memberships of the single clusterings are combined into a consensus (or co-association) matrix, which records the number of times each pair of samples was clustered together over all clusterings. This matrix can be interpreted as a similarity matrix and can be used to resolve the final consensus clustering by subjecting it to a last clustering step, e.g. hierarchical or spectral clustering.
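The procedure described above can be sketched with scikit-learn and SciPy. This is an illustration under stated assumptions, not the pyckmeans implementation: the helper `consensus_matrix` is hypothetical, each K-Means is fitted on a random subset and then used to predict all samples, the matrix is normalized to the fraction of runs in which a pair was co-clustered, and the final step uses average-linkage hierarchical clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def consensus_matrix(x, k, n_rep=20, p_samp=0.8, p_feat=0.5, seed=0):
    """Fraction of K-Means runs in which each pair of samples shares a cluster."""
    rng = np.random.default_rng(seed)
    n, m = x.shape
    n_samp = max(k, int(round(p_samp * n)))
    n_feat = max(1, int(round(p_feat * m)))
    cmatrix = np.zeros((n, n))
    for _ in range(n_rep):
        rows = rng.choice(n, size=n_samp, replace=False)  # random subset
        cols = rng.choice(m, size=n_feat, replace=False)  # random subspace
        km = KMeans(n_clusters=k, n_init=10, random_state=int(rng.integers(2**31 - 1)))
        km.fit(x[np.ix_(rows, cols)])
        labels = km.predict(x[:, cols])                   # assign every sample
        cmatrix += labels[:, None] == labels[None, :]
    return cmatrix / n_rep


x, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=75)
cmatrix = consensus_matrix(x, k=3)

# Final step: treat 1 - consensus as a distance and cut a hierarchical tree.
condensed = squareform(1.0 - cmatrix, checks=False)
tree = linkage(condensed, method="average")
consensus_labels = fcluster(tree, t=3, criterion="maxclust")
print(consensus_labels)
```

Cluster label values are arbitrary; only the grouping of samples is meaningful.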
WECR K-Means
Weighted Ensemble Consensus of Random (WECR) K-Means is a semi-supervised ensemble clustering algorithm. Similar to consensus K-Means, it is based on a collection of K-Means clusterings, each of which is trained on a random subset of the data and a random subspace of the features. In addition, the number of clusters k is randomized for each single clustering. This library of clusterings is subjected to a weighting function that integrates user-supplied must-link and must-not-link constraints, as well as an internal cluster validation criterion. The constraints represent the semi-supervised component of WECR K-Means: the user can provide prior knowledge concerning the composition of the clusters. Must-link and must-not-link constraints imply that a pair of samples (observations, data points) is expected to be found in the same or in different clusters, respectively. Based on the clusterings and the calculated weights, a weighted consensus (co-association) matrix is constructed, which is subjected to Cluster-based Similarity Partitioning (CSPA; e.g. hierarchical clustering) or spectral clustering to resolve the consensus clustering.
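As a rough illustration of the constraint-based weighting, the sketch below scores each clustering by the fraction of must-link and must-not-link constraints it satisfies and uses the normalized scores as weights for the co-association matrix. This is a simplification and not the WECR weighting function itself: the internal validation criterion is omitted, and the helpers `constraint_score` and `weighted_consensus` as well as the toy clusterings are hypothetical.

```python
import numpy as np


def constraint_score(labels, must_link, must_not_link):
    """Fraction of pairwise constraints that a clustering satisfies."""
    ok = sum(labels[i] == labels[j] for i, j in must_link)
    ok += sum(labels[i] != labels[j] for i, j in must_not_link)
    total = len(must_link) + len(must_not_link)
    return ok / total if total else 1.0


def weighted_consensus(clusterings, weights):
    """Weighted co-association matrix; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(clusterings[0])
    cmatrix = np.zeros((n, n))
    for labels, wi in zip(clusterings, w):
        labels = np.asarray(labels)
        cmatrix += wi * (labels[:, None] == labels[None, :])
    return cmatrix


# Toy library of clusterings of six samples, with varying k.
clusterings = [
    [0, 0, 1, 1, 2, 2],  # k = 3, satisfies all constraints
    [0, 0, 0, 1, 1, 1],  # k = 2, violates one must-link
    [0, 1, 0, 1, 0, 1],  # k = 2, violates both must-links
]
must_link = [(0, 1), (2, 3)]  # pairs expected in the same cluster
must_not_link = [(0, 5)]      # pairs expected in different clusters

weights = [constraint_score(c, must_link, must_not_link) for c in clusterings]
wcmatrix = weighted_consensus(clusterings, weights)
print(np.round(weights, 3))
print(np.round(wcmatrix, 3))
```

Clusterings that agree with the prior knowledge contribute more to the consensus; the resulting matrix can then be passed to a final hierarchical or spectral clustering step.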
Documentation
See pyckmeans' Read the Docs documentation for details.
Installation
pyckmeans can be installed using pip, Conda, or from source.
pip
pip install pyckmeans
Conda
conda install pyckmeans -c TankredO
From Source
Installation from source requires git and a C++ compiler.
git clone https://github.com/TankredO/pyckmeans
cd pyckmeans
pip install .
Usage
Examples using the Python API:
- Consensus K-Means: Clustering a Data Matrix (Single K)
- Consensus K-Means: Clustering a Data Matrix (Multi K)
- Consensus K-Means: Clustering Sequence Data
- WECR K-Means: Clustering Sequence Data
Consensus K-Means: Clustering a Data Matrix (Single K)
from pyckmeans import CKmeans
# simulate dataset
# 50 samples, 2 features, 3 true clusters
import sklearn.datasets
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=2, centers=3, random_state=75)
# apply Consensus K-Means
# 3 clusters, 100 K-Means runs,
# draw 80% of samples and 50% of features for each single K-Means
ckm = CKmeans(k=3, n_rep=100, p_samp=0.8, p_feat=0.5)
ckm.fit(x)
ckm_res = ckm.predict(x)
# plot consensus matrix and consensus clustering
fig = ckm_res.plot(figsize=(7,7))
# consensus matrix
ckm_res.cmatrix
# clustering metrics
print('Bayesian Information Criterion:', ckm_res.bic)
print('Davies-Bouldin Index:', ckm_res.db)
print('Silhouette Score:', ckm_res.sil)
print('Calinski-Harabasz Index:', ckm_res.ch)
# consensus clusters
print('Cluster Membership:', ckm_res.cl)
Bayesian Information Criterion: 50.21824821939818
Davies-Bouldin Index: 0.2893792767901513
Silhouette Score: 0.7827738719266039
Calinski-Harabasz Index: 630.8235586596012
Cluster Membership: [0 2 1 0 2 2 1 0 2 1 0 0 2 0 2 2 1 1 1 1 0 1 2 2 2 2 1 0 2 2 1 0 1 1 0 0 0
1 0 1 2 1 2 2 1 0 0 0 0 1]
Consensus K-Means: Clustering a Data Matrix (Multi K)
The MultiCKMeans class allows training multiple CKmeans objects at once.
This is useful, for example, for exploring clusterings for different values of k.
from pyckmeans import MultiCKMeans
import sklearn.datasets
# simulate dataset
# 50 samples, 10 features, 3 true clusters
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=10, centers=3, random_state=44)
# apply multiple Consensus K-Means for
# k = 2, ..., 5
# 100 K-Means runs per Consensus K-Means
# draw 80% of the sample for each single K-Means
# draw 50% of the features for each single K-Means
mckm = MultiCKMeans(k=[2, 3, 4, 5], n_rep=100, p_samp=0.8, p_feat=0.5)
mckm.fit(x)
mckm_res = mckm.predict(x)
# clustering metrics
print('Metrics:')
print(mckm_res.metrics)
# plot clustering metrics against k
# BIC, DB: lower is better
# SIL, CH: higher is better
mckm_res.plot_metrics(figsize=(10,5))
# get a single CKmeansResult: index 1 corresponds to k=3 in k=[2, 3, 4, 5]
ckm_res_k3 = mckm_res.ckmeans_results[1]
# ...
# see "Clustering a Data Matrix (Single K)"
Metrics:
k sil bic db ch
0 2 0.574369 225.092100 0.646401 59.733498
1 3 0.788207 126.358519 0.302979 387.409107
2 4 0.563343 126.979355 1.214520 271.019424
3 5 0.339466 128.061382 1.698652 211.080143
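One simple way to read such a table is to let each metric vote for its preferred k. The sketch below does this in plain Python, using the values copied from the output above and the directions stated in the example (BIC and DB: lower is better; SIL and CH: higher is better). The helper `best_k_per_metric` is hypothetical and not part of pyckmeans, and a majority vote is only one of several reasonable heuristics.

```python
from collections import Counter

# Values copied from the metrics table printed above.
metrics = {
    "k": [2, 3, 4, 5],
    "sil": [0.574369, 0.788207, 0.563343, 0.339466],
    "bic": [225.092100, 126.358519, 126.979355, 128.061382],
    "db": [0.646401, 0.302979, 1.214520, 1.698652],
    "ch": [59.733498, 387.409107, 271.019424, 211.080143],
}
HIGHER_IS_BETTER = {"sil": True, "ch": True, "bic": False, "db": False}


def best_k_per_metric(metrics):
    """Return the k preferred by each metric (hypothetical helper)."""
    ks = metrics["k"]
    best = {}
    for name, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        idx = pick(range(len(ks)), key=lambda i: metrics[name][i])
        best[name] = ks[idx]
    return best


votes = best_k_per_metric(metrics)
best_k, n_votes = Counter(votes.values()).most_common(1)[0]
print(votes)
print("majority vote:", best_k)
```

For this table all four metrics agree on k = 3, which matches the number of simulated clusters.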
Consensus K-Means: Clustering Sequence Data
from pyckmeans import MultiCKMeans, NucleotideAlignment, pcoa
from IPython.display import display
# Set random seed for demonstration
import numpy
numpy.random.seed(0)
# Load nucleotide alignment
# Note: the file is available from
# "https://github.com/TankredO/pyckmeans/tree/main/docs/datasets/rhodanthemum_ct85_msl68.snps.phy"
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')
print('Nucleotide alignment:', aln)
# Calculate Kimura 2-parameter distances
dst = aln.distance(distance_type='k2p')
# Apply PCoA, including negative eigenvalue correction
pcoa_res = pcoa(dst, correction='lingoes')
# display Eigenvalues
print('Eigenvalues:')
display(pcoa_res.values)
# Get Eigenvectors until the cumulative corrected Eigenvalues are >= 0.8
vectors = pcoa_res.get_vectors(
    filter_by='eigvals_rel_corrected_cum',
    filter_th=0.8,
    out_format='pandas'
)
# Apply Multi-K Consensus K-Means
mckm = MultiCKMeans(
    k=range(2, 20),
    n_rep=50,
    p_samp=0.8,
    p_feat=0.8
)
mckm.fit(vectors)
mckm_res = mckm.predict(vectors)
mckm_res.plot_metrics(figsize=(12, 7))
# Select a 'good' K
# At k values around 7, BIC, DB, and SIL have a (local) optimum
ckm_res_k7 = mckm_res.ckmeans_results[5]
fig = ckm_res_k7.plot(figsize=(14,14))
Nucleotide alignment: <NucleotideAlignment; #samples: 108, #sites: 6752>
Eigenvalues:
| eigvals | eigvals_rel | eigvals_rel_cum | eigvals_rel_corrected | eigvals_rel_corrected_cum |
---|---|---|---|---|---|
0 | 0.115972 | 0.471458 | 0.233986 | 0.233986 | 0.233986 |
1 | 0.039585 | 0.160924 | 0.317016 | 0.083030 | 0.317016 |
2 | 0.035079 | 0.142604 | 0.391140 | 0.074125 | 0.391140 |
3 | 0.017383 | 0.070665 | 0.430295 | 0.039154 | 0.430295 |
4 | 0.009831 | 0.039965 | 0.454525 | 0.024230 | 0.454525 |
... | ... | ... | ... | ... | ... |
103 | -0.001325 | -0.005388 | 0.998575 | 0.001457 | 0.998575 |
104 | -0.001693 | -0.006881 | 0.999654 | 0.001079 | 0.999654 |
105 | -0.001884 | -0.007660 | 1.000000 | 0.000346 | 1.000000 |
106 | -0.002255 | -0.009168 | 1.000000 | 0.000000 | 1.000000 |
107 | -0.002430 | -0.009880 | 1.000000 | 0.000000 | 1.000000 |
108 rows × 5 columns
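The idea of keeping eigenvectors until a cumulative threshold is reached can be sketched with NumPy. The function `n_axes_for_threshold` is hypothetical, the eigenvalues are toy values, and the sketch clips negative eigenvalues to zero instead of applying the Lingoes correction used in the example; it also assumes that the axis on which the threshold is first reached is included, which may differ from the behavior of `get_vectors`.

```python
import numpy as np


def n_axes_for_threshold(eigvals, threshold=0.8):
    """Number of leading axes needed for the cumulative share to reach threshold."""
    vals = np.clip(np.asarray(eigvals, dtype=float), 0.0, None)  # assumption: drop negative mass
    rel_cum = np.cumsum(vals) / vals.sum()
    # first position where the cumulative share is >= threshold, as a count
    return int(np.searchsorted(rel_cum, threshold) + 1)


# Toy eigenvalues, sorted in descending order.
toy_eigvals = [4.0, 3.0, 2.0, 1.0, -0.5]
n_axes = n_axes_for_threshold(toy_eigvals, threshold=0.8)
print(n_axes)
```

With these toy values the cumulative shares are 0.4, 0.7, 0.9, and 1.0, so three axes are needed to reach 0.8.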
WECR K-Means: Clustering Sequence Data
from pyckmeans import WECR, NucleotideAlignment, pcoa
# Load nucleotide alignment
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')
# Calculate Kimura 2-parameter distances
dst = aln.distance(distance_type='k2p')
# Apply PCoA, including negative eigenvalue correction
pcoa_res = pcoa(dst, correction='lingoes')
# Get Eigenvectors until the cumulative corrected Eigenvalues are >= 0.8
vectors = pcoa_res.get_vectors(
    filter_by='eigvals_rel_corrected_cum',
    filter_th=0.8,
    out_format='pandas'
)
# Apply WECR K-Means
wecr = WECR(
    k=range(2, 20),
    n_rep=1000,
    p_samp=0.6,
    p_feat=0.6,
)
wecr.fit(vectors)
wecr_res = wecr.predict(vectors)
# Plot clustering metrics for each k
wecr_res.plot_metrics(figsize=(12, 7))
# Select a 'good' K (e.g., 6, 7, 8) for the consensus clustering
wecr_res.plot(k=6, figsize=(14,14))
cluster_membership = wecr_res.get_cl(k=6, with_names=True)
print('cluster_membership:')
print(cluster_membership)
cluster_membership:
PP-R002-01 0
PP-R002-01-dupl 0
PP-R017-04 4
PP-R017-04-dupl 4
PP-R019-01 5
..
R044-02 3
R044-12 3
R045-02 0
R045-06 0
R045-25 0
Length: 108, dtype: int32