# skclust
A comprehensive clustering toolkit with advanced tree cutting, visualization, and network analysis capabilities.
> **Warning:** This is a beta release and has not been thoroughly tested.
## Features
- Scikit-learn compatible API for seamless integration
- Multiple linkage methods (Ward, Complete, Average, Single, etc.)
- Advanced tree cutting with dynamic, height-based, and max-cluster methods
- Rich visualizations with dendrograms and metadata tracks
- Network analysis with connectivity metrics and NetworkX integration
- Tree export in Newick format for phylogenetic analysis
- Distance matrix support for precomputed distances
- Metadata tracks for biological and experimental annotations
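The Newick export feature can be understood with a small sketch: SciPy's `to_tree` exposes a dendrogram as a binary node tree that can be serialized recursively. This is an illustration built on plain SciPy, not skclust's actual implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def to_newick(node, parent_dist, labels):
    """Serialize a SciPy ClusterNode recursively as a Newick string."""
    if node.is_leaf():
        return f"{labels[node.id]}:{parent_dist - node.dist:.2f}"
    left = to_newick(node.get_left(), node.dist, labels)
    right = to_newick(node.get_right(), node.dist, labels)
    return f"({left},{right}):{parent_dist - node.dist:.2f}"

X = np.random.RandomState(0).randn(5, 2)
Z = linkage(X, method="ward")          # (n-1, 4) linkage matrix
root = to_tree(Z)                      # root ClusterNode of the dendrogram
newick = to_newick(root, root.dist, [f"s{i}" for i in range(5)]) + ";"
print(newick)
```

Branch lengths here are the difference between parent and child merge heights, which is the usual convention for dendrogram-derived trees.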
## Installation

```bash
pip install skclust
```
## Quick Start

### Hierarchical Clustering

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from skclust import HierarchicalClustering

# Generate sample data
X, y = make_blobs(n_samples=100, centers=4, random_state=42)
X_df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Perform hierarchical clustering
hc = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=5,
)

# Fit and get cluster labels
labels = hc.fit_transform(X_df)
print(f"Found {hc.n_clusters_} clusters")

# Plot dendrogram with clusters
fig, axes = hc.plot(figsize=(12, 6), show_clusters=True)
```
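Whatever labels a clustering run produces can be sanity-checked with the silhouette score, a library-agnostic metric where values near 1 indicate compact, well-separated clusters. The snippet below regenerates the example data and uses a plain-SciPy Ward cut rather than calling skclust itself:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
# Ward linkage, then cut into at most 4 flat clusters
labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
score = silhouette_score(X, labels)
print(f"Silhouette score: {score:.2f}")
```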
### Representative Sampling

```python
from skclust import KMeansRepresentativeSampler

# Create representative test set (10% of data)
sampler = KMeansRepresentativeSampler(
    sampling_size=0.1,
    stratify=True,  # Maintain class proportions
    method='minibatch',
)

# Get train/test split
X_train, X_test, y_train, y_test = sampler.fit(X_df, y).get_train_test_split(X_df, y)
print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X_df)*100:.1f}%)")
```
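The idea behind representative sampling can be sketched in plain scikit-learn: cluster the data into as many groups as desired test samples, then take the sample nearest each centroid as a representative. This is a conceptual illustration only, not the actual `KMeansRepresentativeSampler` logic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, y = make_blobs(n_samples=100, centers=4, random_state=42)
n_test = int(0.1 * len(X))  # target a 10% test set
km = KMeans(n_clusters=n_test, n_init=10, random_state=42).fit(X)
# Index of the sample closest to each centroid (deduplicated)
rep_idx = np.unique(pairwise_distances_argmin(km.cluster_centers_, X))
test_mask = np.zeros(len(X), dtype=bool)
test_mask[rep_idx] = True
X_test, X_train = X[test_mask], X[~test_mask]
print(f"{len(X_test)} representatives, {len(X_train)} training samples")
```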
## Advanced Usage

### Adding Metadata Tracks

```python
# Add continuous metadata track
sample_scores = pd.Series(np.random.randn(100), index=X_df.index)
hc.add_track('Quality Score', sample_scores, track_type='continuous')

# Add categorical metadata track (trim the repeated pattern to 100 values)
sample_groups = pd.Series((['A', 'B', 'C'] * 34)[:100], index=X_df.index)
hc.add_track('Group', sample_groups, track_type='categorical')

# Plot with metadata tracks
fig, axes = hc.plot(show_tracks=True, figsize=(12, 8))
```
### Custom Tree Cutting

```python
# Cut by height
hc_height = HierarchicalClustering(
    method='ward',
    cut_method='height',
    cut_threshold=50.0,
)
labels_height = hc_height.fit_transform(X_df)

# Cut by number of clusters
hc_maxclust = HierarchicalClustering(
    method='complete',
    cut_method='maxclust',
    cut_threshold=5,  # Force exactly 5 clusters
)
labels_maxclust = hc_maxclust.fit_transform(X_df)
```
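In plain SciPy terms, these two strategies roughly correspond to `fcluster` with `criterion='distance'` (cut at a height) and `criterion='maxclust'` (cap the cluster count). A minimal sketch, independent of skclust:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
Z = linkage(X, method="ward")
labels_height = fcluster(Z, t=50.0, criterion="distance")  # cut merges above height 50
labels_maxclust = fcluster(Z, t=5, criterion="maxclust")   # at most 5 flat clusters
print(len(np.unique(labels_height)), len(np.unique(labels_maxclust)))
```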
### Distance Matrix Input

```python
from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
distances = pdist(X_df, metric='cosine')
distance_matrix = pd.DataFrame(
    squareform(distances),
    index=X_df.index,
    columns=X_df.index,
)

# Cluster using precomputed distances
hc_custom = HierarchicalClustering(method='average')
labels_custom = hc_custom.fit_transform(distance_matrix)
```
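A side note on the underlying SciPy machinery: `linkage` also accepts the condensed distance vector from `pdist` directly, so the square-matrix round-trip is only needed when a labeled DataFrame is desired. A sketch in plain SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
condensed = pdist(X, metric="cosine")     # length n*(n-1)/2 vector
Z = linkage(condensed, method="average")  # average linkage on precomputed distances
labels = fcluster(Z, t=3, criterion="maxclust")
print(condensed.shape[0], len(np.unique(labels)))
```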
### Stratified Representative Sampling

```python
# Enhanced stratified sampling with minority class boosting
sampler_enhanced = KMeansRepresentativeSampler(
    sampling_size=0.15,
    stratify=True,
    coverage_boost=2.0,        # Boost minority classes
    min_clusters_per_class=3,  # Ensure minimum representation
    method='kmeans',
)
X_train, X_test, y_train, y_test = sampler_enhanced.fit(X_df, y).get_train_test_split(X_df, y)

# Check class balance preservation
print("Original class distribution:")
print(pd.Series(y).value_counts().sort_index())
print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts().sort_index())
```
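Class-balance preservation can also be quantified rather than eyeballed by comparing normalized class frequencies between the full data and the test set. The sketch below uses scikit-learn's stratified `train_test_split` as a stand-in for the sampler:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=4, random_state=42)
_, _, _, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
orig = pd.Series(y).value_counts(normalize=True).sort_index()
test = pd.Series(y_test).value_counts(normalize=True).sort_index()
max_drift = (orig - test).abs().max()  # 0.0 means proportions match exactly
print(f"Max class-proportion drift: {max_drift:.3f}")
```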
## API Reference

### HierarchicalClustering

**Parameters:**

- `method`: Linkage method (`'ward'`, `'complete'`, `'average'`, `'single'`, `'centroid'`, `'median'`, `'weighted'`)
- `metric`: Distance metric for computing pairwise distances
- `cut_method`: Tree cutting method (`'dynamic'`, `'height'`, `'maxclust'`)
- `min_cluster_size`: Minimum cluster size for dynamic cutting
- `deep_split`: Deep split parameter for dynamic cutting (0-4)
- `cut_threshold`: Threshold for height/maxclust cutting
- `cluster_prefix`: String prefix for cluster labels (e.g., `"C"` → `"C1"`, `"C2"`)

**Key Methods:**

- `fit(X)`: Fit hierarchical clustering to data
- `transform()`: Return cluster labels
- `add_track(name, data, track_type)`: Add metadata track for visualization
- `plot()`: Generate dendrogram with optional tracks and clusters
- `summary()`: Print clustering summary statistics
### KMeansRepresentativeSampler

**Parameters:**

- `sampling_size`: Proportion of data for test set (0.0-1.0)
- `stratify`: Whether to maintain class proportions
- `method`: Clustering method (`'minibatch'`, `'kmeans'`)
- `coverage_boost`: Boost factor for minority classes (>1.0)
- `min_clusters_per_class`: Minimum clusters per class
- `batch_size`: Batch size for MiniBatchKMeans

**Key Methods:**

- `fit(X, y)`: Fit sampler and identify representatives
- `transform(X)`: Return representative samples
- `get_train_test_split(X, y)`: Get train/test split
## Examples with Real Data

### Iris Dataset

```python
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = pd.Series(iris.target, name='species')

# Hierarchical clustering
hc_iris = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=10,
    cluster_prefix='Cluster_',
)
clusters = hc_iris.fit_transform(X_iris)

# Add species information as track
species_names = pd.Series([iris.target_names[i] for i in y_iris], index=X_iris.index)
hc_iris.add_track('True Species', species_names, track_type='categorical')

# Plot results
fig, axes = hc_iris.plot(show_clusters=True, show_tracks=True, figsize=(15, 8))
```
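Agreement between recovered clusters and the known species labels can be quantified with the adjusted Rand index (1.0 means perfect agreement, 0.0 means chance level). The sketch below uses a plain-SciPy Ward cut in place of skclust:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
# Ward linkage on the four iris features, cut into 3 flat clusters
labels = fcluster(linkage(iris.data, method="ward"), t=3, criterion="maxclust")
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index vs. true species: {ari:.2f}")
```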
### Creating Balanced Test Sets

```python
# Create representative test set maintaining species balance
sampler_iris = KMeansRepresentativeSampler(
    sampling_size=0.2,   # 20% test set
    stratify=True,
    coverage_boost=1.0,  # Equal representation
    method='kmeans',
    random_state=42,
)
X_train, X_test, y_train, y_test = sampler_iris.fit(X_iris, y_iris).get_train_test_split(X_iris, y_iris)
print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Representative indices: {sampler_iris.representative_indices_[:10].tolist()}")
```
## Dependencies

### Required
- numpy
- pandas
- scikit-learn
- scipy
- matplotlib
- seaborn
- networkx
- loguru
### Optional (for enhanced functionality)
- dynamicTreeCut (dynamic tree cutting)
- skbio (tree representations)
- fastcluster (faster linkage computation)
- ensemble_networkx (network analysis)
## Author

Josh L. Espinoza

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## Original Implementation
This package is based on the hierarchical clustering implementation originally developed in the Soothsayer framework:
Espinoza JL, Dupont CL, O'Rourke A, Beyhan S, Morales P, et al. (2021) Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLOS Computational Biology 17(3): e1008857. https://doi.org/10.1371/journal.pcbi.1008857
The original implementation provided the foundation for the hierarchical clustering algorithms, metadata track visualization, and eigenprofile analysis features in this package.
## Acknowledgments
- Built on top of scipy, scikit-learn, and networkx
- Original implementation developed in the Soothsayer framework
- Inspired by WGCNA and other biological clustering tools
- Dynamic tree cutting algorithms from the dynamicTreeCut package
## Support
- Documentation: [Link to docs]
- Issues: GitHub Issues
- Discussions: GitHub Discussions
## Citation

If you use this package in your research, please cite the original Soothsayer implementation:

```bibtex
@article{espinoza2021predicting,
  title={Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach},
  author={Espinoza, Josh L and Dupont, Chris L and O'Rourke, Aubrie and Beyhan, Seherzada and Morales, Paula and others},
  journal={PLOS Computational Biology},
  volume={17},
  number={3},
  pages={e1008857},
  year={2021},
  publisher={Public Library of Science},
  doi={10.1371/journal.pcbi.1008857},
  url={https://doi.org/10.1371/journal.pcbi.1008857}
}
```