
skclust

A comprehensive clustering toolkit with advanced tree cutting, visualization, and network analysis capabilities.

Python 3.8+ | License: Apache 2.0 | scikit-learn compatible | Beta (not production ready)

Warning: This is a beta release and has not been thoroughly tested.

Features

  • Scikit-learn compatible API for seamless integration
  • Multiple linkage methods (Ward, Complete, Average, Single, etc.)
  • Advanced tree cutting with dynamic, height-based, and max-cluster methods
  • Rich visualizations with dendrograms and metadata tracks
  • Network analysis with connectivity metrics and NetworkX integration
  • Tree export in Newick format for phylogenetic analysis
  • Distance matrix support for precomputed distances
  • Metadata tracks for biological and experimental annotations
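The Newick export listed above can be illustrated independently of skclust: scipy's `to_tree` exposes the linkage tree as nested `ClusterNode` objects, and a short recursion emits the Newick string. This is a sketch of the format only, not skclust's own export API.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Standalone sketch of the Newick format using scipy only
# (skclust's own export API is not shown here).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
Z = linkage(X, method="ward")

def to_newick(node, labels):
    """Recursively serialize a scipy ClusterNode as a Newick string."""
    if node.is_leaf():
        return labels[node.id]
    parts = []
    for child in (node.get_left(), node.get_right()):
        branch = node.dist - child.dist  # edge length from child up to this node
        parts.append(f"{to_newick(child, labels)}:{branch:.3f}")
    return "(" + ",".join(parts) + ")"

labels = [f"s{i}" for i in range(6)]
newick = to_newick(to_tree(Z), labels) + ";"
print(newick)
```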

Installation

pip install skclust

Quick Start

Hierarchical Clustering

import pandas as pd
import numpy as np
from sklearn.datasets import make_blobs
from skclust import HierarchicalClustering

# Generate sample data
X, y = make_blobs(n_samples=100, centers=4, random_state=42)
X_df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])

# Perform hierarchical clustering
hc = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=5
)

# Fit and get cluster labels
labels = hc.fit_transform(X_df)
print(f"Found {hc.n_clusters_} clusters")

# Plot dendrogram with clusters
fig, axes = hc.plot(figsize=(12, 6), show_clusters=True)

Representative Sampling

from skclust import KMeansRepresentativeSampler

# Create representative test set (10% of data)
sampler = KMeansRepresentativeSampler(
    sampling_size=0.1,
    stratify=True,  # Maintain class proportions
    method='minibatch'
)

# Get train/test split
X_train, X_test, y_train, y_test = sampler.fit(X_df, y).get_train_test_split(X_df, y)

print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X_df)*100:.1f}%)")

Advanced Usage

Adding Metadata Tracks

# Add continuous metadata track
sample_scores = pd.Series(np.random.randn(100), index=X_df.index)
hc.add_track('Quality Score', sample_scores, track_type='continuous')

# Add categorical metadata track
sample_groups = pd.Series((['A', 'B', 'C'] * 34)[:100], index=X_df.index)  # trim to 100 samples
hc.add_track('Group', sample_groups, track_type='categorical')

# Plot with metadata tracks
fig, axes = hc.plot(show_tracks=True, figsize=(12, 8))

Custom Tree Cutting

# Cut by height
hc_height = HierarchicalClustering(
    method='ward',
    cut_method='height',
    cut_threshold=50.0
)
labels_height = hc_height.fit_transform(X_df)

# Cut by number of clusters
hc_maxclust = HierarchicalClustering(
    method='complete',
    cut_method='maxclust',
    cut_threshold=5  # Force exactly 5 clusters
)
labels_maxclust = hc_maxclust.fit_transform(X_df)

Distance Matrix Input

from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
distances = pdist(X_df, metric='cosine')
distance_matrix = pd.DataFrame(squareform(distances), 
                              index=X_df.index, 
                              columns=X_df.index)

# Cluster using precomputed distances
hc_custom = HierarchicalClustering(method='average')
labels_custom = hc_custom.fit_transform(distance_matrix)

Stratified Representative Sampling

# Enhanced stratified sampling with minority class boosting
sampler_enhanced = KMeansRepresentativeSampler(
    sampling_size=0.15,
    stratify=True,
    coverage_boost=2.0,  # Boost minority classes
    min_clusters_per_class=3,  # Ensure minimum representation
    method='kmeans'
)

X_train, X_test, y_train, y_test = sampler_enhanced.fit(X_df, y).get_train_test_split(X_df, y)

# Check class balance preservation
print("Original class distribution:")
print(pd.Series(y).value_counts().sort_index())
print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts().sort_index())

API Reference

HierarchicalClustering

Parameters:

  • method: Linkage method ('ward', 'complete', 'average', 'single', 'centroid', 'median', 'weighted')
  • metric: Distance metric for computing pairwise distances
  • cut_method: Tree cutting method ('dynamic', 'height', 'maxclust')
  • min_cluster_size: Minimum cluster size for dynamic cutting
  • deep_split: Deep split parameter for dynamic cutting (0-4)
  • cut_threshold: Threshold for height/maxclust cutting
  • cluster_prefix: String prefix for cluster labels (e.g., "C" → "C1", "C2")
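Assuming the `'height'` and `'maxclust'` cut methods map onto scipy's `fcluster` criteria (an assumption suggested by the parameter names), the two cuts can be reproduced directly with scipy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two well-separated blobs of 20 points each
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
Z = linkage(X, method="ward")

# cut_method='height': cut every branch at a fixed cophenetic distance
labels_height = fcluster(Z, t=10.0, criterion="distance")

# cut_method='maxclust': ask for (at most) a fixed number of clusters
labels_max = fcluster(Z, t=2, criterion="maxclust")

print(len(set(labels_height)), len(set(labels_max)))
```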

Key Methods:

  • fit(X): Fit hierarchical clustering to data
  • transform(): Return cluster labels
  • add_track(name, data, track_type): Add metadata track for visualization
  • plot(): Generate dendrogram with optional tracks and clusters
  • summary(): Print clustering summary statistics

KMeansRepresentativeSampler

Parameters:

  • sampling_size: Proportion of data for test set (0.0-1.0)
  • stratify: Whether to maintain class proportions
  • method: Clustering method ('minibatch', 'kmeans')
  • coverage_boost: Boost factor for minority classes (>1.0)
  • min_clusters_per_class: Minimum clusters per class
  • batch_size: Batch size for MiniBatchKMeans

Key Methods:

  • fit(X, y): Fit sampler and identify representatives
  • transform(X): Return representative samples
  • get_train_test_split(X, y): Get train/test split
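The idea behind representative sampling — cluster with k-means, then keep the point nearest each centroid — can be sketched with scikit-learn alone. This illustrates the concept, not skclust's internal implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# One representative per k-means cluster: the point nearest each centroid
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)

X_test = X[rep_idx]                      # representative "test" samples
X_train = np.delete(X, rep_idx, axis=0)  # remainder becomes the train set
print(X_test.shape, X_train.shape)
```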

Examples with Real Data

Iris Dataset

from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = pd.Series(iris.target, name='species')

# Hierarchical clustering
hc_iris = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=10,
    cluster_prefix='Cluster_'
)

clusters = hc_iris.fit_transform(X_iris)

# Add species information as track
species_names = pd.Series([iris.target_names[i] for i in y_iris], index=X_iris.index)
hc_iris.add_track('True Species', species_names, track_type='categorical')

# Plot results
fig, axes = hc_iris.plot(show_clusters=True, show_tracks=True, figsize=(15, 8))

Creating Balanced Test Sets

# Create representative test set maintaining species balance
sampler_iris = KMeansRepresentativeSampler(
    sampling_size=0.2,  # 20% test set
    stratify=True,
    coverage_boost=1.0,  # Equal representation
    method='kmeans',
    random_state=42
)

X_train, X_test, y_train, y_test = sampler_iris.fit(X_iris, y_iris).get_train_test_split(X_iris, y_iris)

print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Representative indices: {sampler_iris.representative_indices_[:10].tolist()}")

Dependencies

Required

  • numpy
  • pandas
  • scikit-learn
  • scipy
  • matplotlib
  • seaborn
  • networkx
  • loguru

Optional (for enhanced functionality)

  • dynamicTreeCut (dynamic tree cutting)
  • skbio (tree representations)
  • fastcluster (faster linkage computation)
  • ensemble_networkx (network analysis)
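A quick way to see which of these optional extras are importable in the current environment, using only the standard library:

```python
import importlib.util

# Report which optional skclust extras are importable here
optional = ["dynamicTreeCut", "skbio", "fastcluster", "ensemble_networkx"]
available = {name: importlib.util.find_spec(name) is not None for name in optional}
for name, ok in available.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```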

Author

Josh L. Espinoza

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Original Implementation

This package is based on the hierarchical clustering implementation originally developed in the Soothsayer framework:

Espinoza JL, Dupont CL, O'Rourke A, Beyhan S, Morales P, et al. (2021) Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLOS Computational Biology 17(3): e1008857. https://doi.org/10.1371/journal.pcbi.1008857

The original implementation provided the foundation for the hierarchical clustering algorithms, metadata track visualization, and eigenprofile analysis features in this package.

Acknowledgments

  • Built on top of scipy, scikit-learn, and networkx
  • Original implementation developed in the Soothsayer framework
  • Inspired by WGCNA and other biological clustering tools
  • Dynamic tree cutting algorithms from the dynamicTreeCut package

Citation

If you use this package in your research, please cite:

Original Soothsayer implementation:

@article{espinoza2021predicting,
  title={Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach},
  author={Espinoza, Josh L and Dupont, Chris L and O'Rourke, Aubrie and Beyhan, Seherzada and Morales, Paula and others},
  journal={PLOS Computational Biology},
  volume={17},
  number={3},
  pages={e1008857},
  year={2021},
  publisher={Public Library of Science},
  doi={10.1371/journal.pcbi.1008857},
  url={https://doi.org/10.1371/journal.pcbi.1008857}
}
