Python package for evaluating clustering stability through the use of repeated stochastic clustering and element-centric evaluation metrics.

crowflow

crowflow is a Python package designed for assessing clustering stability through repeated stochastic clustering. It is compatible with any clustering algorithm that outputs labels or implements a fit or fit_predict method, provided it includes stochasticity (i.e., allows setting a seed or random_state). By running clustering multiple times with different seeds, crowflow quantifies clustering consistency using element-centric similarity (ECS) and element-centric consistency (ECC), offering insights into the robustness and reproducibility of cluster assignments. The package enables users to optimize feature subsets, fine-tune clustering parameters, and evaluate clustering robustness against perturbations.

crowflow generalizes the ClustAssessPy package, which focuses on parameter selection for community-detection clustering in single-cell analysis. It extends this approach to any clustering task, enabling a data-driven identification of robust and reproducible clustering solutions across diverse applications.
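To make the two metrics concrete, here is a minimal pure-Python sketch of ECS and ECC for hard (non-overlapping) partitions. It follows the usual element-centric definition, in which each element gets a personalized-PageRank affinity vector over its cluster, ECS is one minus a scaled L1 distance between two such vectors, and ECC is the mean pairwise ECS across runs. The function names (`affinity`, `ecs`, `ecc`) and the damping value `alpha=0.9` are illustrative assumptions, not crowflow's API; the package's own implementation may differ in detail.

```python
from itertools import combinations

def affinity(labels, i, alpha=0.9):
    """Affinity vector of element i for a hard partition (closed form)."""
    members = [j for j, lab in enumerate(labels) if lab == labels[i]]
    vec = [0.0] * len(labels)
    for j in members:
        vec[j] = alpha / len(members)  # mass spread evenly over the cluster
    vec[i] += 1.0 - alpha              # restart mass stays on the element itself
    return vec

def ecs(labels_a, labels_b, alpha=0.9):
    """Per-element similarity between two partitions of the same elements."""
    scores = []
    for i in range(len(labels_a)):
        pa = affinity(labels_a, i, alpha)
        pb = affinity(labels_b, i, alpha)
        l1 = sum(abs(x - y) for x, y in zip(pa, pb))
        scores.append(1.0 - l1 / (2.0 * alpha))
    return scores

def ecc(runs, alpha=0.9):
    """Per-element consistency: mean ECS over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    totals = [0.0] * len(runs[0])
    for a, b in pairs:
        for i, score in enumerate(ecs(a, b, alpha)):
            totals[i] += score
    return [t / len(pairs) for t in totals]

runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 1, 1, 1, 1],  # element 2 has moved to the other cluster
]
per_element = ecc(runs)
print([round(s, 3) for s in per_element])
```

Because the score depends only on which elements share a cluster, swapping label names between runs does not lower it, and the element that changes cluster receives the lowest consistency.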

Class Summaries

StochasticClusteringRunner

Runs a stochastic clustering algorithm multiple times with different random seeds and evaluates the stability of the results using ECC. It reports stability for each individual element and provides a majority-voting label for each.

Example

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from crowflow import StochasticClusteringRunner

np.random.seed(42)
df = pd.DataFrame(np.random.normal(size=(200, 10)), columns=[f"feature_{i+1}" for i in range(10)])
runner = StochasticClusteringRunner(KMeans, "random_state", n_runs=30, verbose=True, n_clusters=3)
results = runner.run(df)
print("Majority Voting Labels:", results["majority_voting_labels"])
print("ECC:", results["ecc"])
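Majority voting across runs only makes sense once cluster labels are comparable, since the same cluster can be called 0 in one run and 1 in another. The sketch below shows one way this could work: align every run to the first one, then take the most common label per element. The helpers `align_to_reference` and `majority_vote` are hypothetical names, and the brute-force alignment is an assumption chosen for clarity; it is not necessarily how crowflow derives its majority voting labels.

```python
from collections import Counter
from itertools import permutations

def align_to_reference(reference, labels):
    """Relabel `labels` to agree with `reference` on as many elements as possible."""
    ids = sorted(set(labels))
    targets = sorted(set(reference) | set(labels))
    best_map, best_hits = None, -1
    # Brute force over label permutations; only viable for a handful of clusters.
    for perm in permutations(targets, len(ids)):
        mapping = dict(zip(ids, perm))
        hits = sum(mapping[lab] == ref for lab, ref in zip(labels, reference))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return [best_map[lab] for lab in labels]

def majority_vote(runs):
    """Most frequent aligned label per element, using the first run as reference."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]

runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition, labels swapped
    [0, 0, 1, 1, 1, 1],  # one element assigned differently
]
consensus = majority_vote(runs)
print(consensus)
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the permutation loop.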

GeneticAlgorithmFeatureSelector

Uses a genetic algorithm to iteratively optimize feature selection for clustering stability. It repeatedly applies stochastic clustering with different feature subsets and evaluates stability using ECC. The algorithm evolves through selection, crossover, and mutation, converging on the feature set that maximizes clustering robustness.

Example

from crowflow import GeneticAlgorithmFeatureSelector

np.random.seed(42)
ga_fs = GeneticAlgorithmFeatureSelector(KMeans, "random_state", verbose=True, n_generations_no_change=5)
ga_results = ga_fs.run(df)
print("Best Features:", ga_results["best_features"])
print("Best ECC Score:", ga_results["best_ecc"])
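The selection, crossover, and mutation loop described above can be sketched in a few lines. This is a generic genetic algorithm over boolean feature masks, not crowflow's implementation: `evolve`, its parameters, and `toy_fitness` are hypothetical, and the toy fitness merely stands in for "ECC of repeated clustering on the selected columns". Early stopping (as suggested by `n_generations_no_change`) is not modeled here.

```python
import random

def evolve(fitness, n_features, pop_size=20, n_generations=30,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Each individual is a boolean mask over the feature columns.
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]       # selection: keep the fitter half
        children = [ranked[0]]                  # elitism: carry the best forward
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each gene with a small probability.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

# Toy stand-in for a stability score: reward three "informative" columns,
# penalize the rest, and reject the empty feature set.
INFORMATIVE = {0, 1, 2}

def toy_fitness(mask):
    if not any(mask):
        return -1.0
    hits = sum(1 for i, keep in enumerate(mask) if keep and i in INFORMATIVE)
    noise = sum(1 for i, keep in enumerate(mask) if keep and i not in INFORMATIVE)
    return hits - 0.5 * noise

best = evolve(toy_fitness, n_features=10)
print([i for i, keep in enumerate(best) if keep], toy_fitness(best))
```

In a real run each fitness evaluation involves many clusterings, so population size and generation count dominate the runtime.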

ParameterOptimizer

Systematically tunes each hyperparameter separately by performing repeated clustering and evaluating stability using ECC.

Example

from crowflow import ParameterOptimizer

parameter_optimizer = ParameterOptimizer(KMeans, "random_state", {"n_clusters": np.arange(2, 5, 1)}, n_runs=30)
results_df, scr_results = parameter_optimizer.run(df)

ParameterSearcher

Evaluates every combination of the specified parameter values (exhaustive grid search), running repeated clustering and computing ECC for each combination. The goal is to find the configuration (set of hyperparameter values) that yields the most stable clustering results.

Example

from crowflow import ParameterSearcher

param_grid = {"n_clusters": np.arange(2, 5, 1), "init": ["k-means++", "random"]}
parameter_searcher = ParameterSearcher(KMeans, "random_state", param_grid, n_runs=30)
results_df, scr_results = parameter_searcher.run(df)
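The difference between tuning each parameter separately and searching the full grid shows up in how many candidate configurations are evaluated. The sketch below only enumerates candidates; it does not call crowflow, and the helper names are hypothetical. It assumes the one-at-a-time strategy varies a single parameter while leaving the others at their defaults.

```python
from itertools import product

param_grid = {"n_clusters": [2, 3, 4], "init": ["k-means++", "random"]}

def grid_candidates(grid):
    """Every combination of values: the Cartesian product of the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

def one_at_a_time_candidates(grid):
    """One candidate per (parameter, value) pair; other parameters stay at defaults."""
    return [{name: value} for name, values in grid.items() for value in values]

grid = grid_candidates(param_grid)
separate = one_at_a_time_candidates(param_grid)
print(len(grid), len(separate))  # 6 5
```

The grid grows multiplicatively with each added parameter, while the separate strategy grows additively, so with `n_runs=30` the cost gap widens quickly.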

KFoldClusteringValidator

Evaluates how stable clustering assignments remain across different data partitions by comparing clustering results on k-fold subsets with those from the full dataset. ECS is used to quantify similarity/stability between fold-level clustering and the baseline (full dataset).

Example

from crowflow import KFoldClusteringValidator

kfold_validator = KFoldClusteringValidator(KMeans, "random_state", k_folds=5, n_runs=30, n_clusters=3, init="random")
baseline_results, kfolds_robustness_results = kfold_validator.run(df)
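The idea of comparing fold-level clusterings against a full-data baseline can be illustrated with scikit-learn alone. This sketch makes two assumptions that are not taken from crowflow: each fold-level clustering uses the data with one fold held out, and the adjusted Rand index is swapped in for ECS as the agreement measure. The synthetic data are three well-separated Gaussian blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Baseline: cluster the full dataset once.
baseline = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = []
for kept_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Re-cluster with one fold held out, then compare on the shared elements.
    fold_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[kept_idx])
    scores.append(adjusted_rand_score(baseline[kept_idx], fold_labels))

print([round(s, 3) for s in scores])
```

Scores near 1 across folds indicate that the partition does not depend on any particular subset of the data.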

PerturbationRobustnessTester

Tests how stable clustering results are when features are altered/perturbed. The user must provide a perturbation function, which modifies the dataset before clustering is re-run. Stability is assessed using Element-Centric Similarity (ECS) between the baseline clustering and perturbation-induced clusterings.

Example

from crowflow import PerturbationRobustnessTester

def shuffle_features(X):
    X_shuffled = X.copy()
    for col in X_shuffled.columns:
        X_shuffled[col] = np.random.permutation(X_shuffled[col])
    return X_shuffled

perturb_tester = PerturbationRobustnessTester(KMeans, "random_state", perturbation_func=shuffle_features, n_perturbations=5, n_runs=30)
perturb_results = perturb_tester.run(df)
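Shuffling every column destroys all structure, so a milder perturbation is often more informative. As an alternative, the sketch below builds a function with the same shape as `shuffle_features` (DataFrame in, DataFrame out) that adds Gaussian noise scaled to each column's standard deviation. The factory name `make_gaussian_noise_perturbation` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def make_gaussian_noise_perturbation(scale=0.1, seed=None):
    """Return a function X -> X + noise, with noise scaled per column."""
    rng = np.random.default_rng(seed)

    def add_noise(X):
        # Scale by each column's standard deviation so all features are
        # perturbed by a comparable relative amount.
        noise = rng.normal(size=X.shape) * X.std(axis=0).to_numpy() * scale
        return X + noise  # returns a new DataFrame; X is left unchanged

    return add_noise

data_rng = np.random.default_rng(42)
df = pd.DataFrame(data_rng.normal(size=(200, 10)),
                  columns=[f"feature_{i+1}" for i in range(10)])

add_noise = make_gaussian_noise_perturbation(scale=0.1, seed=0)
perturbed = add_noise(df)
print(perturbed.shape)
```

The returned `add_noise` takes a single argument, so it could be passed wherever a one-argument perturbation function is expected.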

Installation

crowflow requires Python 3.7 or newer.

Dependencies

  • numpy
  • matplotlib
  • scikit-learn
  • seaborn
  • plotnine
  • ClustAssessPy

User Installation

We recommend installing crowflow in a virtual environment (venv or Conda).

pip install crowflow

Tutorials

The package can be applied to any clustering task (as long as the clustering algorithm used is stochastic).

In the cuomo example, we show how to use crowflow with GaussianMixture from scikit-learn. We first assess the clustering stability obtained with the default parameter values, then search for a configuration (hyperparameter values) that gives more stable results, and finally optimize that configuration further through feature selection.

The fine food reviews example shows how to combine OpenAI models (an embedding model and 4o-mini) and KMeans from scikit-learn with crowflow to extract insights from robust clusters and generate informative labels based on them.

License

This package is released under the MIT License.

Developed by Rafael Kollyfas (rk720@cam.ac.uk), Core Bioinformatics (Mohorianu Lab) group, University of Cambridge. February 2025.
