
Cluster Analysis with Resampling for Validation and Exploration: stability- and generalizability-based clustering validation.


CARVE

Cluster Analysis with Resampling for Validation and Exploration

Choosing the number of clusters is hard, especially for high-dimensional biological data where standard internal clustering validation indices (CVIs) are often unreliable. CARVE measures clustering robustness through two resampling-based concepts: stability (reproducibility of cluster assignments under data subsampling) and generalizability (agreement between held-out cluster labels and predictions from a classifier trained on a subsample of the data). CARVE reports global, cluster-level, and sample-level diagnostics with visualizations, all through a scikit-learn-compatible API.
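The two concepts can be illustrated outside of CARVE with plain scikit-learn. The sketch below is not CARVE's implementation: the choice of KMeans as the clusterer, a k-nearest-neighbors classifier, a 0.7 subsample ratio, and the helper name `subsample_labels` are all assumptions made for illustration, as is the reading that held-out cluster labels come from clustering the held-out samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier

# Three well-separated synthetic clusters (illustrative data only)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
n, k = X.shape[0], 3
rng = np.random.default_rng(0)


def subsample_labels(ratio=0.7):
    """Hypothetical helper: cluster a random subsample, return its indices and labels."""
    idx = rng.choice(n, size=int(ratio * n), replace=False)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
    return idx, labels


# Stability: ARI between two subsample clusterings, restricted to shared samples
idx_a, lab_a = subsample_labels()
idx_b, lab_b = subsample_labels()
shared, pos_a, pos_b = np.intersect1d(idx_a, idx_b, return_indices=True)
stability = adjusted_rand_score(lab_a[pos_a], lab_b[pos_b])

# Generalizability: train a classifier on the subsample's cluster labels,
# then compare its predictions with a clustering of the held-out samples
train_idx, train_lab = subsample_labels()
test_idx = np.setdiff1d(np.arange(n), train_idx)
clf = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], train_lab)
predicted = clf.predict(X[test_idx])
held_out = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[test_idx])
generalizability = adjusted_rand_score(held_out, predicted)

print(f"stability ARI={stability:.2f}, generalizability ARI={generalizability:.2f}")
```

ARI is invariant to label permutations, so the two clusterings do not need matching label numbers. In practice both scores would be averaged over many resamples and computed for each candidate k.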

[Figure: CARVE overview]

Features

  • Scikit-learn-compatible API: CARVE extends BaseEstimator with a fit / get_labels / get_k workflow
  • Stability (ARI on subsample overlap) and generalizability (ARI on held-out predictions) metrics
  • Diagnostics at the global, per-cluster, and per-sample level
  • Metrics: stability and generalizability ARIs, consensus PAC, Gini, cross-entropy, and predictive accuracy
  • Selection rules: max, 1se (one-standard-error), and quantile
  • Custom spectral clustering with self-tuning affinity (based on Zelnik-Manor & Perona, Self-Tuning Spectral Clustering, NeurIPS 2004)
  • Plots: metric-over-k curves, consensus heatmaps, box plots, violin plots, and scatter plots
  • Parallel resampling via joblib (n_jobs)
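As background for the self-tuning affinity mentioned in the list above, here is a minimal sketch of the locally scaled affinity from Zelnik-Manor & Perona: A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance from point i to its K-th nearest neighbor (the paper uses K = 7). The function name `self_tuning_affinity` and its `n_neighbors` parameter are hypothetical; CARVE's own implementation may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist


def self_tuning_affinity(X, n_neighbors=7):
    """Locally scaled Gaussian affinity (illustrative sketch, not CARVE's code)."""
    d = cdist(X, X)  # pairwise Euclidean distances
    # Column 0 of each sorted row is the self-distance 0, so column
    # n_neighbors holds the distance to the n_neighbors-th nearest neighbor.
    sigma = np.sort(d, axis=1)[:, n_neighbors]
    sigma = np.maximum(sigma, 1e-12)  # guard against duplicate points
    A = np.exp(-(d ** 2) / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)  # the paper sets A_ii = 0
    return A


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
A = self_tuning_affinity(X_demo)
print(A.shape)
```

A matrix built this way could be passed to scikit-learn's `SpectralClustering(affinity="precomputed")`. The local scale lets dense and sparse regions of the data use different kernel widths instead of a single global bandwidth.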

Installation

CARVE requires Python 3.12.

pip install carve-validate

The distribution is named carve-validate; the import name is carve:

from carve import CARVE

From source (development)

git clone https://github.com/DataSlingers/CARVE.git
cd CARVE
pip install -e ".[dev]"        # linting + testing
pip install -e ".[notebooks]"  # Jupyter, Scanpy, scVI, etc.

Quick Start

from carve import CARVE
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y_true = make_blobs(n_samples=500, n_features=10, centers=5, random_state=42)

# Fit CARVE
carve = CARVE(n_clusters=10, n_resamples=120, subsample_ratio=0.7, n_jobs=4)
carve.fit(X)

# Select best k and retrieve labels
k = carve.get_k(measure="generalizability", rule="1se")
labels = carve.get_labels(measure="generalizability", rule="1se")
print(f"Selected k={k}")

See notebooks/Tutorial.ipynb for a walkthrough, and notebooks/case_studies/ for real-world analyses on scRNA-seq and mass cytometry datasets.
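The `rule="1se"` argument above refers to a one-standard-error rule. A common formulation, sketched here as an assumption rather than as CARVE's exact logic, picks the smallest k whose mean score is within one standard error of the best mean score. The function `select_k_1se` and the score values are hypothetical.

```python
import numpy as np


def select_k_1se(ks, scores):
    """ks: candidate cluster counts; scores: (len(ks), n_resamples), higher is better."""
    ks = np.asarray(ks)
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=1)
    se = scores.std(axis=1, ddof=1) / np.sqrt(scores.shape[1])
    best = np.argmax(mean)
    threshold = mean[best] - se[best]
    # Most parsimonious k that is statistically close to the best one
    return int(ks[mean >= threshold].min())


ks = [2, 3, 4, 5]
scores = np.array([
    [0.60, 0.62, 0.58, 0.60],  # k=2
    [0.90, 0.88, 0.92, 0.90],  # k=3
    [0.95, 0.85, 0.93, 0.91],  # k=4 (highest mean, but noisy)
    [0.80, 0.82, 0.78, 0.80],  # k=5
])
print(select_k_1se(ks, scores))  # prints 3
```

Here a plain maximum would select k=4, while the one-standard-error rule prefers the simpler k=3 because its mean score is within the noise of the best one.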

Visualization

# Metric curves across k
carve.plot_metric_over_n_clusters(measure="stability", rule="1se")

# Consensus heatmap for the selected solution
carve.plot_consensus_matrix(measure="generalizability", rule="1se")

# Per-cluster violin plot of Gini scores for the selected solution
carve.plot_cluster_violin(source="gini", measure="generalizability", rule="1se")

# 2D scatter with score-encoded marker size and opacity
carve.plot_cluster_scatter(source="gini", measure="generalizability", rule="1se")

All plotting methods return a matplotlib Axes object and accept save and dpi parameters for export.
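For readers unfamiliar with the consensus heatmap, the sketch below shows how a consensus matrix and a PAC (proportion of ambiguous clustering) score can be computed by hand. It is an illustration under stated assumptions, not CARVE's internals: KMeans as the clusterer, 20 resamples, a 0.7 subsample ratio, and the conventional PAC interval (0.1, 0.9) are all choices made here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)
n, n_clusters, n_resamples = X.shape[0], 3, 20
rng = np.random.default_rng(0)

together = np.zeros((n, n))  # times a pair was placed in the same cluster
sampled = np.zeros((n, n))   # times a pair appeared in the same subsample
for _ in range(n_resamples):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X[idx])
    same = (labels[:, None] == labels[None, :]).astype(float)
    sampled[np.ix_(idx, idx)] += 1.0
    together[np.ix_(idx, idx)] += same

# Consensus: co-clustering frequency among resamples that contain both points
consensus = np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)

# PAC: share of distinct pairs whose consensus is neither near 0 nor near 1
upper = consensus[np.triu_indices(n, k=1)]
pac = float(np.mean((upper > 0.1) & (upper < 0.9)))
print(f"PAC={pac:.3f}")
```

A low PAC means most pairs are either almost always or almost never clustered together, which shows up as crisp blocks in the heatmap. The matrix could be rendered with `matplotlib.pyplot.imshow(consensus)` after sorting samples by cluster label.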

Citation

If you use CARVE in your research, please cite:

Wycik, K. R., Tang, T. M., Zikry, T. M., & Allen, G. I. (2026). CARVE: Cluster Analysis with Resampling for Validation and Exploration. Zenodo. https://doi.org/10.5281/zenodo.20448965

@software{wycik2026carve,
  author    = {Wycik, Kai R. and Tang, Tiffany M. and Zikry, Tarek M. and Allen, Genevera I.},
  title     = {{CARVE}: Cluster Analysis with Resampling for Validation and Exploration},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20448965},
  url       = {https://doi.org/10.5281/zenodo.20448965}
}

Authors

  • Kai R. Wycik — Columbia University
  • Tiffany M. Tang — University of Notre Dame
  • Tarek M. Zikry — UNC Chapel Hill
  • Genevera I. Allen — Columbia University

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

This project uses Ruff for linting and formatting, and pytest for testing:

ruff check src/       # lint
ruff format src/      # format
pytest -v             # run tests
