Cluster Analysis with Resampling for Validation and Exploration: stability- and generalizability-based clustering validation.
CARVE
Cluster Analysis with Resampling for Validation and Exploration
Choosing the number of clusters is hard, especially for high-dimensional biological data where standard internal clustering validation indices (CVIs) are often unreliable. CARVE measures clustering robustness through two resampling-based concepts: stability (reproducibility of cluster assignments under data subsampling) and generalizability (agreement between held-out cluster labels and predictions from a classifier trained on a subsample of the data). CARVE reports global, cluster-level, and sample-level diagnostics with visualizations, all through a scikit-learn-compatible API.
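To make the two concepts concrete, here is a minimal sketch built only on scikit-learn. It is not CARVE's implementation. The helper names `stability_ari` and `generalizability_ari`, the use of k-means as the clusterer, the k-nearest-neighbors classifier, and the choice to cluster the held-out points independently are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier


def stability_ari(X, k, ratio=0.7, seed=0):
    """ARI between two subsample clusterings, scored on the points they share."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(ratio * n)
    idx_a = rng.choice(n, size=m, replace=False)
    idx_b = rng.choice(n, size=m, replace=False)
    # -1 marks points that were not drawn in a given subsample
    labels_a = np.full(n, -1)
    labels_b = np.full(n, -1)
    labels_a[idx_a] = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx_a])
    labels_b[idx_b] = KMeans(n_clusters=k, n_init=10, random_state=seed + 1).fit_predict(X[idx_b])
    shared = (labels_a >= 0) & (labels_b >= 0)
    return adjusted_rand_score(labels_a[shared], labels_b[shared])


def generalizability_ari(X, k, ratio=0.7, seed=0):
    """ARI between held-out cluster labels and classifier predictions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    m = int(ratio * n)
    train, test = perm[:m], perm[m:]
    # Cluster the subsample and, separately, the held-out points (an assumption)
    train_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[train])
    test_labels = KMeans(n_clusters=k, n_init=10, random_state=seed + 1).fit_predict(X[test])
    # Train a classifier on the subsample's cluster labels, then predict held-out points
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[train], train_labels)
    return adjusted_rand_score(test_labels, clf.predict(X[test]))


# Three well-separated blobs: both scores should be high at k=3
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
print(stability_ari(X, k=3), generalizability_ari(X, k=3))
```

ARI is invariant to label permutations, so the two clusterings do not need to be aligned before they are compared. In practice both scores would be averaged over many resamples rather than computed once.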
Features
- Scikit-learn-compatible API: `CARVE` extends `BaseEstimator` with a `fit` / `get_labels` / `get_k` workflow
- Stability (ARI on subsample overlap) and generalizability (ARI on held-out predictions) metrics
- Diagnostics at the global, per-cluster, and per-sample level
- Metrics: stability and generalizability ARIs, consensus PAC, Gini, cross-entropy, and predictive accuracy
- Selection rules: `max`, `1se` (one-standard-error), and `quantile`
- Custom spectral clustering with self-tuning affinity (based on Zelnik-Manor & Perona, Self-Tuning Spectral Clustering, NeurIPS 2004)
- Plots: metric-over-k curves, consensus heatmaps, box plots, violin plots, and scatter plots
- Parallel resampling via joblib (`n_jobs`)
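For readers unfamiliar with the self-tuning affinity mentioned above, the following sketch shows the local-scaling idea from Zelnik-Manor & Perona: each point gets its own scale, the distance to its K-th nearest neighbor, and the affinity between two points is a Gaussian kernel scaled by the product of their local scales. The function name `self_tuning_affinity` is hypothetical, and this is not necessarily how CARVE's own spectral clustering is written.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs


def self_tuning_affinity(X, n_neighbors=7):
    """Locally scaled affinity: A_ij = exp(-d_ij^2 / (sigma_i * sigma_j))."""
    # Dense pairwise Euclidean distances (O(n^2) memory, fine for small n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sigma_i: distance from point i to its n_neighbors-th nearest neighbor.
    # Column 0 of each sorted row is the point itself at distance 0.
    sigma = np.sort(d, axis=1)[:, n_neighbors]
    sigma = np.maximum(sigma, 1e-12)  # guard against duplicate points
    A = np.exp(-(d ** 2) / np.outer(sigma, sigma))
    np.fill_diagonal(A, 0.0)  # the paper sets A_ii = 0
    return A


# Usage: pass the affinity to scikit-learn's spectral clustering
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
A = self_tuning_affinity(X)
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels.shape)
```

The default `n_neighbors=7` follows the value suggested in the paper. Because the scale adapts per point, dense and sparse regions of the data can share one affinity matrix without a single hand-tuned bandwidth.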
Installation
CARVE requires Python 3.12.
```shell
pip install carve-validate
```
The distribution is named `carve-validate`; the import name is `carve`:
```python
from carve import CARVE
```
From source (development)
```shell
git clone https://github.com/DataSlingers/CARVE.git
cd CARVE
pip install -e ".[dev]"        # linting + testing
pip install -e ".[notebooks]"  # Jupyter, Scanpy, scVI, etc.
```
Quick Start
```python
from carve import CARVE
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y_true = make_blobs(n_samples=500, n_features=10, centers=5, random_state=42)

# Fit CARVE
carve = CARVE(n_clusters=10, n_resamples=120, subsample_ratio=0.7, n_jobs=4)
carve.fit(X)

# Select best k and retrieve labels
k = carve.get_k(measure="generalizability", rule="1se")
labels = carve.get_labels(measure="generalizability", rule="1se")
print(f"Selected k={k}")
```
See notebooks/Tutorial.ipynb for a walkthrough, and notebooks/case_studies/ for real-world analyses on scRNA-seq and mass cytometry datasets.
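The `1se` rule used above is a one-standard-error rule. As a rough illustration of the general idea, and not of CARVE's exact behavior, the sketch below picks the smallest k whose mean score lies within one standard error of the best mean score. The function name `select_k_one_se`, the "smallest k" tie-breaking direction, and the example scores are all assumptions.

```python
import numpy as np


def select_k_one_se(ks, scores):
    """Smallest k whose mean score is within one SE of the best mean.

    ks: candidate cluster counts, length K
    scores: array of shape (K, n_resamples); higher is better
    """
    ks = np.asarray(ks)
    scores = np.asarray(scores, dtype=float)
    means = scores.mean(axis=1)
    # Standard error of the mean across resamples, per candidate k
    ses = scores.std(axis=1, ddof=1) / np.sqrt(scores.shape[1])
    best = int(np.argmax(means))
    threshold = means[best] - ses[best]
    eligible = ks[means >= threshold]
    return int(eligible.min())


# Made-up scores: k=4 has the highest mean, but k=3 is within one SE of it
ks = [2, 3, 4]
scores = [
    [0.50, 0.50, 0.50, 0.50],
    [0.90, 0.92, 0.88, 0.90],
    [0.95, 0.87, 0.93, 0.89],
]
print(select_k_one_se(ks, scores))
```

A plain maximum rule would pick k=4 on these numbers. The one-standard-error rule prefers the simpler solution when the difference is within resampling noise.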
Visualization
```python
# Metric curves across k
carve.plot_metric_over_n_clusters(measure="stability", rule="1se")

# Consensus heatmap for the selected solution
carve.plot_consensus_matrix(measure="generalizability", rule="1se")

# Per-cluster stability violin plot
carve.plot_cluster_violin(source="gini", measure="generalizability", rule="1se")

# 2D scatter with score-encoded marker size and opacity
carve.plot_cluster_scatter(source="gini", measure="generalizability", rule="1se")
```
All plotting methods return a matplotlib Axes object and accept save and dpi parameters for export.
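For context on what a consensus heatmap displays, the sketch below builds a co-clustering frequency matrix from resampled labelings and summarizes it with the proportion of ambiguous clustering (PAC). This is a generic illustration, not CARVE's internals. The function names, the `-1` convention for samples missing from a resample, and the 0.1 and 0.9 thresholds are assumptions.

```python
import numpy as np


def consensus_matrix(label_runs):
    """Fraction of resamples in which each pair of samples shares a cluster.

    label_runs: int array of shape (n_resamples, n_samples); -1 marks a
    sample that was not drawn in that resample.
    """
    label_runs = np.asarray(label_runs)
    n = label_runs.shape[1]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for labels in label_runs:
        present = labels >= 0
        both = np.outer(present, present)  # pair drawn together in this run
        same = (labels[:, None] == labels[None, :]) & both
        together += same
        sampled += both
    # Pairs never drawn together have no estimate and are set to NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(sampled > 0, together / sampled, np.nan)


def pac(consensus, lower=0.1, upper=0.9):
    """Share of distinct pairs whose consensus lies strictly between the bounds."""
    vals = consensus[np.triu_indices_from(consensus, k=1)]
    vals = vals[~np.isnan(vals)]
    return float(np.mean((vals > lower) & (vals < upper)))


# Two runs that disagree on how four samples pair up
runs = [[0, 0, 1, 1],
        [0, 1, 1, 0]]
C = consensus_matrix(runs)
print(C)
print(pac(C))
```

A consensus matrix with entries close to 0 or 1 gives a PAC near zero and indicates a clean block structure in the heatmap. Many intermediate values signal pairs whose co-membership changes from resample to resample.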
Citation
If you use CARVE in your research, please cite:
Wycik, K. R., Tang, T. M., Zikry, T. M., & Allen, G. I. (2026). CARVE: Cluster Analysis with Resampling for Validation and Exploration. Zenodo. https://doi.org/10.5281/zenodo.20448965
```bibtex
@software{wycik2026carve,
  author    = {Wycik, Kai R. and Tang, Tiffany M. and Zikry, Tarek M. and Allen, Genevera I.},
  title     = {{CARVE}: Cluster Analysis with Resampling for Validation and Exploration},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20448965},
  url       = {https://doi.org/10.5281/zenodo.20448965}
}
```
Authors
- Kai R. Wycik — Columbia University
- Tiffany M. Tang — University of Notre Dame
- Tarek M. Zikry — UNC Chapel Hill
- Genevera I. Allen — Columbia University
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
This project uses Ruff for linting and formatting, and pytest for testing:
```shell
ruff check src/   # lint
ruff format src/  # format
pytest -v         # run tests
```