densitree

Density-normalized clustering with minimum spanning tree construction for high-dimensional data.

densitree implements an improved SPADE algorithm that combines density-dependent downsampling with consensus overclustering to produce accurate cluster assignments and interpretable tree structures. It works on any high-dimensional dataset with imbalanced density -- cytometry, single-cell RNA-seq, proteomics, or general point-cloud data.

Benchmark results

On the standard Levine_32dim CyTOF benchmark (104k cells, 32 markers, 14 populations):

Method                     ARI    NMI    Runtime
densitree                  0.942  0.930  4.0s
FlowSOM                    0.934  0.920  0.1s
FlowSOM (official Python)  0.914  0.914  3.6s
PhenoGraph-style           0.908  0.906  88.0s
KMeans                     0.569  0.802  1.3s
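
ARI (adjusted Rand index) and NMI (normalized mutual information) both compare a predicted partition against reference labels and ignore how the cluster ids are numbered. As a minimal sketch of how such scores can be computed, assuming scikit-learn is available (the toy labels below are hypothetical and are not the benchmark data):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical toy labels, only to illustrate the two metrics.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_same = np.array([1, 1, 1, 0, 0, 0, 2, 2, 2])  # same partition, permuted ids
y_off = np.array([1, 1, 0, 0, 0, 0, 2, 2, 2])   # one sample placed in the wrong cluster

# Both metrics are invariant to a permutation of the cluster ids.
ari_same = adjusted_rand_score(y_true, y_same)
nmi_same = normalized_mutual_info_score(y_true, y_same)

# A single misassigned sample lowers both scores.
ari_off = adjusted_rand_score(y_true, y_off)
nmi_off = normalized_mutual_info_score(y_true, y_off)

print(f"permuted ids : ARI={ari_same:.3f} NMI={nmi_same:.3f}")
print(f"one mistake  : ARI={ari_off:.3f} NMI={nmi_off:.3f}")
```

With real data, `y_true` would be the manually gated population labels and the second argument the labels returned by the clustering method.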

Installation

pip install densitree

From source:

git clone https://github.com/fuzue/densitree.git
cd densitree
pip install -e ".[dev]"

Quick start

from densitree import SPADE

# X is any (n_samples, n_features) array or DataFrame
spade = SPADE(n_clusters=20, downsample_target=0.1, random_state=42)
spade.fit(X)

# Cluster labels for every sample
print(spade.labels_)

# Per-cluster statistics
print(spade.result_.cluster_stats_)

# Visualize the spanning tree
spade.result_.plot_tree(color_by=0, backend="matplotlib")

With a pandas DataFrame, column names are preserved:

import pandas as pd

df = pd.read_csv("data.csv")
spade = SPADE(n_clusters=30, random_state=42)
spade.fit(df[["feature_a", "feature_b", "feature_c"]])

# Stats include median_feature_a, median_feature_b, etc.
print(spade.result_.cluster_stats_)

Key features

  • State-of-the-art accuracy -- consensus overclustering with a mixed-linkage ensemble scores higher ARI and NMI than FlowSOM and a PhenoGraph-style baseline on the Levine_32dim benchmark
  • scikit-learn compatible -- fit() / fit_predict() API, works with numpy arrays and pandas DataFrames
  • Tree output -- minimum spanning tree reveals hierarchical relationships between clusters
  • Rare population preservation -- density-dependent downsampling ensures small subgroups are not lost
  • Extensible -- swap any pipeline step (density estimation, clustering) via the BaseStep interface
  • Dual visualization -- static matplotlib and interactive plotly backends
  • Reproducible -- deterministic with random_state
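
The rare-population bullet above can be illustrated with a small sketch of density-dependent downsampling. This is a simplified stand-in and not densitree's internal code: the helper names `knn_density` and `density_downsample` are hypothetical, density is approximated as the inverse distance to the k-th neighbor, and each point is kept with probability proportional to the inverse of that density.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k=5):
    """Hypothetical helper: density as the inverse distance to the k-th neighbor."""
    # k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    return 1.0 / (dist[:, -1] + 1e-12)

def density_downsample(X, target=0.1, k=5, seed=0):
    """Hypothetical helper: keep points with probability proportional to 1/density.

    The expected number of kept points is at most about target * len(X),
    because probabilities above 1 are clipped.
    """
    rng = np.random.default_rng(seed)
    weights = 1.0 / knn_density(X, k)
    prob = np.minimum(1.0, target * len(X) * weights / weights.sum())
    return np.flatnonzero(rng.random(len(X)) < prob)

# Toy data: one dense population and one small, sparse population.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.5, size=(5000, 2))
rare = rng.normal(6.0, 1.0, size=(50, 2))
X = np.vstack([dense, rare])
is_rare = np.arange(len(X)) >= len(dense)

kept = density_downsample(X, target=0.1, k=5, seed=0)
keep_mask = np.zeros(len(X), dtype=bool)
keep_mask[kept] = True

frac_dense = keep_mask[~is_rare].mean()
frac_rare = keep_mask[is_rare].mean()
print(f"kept {len(kept)} of {len(X)} points")
print(f"fraction kept: dense={frac_dense:.2f}, rare={frac_rare:.2f}")
```

Uniform random sampling at the same rate would keep roughly the same fraction of both groups; weighting by inverse density keeps a much larger share of the sparse group.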

How it works

  1. Density estimation -- k-NN local density for every sample
  2. Consensus clustering -- several MiniBatchKMeans overclustering runs, each metaclustered with Ward or average linkage; the resulting labelings are aligned via the Hungarian algorithm and combined by majority vote
  3. Density-dependent downsampling -- rare regions are preserved for tree construction
  4. MST construction -- cluster centroids connected into a minimum spanning tree

Parameters

Parameter          Default    Description
n_clusters         50         Number of output clusters
downsample_target  0.1        Fraction of samples retained for tree construction
knn                5          Number of neighbors used for density estimation
n_consensus        10         Overclustering runs per linkage (two linkages, so 2 x n_consensus runs in total); higher values give more stable results
transform          "arcsinh"  "arcsinh", "log", or None
cofactor           150.0      Arcsinh denominator (5.0 for CyTOF, 150.0 for flow cytometry)
random_state       None       Seed for reproducibility
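
To make the transform and cofactor parameters concrete, here is a minimal sketch assuming the transform is the usual cytometry form arcsinh(x / cofactor), which matches the "Arcsinh denominator" description above. The function `arcsinh_transform` is a hypothetical helper written for this example and is not part of the densitree API.

```python
import numpy as np

def arcsinh_transform(X, cofactor=150.0):
    """Hypothetical helper: arcsinh(x / cofactor), applied elementwise."""
    return np.arcsinh(np.asarray(X, dtype=float) / cofactor)

# Made-up raw intensities spanning several orders of magnitude.
raw = np.array([0.0, 5.0, 150.0, 15000.0])

cytof_scaled = arcsinh_transform(raw, cofactor=5.0)    # cofactor listed for CyTOF
flow_scaled = arcsinh_transform(raw, cofactor=150.0)   # cofactor listed for flow cytometry

print("cofactor=5   :", np.round(cytof_scaled, 3))
print("cofactor=150 :", np.round(flow_scaled, 3))
```

The transform is roughly linear for values well below the cofactor and roughly logarithmic above it, so the cofactor sets where that transition happens.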

Documentation

Full documentation with API reference, tutorials, and benchmark details can be built and served locally:

pip install "densitree[docs]"
mkdocs serve

Running benchmarks

pip install "densitree[bench]"
cd benchmarks

# Synthetic dataset
python run_benchmark.py synthetic

# Real CyTOF data (downloads Levine_32dim automatically)
python run_benchmark.py Levine_32dim

License

MIT

References

  • Qiu, P. et al. (2011). "Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE." Nature Biotechnology, 29(10), 886-891. doi:10.1038/nbt.1991
  • Levine, J.H. et al. (2015). "Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis." Cell, 162(1), 184-197. doi:10.1016/j.cell.2015.05.047
  • Samusik, N. et al. (2016). "Automated mapping of phenotype space with single-cell data." Nature Methods, 13(6), 493-496. doi:10.1038/nmeth.3863
