A Rust-backed Python clustering library

rustcluster

Fast, Rust-backed clustering for Python: six algorithms, an sklearn-compatible API, and a purpose-built embedding-clustering pipeline whose faer-backed PCA runs 11x faster than a hand-rolled matmul.

Highlights

  • 6 algorithms: KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, AgglomerativeClustering, EmbeddingCluster
  • EmbeddingCluster — purpose-built pipeline for OpenAI/Cohere/Voyage embeddings (L2-normalize → PCA → spherical K-means)
  • EmbeddingReducer — standalone PCA transformer with save/load (fit once, cluster for free)
  • faer-accelerated PCA — 11x faster than hand-rolled matmul via SIMD-optimized GEMM
  • 3 distance metrics: euclidean, cosine, manhattan
  • 3 evaluation metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin
  • KD-tree acceleration for DBSCAN/HDBSCAN neighbor queries (10-200x on low-d data)
  • Native f32/f64 — no silent upcast, doubles cache efficiency with f32
  • Pickle serialization for all fitted models
  • GIL released during all compute — plays well with threads and async
  • 416 tests across Rust and Python

Installation

pip install rustcluster

Or build from source (requires a Rust toolchain and Python 3.10+):

pip install maturin
git clone https://github.com/mfbaig35r/rustcluster.git
cd rustcluster
maturin develop --release

Quickstart

K-Means

from rustcluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
model.labels_           # cluster assignments
model.cluster_centers_  # centroids (k x d)
model.inertia_          # sum of squared distances
model.predict(X_new)    # assign new data
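
To make the fitted attributes above concrete, here is a minimal pure-Python sketch of one assignment-and-update pass of Lloyd's algorithm on toy data. It is an illustration only and not rustcluster's implementation; the helper names `sq_dist` and `lloyd_step` and the toy points are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_step(points, centers):
    """One assignment + update pass of Lloyd's algorithm (hypothetical helper)."""
    # Assignment: each point takes the index of its nearest center.
    labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
              for p in points]
    # Update: each center moves to the mean of its assigned points.
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(list(old))  # keep an empty cluster's center in place
    # Inertia: sum of squared distances to the assigned (updated) centers.
    inertia = sum(sq_dist(p, new_centers[lab]) for p, lab in zip(points, labels))
    return labels, new_centers, inertia


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers, inertia = lloyd_step(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

Here `labels`, `centers`, and `inertia` play the roles of `labels_`, `cluster_centers_`, and `inertia_`; a full fit repeats this pass until the assignments stop changing.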

Embedding Clustering

Purpose-built pipeline for dense embedding vectors (OpenAI, Cohere, Voyage, etc.):

from rustcluster.experimental import EmbeddingCluster

model = EmbeddingCluster(n_clusters=50, reduction_dim=128)
model.fit(embeddings)          # L2-normalize → PCA → spherical K-means
model.labels_                  # cluster assignments
model.cluster_centers_         # unit-norm centroids in reduced space
model.intra_similarity_        # per-cluster cosine similarity
model.reduced_data_            # access PCA-reduced data
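
The L2-normalize and spherical K-means stages can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: it skips the PCA stage entirely, runs a single assignment and update, and the helper names (`l2_normalize`, `spherical_assign`, `update_centroids`) are hypothetical rather than part of rustcluster.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def spherical_assign(vectors, centroids):
    """Assign each unit vector to the centroid with the highest dot product."""
    labels = []
    for v in vectors:
        sims = [sum(a * b for a, b in zip(v, c)) for c in centroids]
        labels.append(max(range(len(centroids)), key=sims.__getitem__))
    return labels


def update_centroids(vectors, labels, k):
    """Mean of each cluster's members, re-normalized to unit length."""
    dim = len(vectors[0])
    centroids = []
    for j in range(k):
        members = [v for v, lab in zip(vectors, labels) if lab == j]
        mean = ([sum(col) / len(members) for col in zip(*members)]
                if members else [0.0] * dim)
        centroids.append(l2_normalize(mean))
    return centroids


raw = [[3.0, 0.1], [5.0, 0.0], [0.2, 4.0], [0.0, 9.0]]
unit = [l2_normalize(v) for v in raw]
labels = spherical_assign(unit, centroids=[[1.0, 0.0], [0.0, 1.0]])
centroids = update_centroids(unit, labels, k=2)
```

On unit vectors the dot product equals cosine similarity, which is why the centroids are re-normalized after every update.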

EmbeddingReducer (Fit Once, Cluster Many)

PCA accounts for nearly all of the embedding pipeline's runtime. Separating reduction from clustering lets you pay that cost once and then iterate on clustering settings at a fraction of it:

from rustcluster.experimental import EmbeddingReducer

# Pay the PCA cost once
reducer = EmbeddingReducer(target_dim=128)
X_reduced = reducer.fit_transform(embeddings)  # 323K × 1536 → 128 in ~56s
reducer.save("pca_128.bin")

# Iterate on clustering for free
reducer = EmbeddingReducer.load("pca_128.bin")
X_reduced = reducer.transform(new_embeddings)

EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X_reduced)   # ~4s
EmbeddingCluster(n_clusters=100, reduction_dim=None).fit(X_reduced)  # ~8s
EmbeddingCluster(n_clusters=200, reduction_dim=None).fit(X_reduced)  # ~15s

Matryoshka models (e.g., text-embedding-3-small) can skip PCA entirely:

reducer = EmbeddingReducer(target_dim=128, method="matryoshka")
X_reduced = reducer.fit_transform(embeddings)  # instant — just truncates + L2-normalizes
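
As the comment above says, the Matryoshka path only truncates and re-normalizes. A minimal sketch of that operation on one vector is shown here; `matryoshka_reduce` is a hypothetical helper written for illustration, not the library's code.

```python
import math


def matryoshka_reduce(embedding, target_dim):
    """Keep the first target_dim components, then rescale to unit length."""
    head = embedding[:target_dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]


# A toy 4-d vector truncated to 2-d: [3, 4] has norm 5, so the result is
# approximately [0.6, 0.8].
reduced = matryoshka_reduce([3.0, 4.0, 12.0, 84.0], target_dim=2)
```

No projection matrix has to be learned, which is why this path needs no fitting cost.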

See the embedding clustering guide for full documentation.

Mini-Batch K-Means

from rustcluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42)
model.fit(X_large)      # scales to large datasets
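
For intuition on why the mini-batch variant scales, here is a sketch of the textbook mini-batch update (per-center running mean, as described by Sculley) in pure Python. It is not taken from rustcluster's source; `minibatch_step`, the synthetic 1-d data, and the batch size of 32 are assumptions for the example.

```python
import random


def minibatch_step(batch, centers, counts):
    """Apply one mini-batch update in place (per-center running mean)."""
    # Cache the nearest center for every sample before moving any center.
    nearest = [
        min(range(len(centers)),
            key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
        for p in batch
    ]
    for p, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]  # step size shrinks as a center sees more samples
        centers[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centers[j], p)]


rng = random.Random(42)
data = ([[rng.gauss(0.0, 0.1)] for _ in range(200)]
        + [[rng.gauss(5.0, 0.1)] for _ in range(200)])
centers, counts = [[1.0], [4.0]], [0, 0]
for _ in range(20):
    minibatch_step(rng.sample(data, 32), centers, counts)
```

Each step touches only the sampled batch, so the cost per iteration depends on `batch_size` rather than on the size of the full dataset.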

DBSCAN

from rustcluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
model.labels_                # -1 for noise
model.core_sample_indices_   # core point indices
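
The meaning of `eps`, `min_samples`, and core points can be illustrated with a brute-force check. This sketch assumes the scikit-learn convention that a point counts as its own neighbor; `core_points` is a hypothetical helper, and the library itself uses KD-tree queries where available rather than this O(n^2) loop.

```python
def core_points(points, eps, min_samples):
    """Indices of points with at least min_samples neighbors within eps.

    Assumes each point counts as its own neighbor.
    """
    core = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for q in points
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps
        )
        if neighbors >= min_samples:
            core.append(i)
    return core


points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
core = core_points(points, eps=0.5, min_samples=3)
```

The three nearby points are core points; the isolated point at (5, 5) has only itself within `eps` and would be labeled noise (-1).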

HDBSCAN

from rustcluster import HDBSCAN

model = HDBSCAN(min_cluster_size=5)
model.fit(X)
model.labels_              # -1 for noise
model.probabilities_       # soft membership [0, 1]
model.cluster_persistence_ # per-cluster stability

Agglomerative Clustering

from rustcluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
model.fit(X)
model.labels_     # cluster assignments
model.children_   # merge history
model.distances_  # distance at each merge
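
A merge history is easier to read once it is expanded into cluster memberships. The sketch below assumes `children_` follows the scikit-learn layout, where row `i` merges two node ids and the result becomes node `n_samples + i`; that layout is an assumption here, so check it against the library before relying on it. `merge_members` and the sample history are hypothetical.

```python
def merge_members(children, n_samples):
    """Map each internal node id to the sorted sample indices it contains."""
    members = {i: [i] for i in range(n_samples)}
    for step, (left, right) in enumerate(children):
        members[n_samples + step] = sorted(members[left] + members[right])
    # Return only the internal (merged) nodes, not the original leaves.
    return {node: m for node, m in members.items() if node >= n_samples}


# Hypothetical history for 4 samples:
# (0, 1) -> node 4, (2, 3) -> node 5, (4, 5) -> node 6.
children = [(0, 1), (2, 3), (4, 5)]
tree = merge_members(children, n_samples=4)
```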

Evaluation Metrics

from rustcluster import silhouette_score, calinski_harabasz_score, davies_bouldin_score

silhouette_score(X, labels)         # [-1, 1], higher is better
calinski_harabasz_score(X, labels)  # higher is better
davies_bouldin_score(X, labels)     # lower is better
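
To show what the silhouette score measures, here is a brute-force pure-Python version using Euclidean distance. It is a sketch of the standard definition, not rustcluster's implementation; the function name `silhouette` and the toy data are assumptions, and it expects at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (brute force, O(n^2))."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(points, [0, 0, 1, 1])  # well-separated labeling
bad = silhouette(points, [0, 1, 0, 1])   # labeling that mixes the two groups
```

The well-separated labeling scores close to 1, while the mixed labeling scores below zero, matching the "higher is better" reading above.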

Distance Metrics

All algorithms accept a metric parameter:

KMeans(n_clusters=5, metric="cosine")
DBSCAN(eps=0.3, metric="manhattan")
HDBSCAN(min_cluster_size=5, metric="euclidean")

| Metric | Aliases | KD-tree acceleration | Notes |
|---|---|---|---|
| "euclidean" | "l2" | Yes | Default for all algorithms |
| "cosine" | | No (brute force) | K-means forces Lloyd (Hamerly assumes Euclidean) |
| "manhattan" | "cityblock", "l1" | Yes | |

Ward linkage requires euclidean metric.
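
For reference, the three metrics can be written out in a few lines of pure Python. This sketch assumes that "cosine" means cosine distance, that is, 1 minus cosine similarity; the function names are illustrative and not part of rustcluster.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [1.0, 0.0], [0.0, 2.0]
d_l2 = euclidean(a, b)         # sqrt(5)
d_l1 = manhattan(a, b)         # 3.0
d_cos = cosine_distance(a, b)  # 1.0, since the vectors are orthogonal
```

Cosine distance ignores vector length, so scaling `b` changes the Euclidean and Manhattan results but leaves `d_cos` unchanged.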

Performance

K-Means vs scikit-learn

Single-threaded, n_init=1, median of 5 runs:

| n | d | k | Speedup vs sklearn |
|---|---|---|---|
| 1,000 | 8 | 8 | 2.9x |
| 10,000 | 8 | 8 | 2.4x |
| 100,000 | 8 | 32 | 3.2x |
| 100,000 | 32 | 32 | 1.4x |

DBSCAN and HDBSCAN use KD-tree acceleration for d <= 16 with euclidean or manhattan metrics, reducing neighbor queries from O(n^2) to O(n log n).

Embedding Clustering

Measured on 323K embeddings (text-embedding-3-small, 1536d → 128d, K=98, Apple Silicon):

| Workflow | Time |
|---|---|
| Full pipeline (PCA + cluster) | 58s |
| Subsequent run (cached reduced data) | 7.5s |
| 5 clustering configs on cached data | 74s |
| Matryoshka (no PCA needed) | ~5s |

Full benchmarks: python benches/benchmark.py

Serialization

All models support pickle:

import pickle

from rustcluster import KMeans

model = KMeans(n_clusters=3).fit(X)
data = pickle.dumps(model)
model_restored = pickle.loads(data)  # fitted state preserved

EmbeddingReducer uses a compact binary format:

reducer.save("pca_128.bin")                   # 1.5 KB
reducer = EmbeddingReducer.load("pca_128.bin") # instant

Development

maturin develop --release              # build
cargo test --no-default-features --lib # Rust tests (182)
pytest tests/ -v                       # Python tests (234)
python benches/benchmark.py            # benchmark vs sklearn
cargo fmt -- --check                   # formatting
cargo clippy --no-default-features --lib -- -D warnings  # linting

Architecture

Three-layer kernel design separating concerns:

  1. PyO3 boundary (src/lib.rs) — input validation, GIL release, dtype dispatch
  2. Algorithm logic (src/kmeans.rs, etc.) — iteration, convergence, ndarray types
  3. Hot kernel (src/utils.rs, src/distance.rs) — raw &[F] slices for auto-vectorization

The embedding pipeline adds:

  4. Embedding module (src/embedding/) — spherical K-means, PCA (faer-backed), vMF refinement, EmbeddingReducer

See docs/architecture-decisions.md for details and docs/lessons-building-rustcluster.md for the full build story.

Contributing

See CONTRIBUTING.md for how to add algorithms, distance metrics, and tests.

License

MIT
