Skip to main content

A Rust-backed Python clustering library

Project description

rustcluster

Fast, Rust-backed clustering for Python. Six algorithms, sklearn-compatible API, purpose-built embedding clustering with 11x PCA speedup.

Highlights

  • 6 algorithms: KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, AgglomerativeClustering, EmbeddingCluster
  • EmbeddingCluster — purpose-built pipeline for OpenAI/Cohere/Voyage embeddings (L2-normalize → PCA → spherical K-means)
  • EmbeddingReducer — standalone PCA transformer with save/load (fit once, cluster for free)
  • faer-accelerated PCA — 11x faster than hand-rolled matmul via SIMD-optimized GEMM
  • 3 distance metrics: euclidean, cosine, manhattan
  • 3 evaluation metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin
  • KD-tree acceleration for DBSCAN/HDBSCAN neighbor queries (10-200x on low-d data)
  • Native f32/f64 — no silent upcast, doubles cache efficiency with f32
  • Pickle serialization for all fitted models
  • GIL released during all compute — plays well with threads and async
  • 416 tests across Rust and Python

Installation

pip install rustcluster

Or from source (requires Rust toolchain + Python 3.10+):

pip install maturin
git clone https://github.com/mfbaig35r/rustcluster.git
cd rustcluster
maturin develop --release

Quickstart

K-Means

from rustcluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
model.labels_           # cluster assignments
model.cluster_centers_  # centroids (k x d)
model.inertia_          # sum of squared distances
model.predict(X_new)    # assign new data

Embedding Clustering

Purpose-built pipeline for dense embedding vectors (OpenAI, Cohere, Voyage, etc.):

from rustcluster.experimental import EmbeddingCluster

model = EmbeddingCluster(n_clusters=50, reduction_dim=128)
model.fit(embeddings)          # L2-normalize → PCA → spherical K-means
model.labels_                  # cluster assignments
model.cluster_centers_         # unit-norm centroids in reduced space
model.intra_similarity_        # per-cluster cosine similarity
model.reduced_data_            # access PCA-reduced data

EmbeddingReducer (Fit Once, Cluster Many)

PCA is 99% of the embedding pipeline runtime. Separate reduction from clustering to iterate for free:

from rustcluster.experimental import EmbeddingReducer

# Pay the PCA cost once
reducer = EmbeddingReducer(target_dim=128)
X_reduced = reducer.fit_transform(embeddings)  # 323K × 1536 → 128 in ~56s
reducer.save("pca_128.bin")

# Iterate on clustering for free
reducer = EmbeddingReducer.load("pca_128.bin")
X_reduced = reducer.transform(new_embeddings)

EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X_reduced)   # ~4s
EmbeddingCluster(n_clusters=100, reduction_dim=None).fit(X_reduced)  # ~8s
EmbeddingCluster(n_clusters=200, reduction_dim=None).fit(X_reduced)  # ~15s

Matryoshka models (e.g., text-embedding-3-small) can skip PCA entirely:

reducer = EmbeddingReducer(target_dim=128, method="matryoshka")
X_reduced = reducer.fit_transform(embeddings)  # instant — just truncates + L2-normalizes

See the embedding clustering guide for full documentation.

Mini-Batch K-Means

from rustcluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42)
model.fit(X_large)      # scales to large datasets

DBSCAN

from rustcluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
model.labels_                # -1 for noise
model.core_sample_indices_   # core point indices

HDBSCAN

from rustcluster import HDBSCAN

model = HDBSCAN(min_cluster_size=5)
model.fit(X)
model.labels_              # -1 for noise
model.probabilities_       # soft membership [0, 1]
model.cluster_persistence_ # per-cluster stability

Agglomerative Clustering

from rustcluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
model.fit(X)
model.labels_     # cluster assignments
model.children_   # merge history
model.distances_  # distance at each merge

Evaluation Metrics

from rustcluster import silhouette_score, calinski_harabasz_score, davies_bouldin_score

silhouette_score(X, labels)         # [-1, 1], higher is better
calinski_harabasz_score(X, labels)  # higher is better
davies_bouldin_score(X, labels)     # lower is better

Distance Metrics

All algorithms accept a metric parameter:

KMeans(n_clusters=5, metric="cosine")
DBSCAN(eps=0.3, metric="manhattan")
HDBSCAN(min_cluster_size=5, metric="euclidean")
Metric Aliases KD-tree acceleration Notes
"euclidean" "l2" Yes Default for all algorithms
"cosine" No (brute force) K-means forces Lloyd (Hamerly assumes Euclidean)
"manhattan" "cityblock", "l1" Yes

Ward linkage requires euclidean metric.

Performance

K-Means vs scikit-learn

Single-threaded, n_init=1, median of 5 runs:

n d k Speedup vs sklearn
1,000 8 8 2.9x
10,000 8 8 2.4x
100,000 8 32 3.2x
100,000 32 32 1.4x

DBSCAN and HDBSCAN use KD-tree acceleration for d <= 16 with euclidean or manhattan metrics, reducing neighbor queries from O(n^2) to O(n log n).

Embedding Clustering

Measured on 323K embeddings (text-embedding-3-small, 1536d → 128d, K=98, Apple Silicon):

Workflow Time
Full pipeline (PCA + cluster) 58s
Subsequent run (cached reduced data) 7.5s
5 clustering configs on cached data 74s
Matryoshka (no PCA needed) ~5s

Full benchmarks: python benches/benchmark.py

Serialization

All models support pickle:

import pickle

model = KMeans(n_clusters=3).fit(X)
data = pickle.dumps(model)
model_restored = pickle.loads(data)  # fitted state preserved

EmbeddingReducer uses a compact binary format:

reducer.save("pca_128.bin")                   # 1.5 KB
reducer = EmbeddingReducer.load("pca_128.bin") # instant

Development

maturin develop --release              # build
cargo test --no-default-features --lib # Rust tests (182)
pytest tests/ -v                       # Python tests (234)
python benches/benchmark.py            # benchmark vs sklearn
cargo fmt -- --check                   # formatting
cargo clippy --no-default-features --lib -- -D warnings  # linting

Architecture

Three-layer kernel design separating concerns:

  1. PyO3 boundary (src/lib.rs) — input validation, GIL release, dtype dispatch
  2. Algorithm logic (src/kmeans.rs, etc.) — iteration, convergence, ndarray types
  3. Hot kernel (src/utils.rs, src/distance.rs) — raw &[F] slices for auto-vectorization

The embedding pipeline adds:

  1. Embedding module (src/embedding/) — spherical K-means, PCA (faer-backed), vMF refinement, EmbeddingReducer

See docs/architecture-decisions.md for details and docs/lessons-building-rustcluster.md for the full build story.

Contributing

See CONTRIBUTING.md for how to add algorithms, distance metrics, and tests.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rustcluster-0.3.2.tar.gz (229.0 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

rustcluster-0.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

rustcluster-0.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (920.3 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ ARM64

rustcluster-0.3.2-cp312-cp312-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.12Windows x86-64

rustcluster-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

rustcluster-0.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (920.3 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

rustcluster-0.3.2-cp312-cp312-macosx_11_0_arm64.whl (846.2 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

rustcluster-0.3.2-cp312-cp312-macosx_10_12_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

rustcluster-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

rustcluster-0.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (920.8 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

rustcluster-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

rustcluster-0.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (920.6 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

rustcluster-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

rustcluster-0.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (921.9 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ ARM64

File details

Details for the file rustcluster-0.3.2.tar.gz.

File metadata

  • Download URL: rustcluster-0.3.2.tar.gz
  • Upload date:
  • Size: 229.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rustcluster-0.3.2.tar.gz
Algorithm Hash digest
SHA256 a80e6af3372253d2158e24c395a87cd8204247d3305918fb605e107a3b5c1d11
MD5 bbbc50e6e0c1183c51535d5f6382fd4b
BLAKE2b-256 8fb9d002861921d6b153e7832adcc46885fe6fbb2f64927e9a51bb374b86da37

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2.tar.gz:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2c17abf89758f70b71b4191bfc595ae053ec904f8a96c8de140be0cc150389a7
MD5 f4b65076f069fb130101460ce23cd713
BLAKE2b-256 d8d6ba10bf775b1b40aa902b446288da2e7f2d51153671dac64137b82f79b2b6

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 784a7abb98d4d5287c09059abd749a7875fd5cf2826abd1cac547ce5e45fe419
MD5 a69a828960f7e5cad470d497e16485dc
BLAKE2b-256 349b08fe4411ed587479325c4cebf47eed6a5268cb086ad32896106fb57c9bd1

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 4b2bd8979bee05a336d38bea884f7ca5cd6e23469a7d60712747515083620215
MD5 c079b43e4d5742d25b5b8690069f06e5
BLAKE2b-256 ec7df3759e28d87da433aa3252638c7532ed6edd0761675f88f7c8fc436686c2

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp312-cp312-win_amd64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d76f9b4052a6df6babce140de9029f2df7fb49be53de62aa01ec284ff5255765
MD5 d506b127ba273a47ed283a123bfe993b
BLAKE2b-256 27a3efc64dcde002cbd00877286cca81ef869b71a3347923b894511eb1eaaba8

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7c358442e4256f22cf51ff5e646f1b0fcafcc43abdc99e87ae77a25874668390
MD5 91293650d1de0a8772d041b4520549f0
BLAKE2b-256 c94712da28fb29b2ca614a2f4caffe4bc1d2b0cd81e77c256a38398160c1943a

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5e644543b0d4411df61bc24050117d5e98e0bad2a2ca12ce9675fb44a318165b
MD5 e16ad4389cc53c12fd5fff5ab626f5e6
BLAKE2b-256 9ab3c8022f41c10042ea4629fa6803133fb95876665eb571c9a510eb22249c37

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 eab09a9157d162e0d7ac8e9aa4d80b373070d2ab3f4436e20ff5ee0338b5af5d
MD5 367ab733b625b91f54c35cadea1cc8dc
BLAKE2b-256 2ed9dc1fee6aa6c3628bb835081c39e6107f1ba4e382848bfc91aff8dd57c504

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 293eacd20d9adc9072b14b5f5617fb410b13cb9ba4aac6859f839ccb52a7a730
MD5 40935137ed7bf45c1257ec098db1c951
BLAKE2b-256 44880e5a1fb9057626f0dec46e9f75d29ff3d4102092da239d0f5a1a10944830

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 16e1725ec8f00776192f13c499071521ef505689dda3c27e10c884530d835717
MD5 662b6fefda6990780232fa3e4cc46775
BLAKE2b-256 c3e8cc6bcf26d5c6f6fb4cc3e050988cb6cf028fb768c910cfa459548f8ed92f

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a69ff5f10d9318de5ce9eeea63be9a305fdb10e9b1210860a9d57c0f8d8eee34
MD5 feeede3357aaa22979830101228c8d35
BLAKE2b-256 0a9fffb76f2f06bc0996be6c3d696ed8b4dcd1bfe618b8595347f144148da989

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 50ca15b274267fbf8cbb4e23e72bb832bf2ebf428bb2d685ef31f5e3c8c3168d
MD5 3c5df0970e5b40ec917c4fab85e52da4
BLAKE2b-256 afd6c78c5cc337a355223cd176eb2fe32b9a2b62e20538977b3de924f881e246

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1b792608f0ef3f88ff0a4eb035a1ccc1fe3f78378063443d46b7685325fd4f6d
MD5 a17d1c8d682718a67e51e781da3bc0b8
BLAKE2b-256 e2ad794f85ac2e855473d371ab4ce497a10ccf2c69ac2107ef51548e5a15b96e

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 da135a31fa577aa272557648677d09ba6f646a93f01e6e4fdfd3ea5411f6219e
MD5 17aa51812a5fb6d2168127b482b9e4af
BLAKE2b-256 99f453c8c733cdcc2f73016671e8d77e74d86533b216fb5ab361968670a39a03

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page