Skip to main content

A Rust-backed Python clustering library

Project description

rustcluster

Fast, Rust-backed clustering for Python. Six algorithms, sklearn-compatible API, purpose-built embedding clustering with 11x PCA speedup.

Highlights

  • 6 algorithms: KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, AgglomerativeClustering, EmbeddingCluster
  • EmbeddingCluster — purpose-built pipeline for OpenAI/Cohere/Voyage embeddings (L2-normalize → PCA → spherical K-means)
  • EmbeddingReducer — standalone PCA transformer with save/load (fit once, cluster for free)
  • faer-accelerated PCA — 11x faster than hand-rolled matmul via SIMD-optimized GEMM
  • 3 distance metrics: euclidean, cosine, manhattan
  • 3 evaluation metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin
  • KD-tree acceleration for DBSCAN/HDBSCAN neighbor queries (10-200x on low-d data)
  • Native f32/f64 — no silent upcast, doubles cache efficiency with f32
  • Pickle serialization for all fitted models
  • GIL released during all compute — plays well with threads and async
  • 421 tests across Rust and Python

Installation

pip install rustcluster

Or from source (requires Rust toolchain + Python 3.10+):

pip install maturin
git clone https://github.com/mfbaig35r/rustcluster.git
cd rustcluster
maturin develop --release

Quickstart

K-Means

from rustcluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
model.labels_           # cluster assignments
model.cluster_centers_  # centroids (k x d)
model.inertia_          # sum of squared distances
model.predict(X_new)    # assign new data

Embedding Clustering

Purpose-built pipeline for dense embedding vectors (OpenAI, Cohere, Voyage, etc.):

from rustcluster.experimental import EmbeddingCluster

model = EmbeddingCluster(n_clusters=50, reduction_dim=128)
model.fit(embeddings)          # L2-normalize → PCA → spherical K-means
model.labels_                  # cluster assignments
model.cluster_centers_         # unit-norm centroids in reduced space
model.intra_similarity_        # per-cluster cosine similarity
model.reduced_data_            # access PCA-reduced data

EmbeddingReducer (Fit Once, Cluster Many)

PCA is 99% of the embedding pipeline runtime. Separate reduction from clustering to iterate for free:

from rustcluster.experimental import EmbeddingReducer

# Pay the PCA cost once
reducer = EmbeddingReducer(target_dim=128)
X_reduced = reducer.fit_transform(embeddings)  # 323K × 1536 → 128 in ~56s
reducer.save("pca_128.bin")

# Iterate on clustering for free
reducer = EmbeddingReducer.load("pca_128.bin")
X_reduced = reducer.transform(new_embeddings)

EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X_reduced)   # ~4s
EmbeddingCluster(n_clusters=100, reduction_dim=None).fit(X_reduced)  # ~8s
EmbeddingCluster(n_clusters=200, reduction_dim=None).fit(X_reduced)  # ~15s

Matryoshka models (e.g., text-embedding-3-small) can skip PCA entirely:

reducer = EmbeddingReducer(target_dim=128, method="matryoshka")
X_reduced = reducer.fit_transform(embeddings)  # instant — just truncates + L2-normalizes

See the embedding clustering guide for full documentation.

Mini-Batch K-Means

from rustcluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42)
model.fit(X_large)      # scales to large datasets

DBSCAN

from rustcluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
model.fit(X)
model.labels_                # -1 for noise
model.core_sample_indices_   # core point indices

HDBSCAN

from rustcluster import HDBSCAN

model = HDBSCAN(min_cluster_size=5)
model.fit(X)
model.labels_              # -1 for noise
model.probabilities_       # soft membership [0, 1]
model.cluster_persistence_ # per-cluster stability

Agglomerative Clustering

from rustcluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
model.fit(X)
model.labels_     # cluster assignments
model.children_   # merge history
model.distances_  # distance at each merge

Evaluation Metrics

from rustcluster import silhouette_score, calinski_harabasz_score, davies_bouldin_score

silhouette_score(X, labels)         # [-1, 1], higher is better
calinski_harabasz_score(X, labels)  # higher is better
davies_bouldin_score(X, labels)     # lower is better

Distance Metrics

All algorithms accept a metric parameter:

KMeans(n_clusters=5, metric="cosine")
DBSCAN(eps=0.3, metric="manhattan")
HDBSCAN(min_cluster_size=5, metric="euclidean")
Metric Aliases KD-tree acceleration Notes
"euclidean" "l2" Yes Default for all algorithms
"cosine" No (brute force) K-means forces Lloyd (Hamerly assumes Euclidean)
"manhattan" "cityblock", "l1" Yes

Ward linkage requires euclidean metric.

Performance

K-Means vs scikit-learn

Single-threaded, n_init=1, median of 5 runs:

n d k Speedup vs sklearn
1,000 8 8 2.9x
10,000 8 8 2.4x
100,000 8 32 3.2x
100,000 32 32 1.4x

DBSCAN and HDBSCAN use KD-tree acceleration for d <= 16 with euclidean or manhattan metrics, reducing neighbor queries from O(n^2) to O(n log n).

Embedding Clustering

Measured on 323K embeddings (text-embedding-3-small, 1536d → 128d, K=98, Apple Silicon):

Workflow Time
Full pipeline (PCA + cluster) 58s
Subsequent run (cached reduced data) 7.5s
5 clustering configs on cached data 74s
Matryoshka (no PCA needed) ~5s

Full benchmarks: python benches/benchmark.py

Serialization

All models support pickle:

import pickle

model = KMeans(n_clusters=3).fit(X)
data = pickle.dumps(model)
model_restored = pickle.loads(data)  # fitted state preserved

EmbeddingReducer uses a compact binary format:

reducer.save("pca_128.bin")                   # 1.5 KB
reducer = EmbeddingReducer.load("pca_128.bin") # instant

Development

maturin develop --release              # build
cargo test --no-default-features --lib # Rust tests (182)
pytest tests/ -v                       # Python tests (239)
python benches/benchmark.py            # benchmark vs sklearn
cargo fmt -- --check                   # formatting
cargo clippy --no-default-features --lib -- -D warnings  # linting

Architecture

Three-layer kernel design separating concerns:

  1. PyO3 boundary (src/lib.rs) — input validation, GIL release, dtype dispatch
  2. Algorithm logic (src/kmeans.rs, etc.) — iteration, convergence, ndarray types
  3. Hot kernel (src/utils.rs, src/distance.rs) — raw &[F] slices for auto-vectorization

The embedding pipeline adds:

  1. Embedding module (src/embedding/) — spherical K-means, PCA (faer-backed), vMF refinement, EmbeddingReducer

See docs/architecture-decisions.md for details and docs/lessons-building-rustcluster.md for the full build story.

Contributing

See CONTRIBUTING.md for how to add algorithms, distance metrics, and tests.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rustcluster-0.4.0.tar.gz (233.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

rustcluster-0.4.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

rustcluster-0.4.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (928.3 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ ARM64

rustcluster-0.4.0-cp312-cp312-win_amd64.whl (1.1 MB view details)

Uploaded CPython 3.12Windows x86-64

rustcluster-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

rustcluster-0.4.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (928.2 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

rustcluster-0.4.0-cp312-cp312-macosx_11_0_arm64.whl (852.5 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

rustcluster-0.4.0-cp312-cp312-macosx_10_12_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

rustcluster-0.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

rustcluster-0.4.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (930.0 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

rustcluster-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

rustcluster-0.4.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (930.0 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

rustcluster-0.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

rustcluster-0.4.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (931.3 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ ARM64

File details

Details for the file rustcluster-0.4.0.tar.gz.

File metadata

  • Download URL: rustcluster-0.4.0.tar.gz
  • Upload date:
  • Size: 233.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rustcluster-0.4.0.tar.gz
Algorithm Hash digest
SHA256 e9fd9ee8bf3808826ea661086ff4e138e06a636ed0e847a11f2fae364b36203a
MD5 ccda3b94fd3b87f29c8dbfba90581eda
BLAKE2b-256 e142cfd6542f32cccad462923537cae72f9248749d90d16fa9cbd4d9e7f7f4f1

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0.tar.gz:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 86826dbd851da3e5c039fac59b22420adec4cbc906ed0068334f007856142f9a
MD5 0e765ecfa9c6160426df81c7a2381830
BLAKE2b-256 6aa6de0960de07b1b84e416a3f399068af3a90eda81b03a31492a34f82b09700

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2b7a1aa87faa9464c24708c4e8d03ac1c3bdde5ae8c68817a7dec81963650919
MD5 a0df50f3940b4f41d96a6a43058b918d
BLAKE2b-256 9c0ce5ee734c777f1fdfa50938bda56dda0ac4107fe4ba9c9dee7dfeb850625e

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 1a8760eedbd32f11337e8ab049b34bff43c4254bea145480b481b2437cd73e6e
MD5 19988fde5e86be6927bf181279101e29
BLAKE2b-256 fea70b3fab214c857ca94688eaf4b09ee0980150a354c4091dae43068c83a7fc

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp312-cp312-win_amd64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 26fd29398a8474c59ef76897001c3f9a5b02f8066c2b18716a0a590412907007
MD5 4ddc25c1880ee6a3ee97e1bfb788da5b
BLAKE2b-256 6293a091c3ca4ed3a1eeb80a19335e7428d24d3b17a7d82c2afa96dd36570eb7

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 fcb2ec973de54c03c786509094155f02e9f2b1aae8f6531e92a79cc3b5274f08
MD5 17ec235834e91579532be646314aee3a
BLAKE2b-256 4869546d343fffff16f04270c8ac174dbb86b3330ec6c374a4c7d40c0c0e627c

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ab7aa84a50bc213510c5371a24517492f161e96ed2208ec7466fb45c835c32d9
MD5 56026cabbd72d3a26373f0ca10c16d1e
BLAKE2b-256 10ae15b7f0c424c45f114ce64e7206c3fa339689917deb5a74554a431c28717a

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 ef78021601c397551c03d1d2725fad455a1f2f9cec20468019c078fcd08decbe
MD5 f72fec14ab2b0ea0414115339fb31faa
BLAKE2b-256 eb6d129ee61eab49a8e194787b702b74b28f6e07ea683b4d217d551c4beb36c7

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dc9e522b11977d9391fe6199561d60ffe08352e43fcf54173c052b845d8c5e12
MD5 a86348e08ea6507b1e99c5779c3e6fd0
BLAKE2b-256 2620475e09282a03ee9e28d64f36d0f7ceb9154723c442ee91a8432bc1d9af54

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 76ad35e95564bbb7c6c8fe47efe2fdd0a5d21aceaf9ce3745a85ad6640de166f
MD5 1871c5c2ac804b43045c0dc36bb7a605
BLAKE2b-256 7a8e4f3e6a9041b1c58d749a950d5b104e11537e796e7f0f69b8ff57dad0eee5

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d62b59121e0202abcf2d1e3db2c93fd85f6d375b0961a11b95314c2b08f6f8d1
MD5 60d0d160e923b4036695030c20e28b59
BLAKE2b-256 fa56ad23c2b8829a5aa799c4582026f00cc80e0f3743bf52e4dac99ae20fcc4b

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f8ab4489103dda28acaba55df3e82c252d4b7a06d9681a353e2fbe131cf9a852
MD5 4a3262431509fdb4458368ab4f074922
BLAKE2b-256 f40faab744260b5067a42e7ef2afb7f5567f67dd87ce414f75e15cf6995035da

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 585b3500232c03932ae2cef68033f68668f3f15c79a4021f2e567d3be1f0215b
MD5 5ebb4d2853e6b6cd419dee35ff17e1fd
BLAKE2b-256 9f77af22dc5faae1e11004892a9ce179948a3d2baec115321610a35138809a2c

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustcluster-0.4.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustcluster-0.4.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 c12252f60491ed04fddd964b990678cd6449dc60566e4d7d5de1997c8b82c2e3
MD5 daac7d701f9e6bb8252b6efa60fd4771
BLAKE2b-256 2bff32b07c9e778fb05982da175036a573d77037c550e3aed18ebc42463427a7

See more details on using hashes here.

Provenance

The following attestation bundles were made for rustcluster-0.4.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on mfbaig35r/rustcluster

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page