Modern Rust rewrite of the original pqkmeans project for large-scale clustering with numpy and parquet workflows

clostera: The Billion-Vector Resurrection

clostera benchmark summary

They told you that clustering massive high-dimensional vector collections on a single machine was a fool's errand. They said you needed a cluster, a distributed headache, and a cloud bill large enough to ruin your week. They were wrong.

clostera is a from-scratch Rust rebuild of the original pqkmeans repository, aimed at the workloads that made that project exciting in the first place: extremely large vector collections, high dimensionality, single-machine practicality, and performance that is measured rather than hoped for.

This is not a thin wrapper around old code. It is a modern rewrite with a new Rust core, a NumPy-first Python layer, parquet and out-of-core workflows, deterministic benchmarks, automatic number-of-clusters (K) selection, Apple Silicon support, and wheels that install like a normal Python package.

Rust core · Rayon · OpenBLAS/LAPACK · AVX2/SSE · Apple Silicon NEON · NumPy + parquet · manylinux + macOS wheels

pip install clostera

⚡️ Quick Start: It just works

The zero-tuning path

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

That is the default story: one object, raw vectors in, labels out, OPQ-enabled quality path by default, and automatic number-of-clusters (K) selection when you do not know the answer up front.

The fastest path

clusterer = clostera.Clusterer(k=256, fastest=True)  # K = number of clusters
labels = clusterer.fit_transform(vectors)

fastest=True turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed assignment stage itself is already fast in both modes.

Out-of-core from parquet

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

If the original float vectors do not fit comfortably in RAM, add max_ram_bytes=.... If they do fit, you do not need to think about it.

⚡️ The Miracle of 30.8x: Bending Time

The original repository proved a powerful idea: by clustering in PQ code space instead of dense float space, single-machine clustering suddenly stops sounding ridiculous. That idea aged well. The surrounding implementation did not.

clostera asks the obvious follow-up question:

what happens if you rebuild the original pqkmeans project properly for modern hardware and modern Python workflows?

On the committed deterministic 10M x 2048 checkpoint, the answer is not subtle.

| Metric (10M x 2048) | original | clostera-fastest | clostera-quality |
| --- | --- | --- | --- |
| Encode time | 222.94 s | 7.24 s | 131.34 s |
| Cluster time | 80.19 s | 4.50 s | 4.39 s |
| Reconstruction MSE | 0.15160 | 0.12354 | 0.05494 |
| Purity | 0.6573 | 1.0000 | 1.0000 |

That means:

  • 30.8x faster encoding than the original implementation on the headline checkpoint.
  • 17.8x faster clustering on the same full-core run.
  • Better clustering quality even on the fastest path.
  • A quality-first OPQ mode that dramatically lowers reconstruction error when fidelity matters more than raw throughput.
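
The headline ratios follow directly from the committed table; as a quick arithmetic sanity check:

```python
# Numbers taken verbatim from the 10M x 2048 checkpoint table above.
orig_encode, fast_encode = 222.94, 7.24
orig_cluster, fast_cluster = 80.19, 4.50
fast_mse, quality_mse = 0.12354, 0.05494

encode_speedup = orig_encode / fast_encode      # ~30.8x
cluster_speedup = orig_cluster / fast_cluster   # ~17.8x
mse_ratio = fast_mse / quality_mse              # OPQ cuts reconstruction MSE ~2.25x

print(f"{encode_speedup:.1f}x encode, {cluster_speedup:.1f}x cluster, {mse_ratio:.2f}x MSE")
```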

10M by 2048 benchmark figure

💾 The Alchemy of Memory: Zero-RAM Scaling

At billion-vector scale, the algorithm is only half the story. Memory movement is usually the real bottleneck.

clostera is built around that reality:

  • raw numpy.ndarray input works out of the box
  • parquet is a first-class input format
  • fixed-size-list vector columns and plain numeric scalar columns are both supported
  • max_ram_bytes bounds the working set when the original float vectors do not fit
  • raw vectors can be streamed while PQ codes spill to disk automatically when needed
  • numpy.memmap fits naturally into the same workflow

This is the practical difference between a paper result and a pipeline you can actually operate.

A 2D example using k-means, clostera-quality, and clostera-fastest

2D comparison of k-means, clostera-quality, and clostera-fastest

Large-scale evaluation

Large-scale evaluation summary table

🧠 The Oracle of K: Automatic number of clusters without guesswork

Choosing K (the number of clusters) used to mean elbow plots, trial-and-error, and pretending you were more certain than you really were.

clostera lets you pass k=None to Clusterer, PQKMeans, or OPQMeans when you do not know the number of clusters in advance. The candidate analysis runs in Rust, reuses the already-trained encoder and the already-encoded PQ code matrix, and does not regenerate the expensive intermediate artifacts for each candidate number of clusters (K).

On the committed deterministic benchmark sweep, the default centroid_silhouette selector recovered the exact true cluster count in 20/20 cases.

  • centroid_silhouette: 20/20 exact matches, 0.00 mean absolute error
  • davies_bouldin: 18/20 exact matches, 0.90 mean absolute error
  • elbow: 18/20 exact matches, 1.60 mean absolute error
  • bic: 3/20 exact matches, 50.40 mean absolute error
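
To make the default selector concrete, here is a simplified toy sketch of a centroid-silhouette-style score (an illustration of the idea, not clostera's internal Rust implementation): each point is scored by how much closer it is to its own centroid than to the nearest other centroid, and the candidate K whose labeling earns the best mean score wins.

```python
import numpy as np

def centroid_silhouette(points, labels):
    """Silhouette-style score using centroids instead of full pairwise distances:
    a = distance to own centroid, b = distance to the nearest other centroid."""
    centroids = np.stack([points[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    a = d[np.arange(len(points)), labels]
    d_other = d.copy()
    d_other[np.arange(len(points)), labels] = np.inf
    b = d_other.min(axis=1)
    return float(np.mean((b - a) / np.maximum(a, b)))

# Three tight, well-separated clusters in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_labels = np.repeat(np.arange(3), 50)
points = centers[true_labels] + 0.1 * rng.standard_normal((150, 2))

good = centroid_silhouette(points, true_labels)               # correct K = 3 split
bad = centroid_silhouette(points, np.minimum(true_labels, 1)) # K = 2, two clusters merged
print(good, bad)  # the correct K scores higher
```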

Automatic number-of-clusters (K) selection benchmark figure

💎 The Obsidian Core: Engineered for modern silicon

clostera is built for people who care about practical speed, reproducibility, and a sane deployment story.

  • Clusterer is the simple default API for normal use.
  • fastest=True gives you the maximum-throughput plain-PQ path.
  • The default path keeps OPQ on and favors quality.
  • The advanced split into PQEncoder / PQKMeans and OPQEncoder / OPQMeans is still there when you need it.
  • The hot paths use full-core Rust + Rayon, BLAS/LAPACK-backed dense math, x86 SIMD, and Apple Silicon NEON kernels.
  • Wheels are built for manylinux_2_28 x86_64 and aarch64, plus macOS x86_64 and arm64.
  • Deterministic seeds, deterministic synthetic datasets, and committed benchmark artifacts make the claims inspectable.

End-to-end clustering pipeline time and quality tradeoff across deterministic benchmark families

🔁 From research repo to production rewrite

The original project matters because it proved the idea. clostera exists because that idea deserved a modern implementation.

| Area | Original pqkmeans | clostera |
| --- | --- | --- |
| Core implementation | Older Python/C++ reference stack | Rust core with PyO3 bindings and maturin packaging |
| PQ codebook initialization | Basic point-picked initialization | Deterministic PCA-quantile seeding with deterministic fallback |
| Cluster initialization | Random center picking in PQ code space | Deterministic farthest-first seeding in PQ code space |
| Quality modes | Plain PQ | Default OPQ-backed quality path plus an explicit fastest plain-PQ mode |
| Choosing K (number of clusters) | User supplies K | User supplies K or lets Rust-side auto-selection choose it with k=None |
| CPU path | OpenMP-era reference implementation | Rayon-parallel hot paths, BLAS/LAPACK-backed math, x86 SIMD, Apple Silicon NEON |
| Python workflows | NumPy-centric | NumPy arrays, parquet streaming, memmapped code output, RAM-bounded out-of-core workflows, deterministic synthetic datasets |
| Packaging | Source build expectations | manylinux_2_28 x86_64 and aarch64, macOS x86_64 and arm64, CPython 3.10 through 3.13 |
| Benchmarking | Research notebooks and limited comparison artifacts | Deterministic benchmark suite with throughput and clustering-quality metrics, plots, and a showcase notebook |

📊 The Benchmarks of Truth

The README carries committed, deterministic benchmarks because this project should win on numbers, not adjectives.

Large-scale checkpoint: 10,000,000 x 2048

This is the scale checkpoint the rewrite has to answer for: 64 clusters, one machine, and a dataset large enough that hand-waving stops being useful.

Thread settings used for the max-throughput configuration:

  • 24 BLAS threads
  • 24 OpenMP threads
  • 24 Rayon threads

| Variant | Encode s | Cluster s | Recon MSE | Purity |
| --- | --- | --- | --- | --- |
| original | 222.94 | 80.19 | 0.15160 | 0.6573 |
| clostera-fastest | 7.24 | 4.50 | 0.12354 | 1.0000 |
| clostera-quality | 131.34 | 4.39 | 0.05494 | 1.0000 |

How to read that table:

  • clostera-fastest is the throughput configuration. It is the answer when raw encode speed matters most.
  • clostera-quality is the quality configuration. It spends more time on rotation but cuts reconstruction MSE by 2.25x versus clostera-fastest and by 2.76x versus the original implementation.
  • Even before OPQ, the Rust rewrite already beats the original implementation on both throughput and cluster quality.

10M by 2048 benchmark figure

K sweep: how the number of clusters changes runtime

We also ran a deterministic K sweep on the same 200k x 2048 block-mixed family used in the benchmark suite. Here K means the number of clusters. This isolates the clustering stage: each implementation trains and encodes once, then we sweep K = 16, 32, 64, 128, 256 over the same PQ codes.

| K (number of clusters) | original cluster s | clostera-fastest cluster s | original / clostera-fastest speedup |
| --- | --- | --- | --- |
| 16 | 1.088 | 0.047 | 22.92x |
| 32 | 1.404 | 0.064 | 21.83x |
| 64 | 1.488 | 0.111 | 13.43x |
| 128 | 1.597 | 0.205 | 7.80x |
| 256 | 1.646 | 0.315 | 5.22x |

What this sweep says:

  • The original implementation slows steadily as K rises and stays well behind clostera-fastest at every point in the published sweep.
  • The important point is not just the ranking. It is that clostera-fastest keeps clustering comfortably sub-second through K = 256 clusters on 200k x 2048, while the original implementation stays well above the one-second mark.

Clustering time versus K (number of clusters) on deterministic block mixed data

N sweep: how runtime scales with dataset size

We also fixed the algorithm configuration at K = 64 clusters, M = 64, Ks = 64 and swept the deterministic 2048-dimensional block-mixed dataset from 50k to 800k rows. Each point below uses a 16,384-row warm-up and reports the median of 3 timing runs, so the curve reflects steady-state runtime rather than first-call overhead.

| N | original encode s | clostera-fastest encode s | Encode speedup | original cluster s | clostera-fastest cluster s | Cluster speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 50k | 0.680 | 0.037 | 18.39x | 0.295 | 0.032 | 9.11x |
| 100k | 1.925 | 0.073 | 26.41x | 0.602 | 0.057 | 10.64x |
| 200k | 3.697 | 0.145 | 25.47x | 1.258 | 0.109 | 11.58x |
| 400k | 6.921 | 0.298 | 23.25x | 2.851 | 0.185 | 15.41x |
| 800k | 12.873 | 0.641 | 20.09x | 5.680 | 0.372 | 15.28x |

What this sweep says:

  • Encode cost is close to linear in N for every implementation, but the slope is radically different: clostera-fastest holds roughly 1.25M to 1.38M vectors/s once the warm-up is out of the way, while the original implementation stays near 52k to 74k vectors/s.
  • At fixed K = 64 clusters, clustering also scales cleanly with dataset size. clostera-fastest stays about 9x to 15x faster than the original implementation across the full sweep.
  • The main point for capacity planning is that scaling by N looks predictable, not erratic. That matters when you are extrapolating from pilot runs to hundreds of millions or billions of vectors.
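
The throughput figures quoted above are just row count divided by encode time; deriving them from the committed table:

```python
# Encode times (seconds) for clostera-fastest from the N sweep table above.
encode_s = {50_000: 0.037, 100_000: 0.073, 200_000: 0.145, 400_000: 0.298, 800_000: 0.641}

throughput = {n: n / t for n, t in encode_s.items()}  # vectors per second
for n, v in sorted(throughput.items()):
    print(f"{n:>7} rows: {v / 1e6:.2f}M vectors/s")
```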

Encoding and clustering time versus dataset size on deterministic block mixed data

Distribution suite: speed and quality across different data families

We do not benchmark on one flattering Gaussian and declare victory. The committed suite now runs deterministic 10M-vector workloads for:

  • Gaussian data
  • anisotropic Gaussian data
  • Student-t heavy-tailed data
  • block-mixed 2048-dimensional data

For each scenario we track:

  • encode throughput
  • clustering throughput
  • reconstruction MSE
  • purity
  • adjusted Rand index
  • normalized mutual information
  • v-measure
  • assigned-center MSE
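
Purity, for instance, is simple to state (a reference sketch of the standard metric, not clostera's benchmark harness): each cluster is credited with its most common true label, and purity is the fraction of points covered by those majorities.

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Fraction of points whose cluster's majority true label matches their own."""
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()  # count of the cluster's majority label
    return total / len(true_labels)

true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])  # one point of class 0 mis-clustered
print(purity(true, pred))  # 5 of 6 points sit under their cluster's majority label
```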

Across the suite:

  • clostera-fastest improves encode throughput over the original implementation by 25.35x to 32.72x.
  • clostera-quality reduces reconstruction error by 2.40x to 3.74x relative to clostera-fastest.
  • on end-to-end pipeline time, clostera-quality is faster than the original implementation on every committed 10M-vector suite scenario.
  • the original implementation is slower and has visibly worse clustering quality on every committed scenario.

Reconstruction error across deterministic datasets

Clustering purity across deterministic datasets

🍏 Apple Silicon is a first-class target

Modern ARM machines are not a side quest. clostera treats them like real production hardware.

  • aarch64 uses native NEON distance kernels for the common PQ subvector sizes 4, 8, 16, 32, and 64.
  • The PQ assignment path is no longer “build a buffer and scan it later”. It now uses a fused lookup-accumulate-and-select kernel plus SIMD-backed argmin, which matters on Apple Silicon because clustering on PQ codes is often dominated by assignment rather than raw distance evaluation.
  • The release workflow builds macOS arm64 wheels alongside x86_64 wheels.
  • The same wheel matrix also covers manylinux_2_28 x86_64 and aarch64.
  • The release configuration uses openblas-static so published wheels are as self-contained as practical.

If you are running on Apple Silicon, this is not a Rosetta fallback story. There is architecture-specific code in the hot path and packaging support in the release pipeline.
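
The lookup-table trick those kernels fuse is easy to state in plain NumPy (a conceptual sketch, not the NEON kernel itself): precompute, once per query, the squared distance from each query chunk to every codeword, and the distance to any encoded vector then costs only M table lookups.

```python
import numpy as np

rng = np.random.default_rng(1)
M, Ks, sub_dim = 8, 256, 16                             # hypothetical PQ geometry
codebooks = rng.standard_normal((M, Ks, sub_dim)).astype(np.float32)
codes = rng.integers(0, Ks, size=(1000, M))             # encoded database
query = rng.standard_normal(M * sub_dim).astype(np.float32)

# One table per subspace: squared distance from the query chunk to each codeword.
q = query.reshape(M, sub_dim)
lut = ((codebooks - q[:, None, :]) ** 2).sum(axis=2)    # shape (M, Ks)

# Asymmetric distance to every code = sum of M lookups, no per-vector float math.
dists = lut[np.arange(M), codes].sum(axis=1)            # shape (1000,)

# Cross-check one vector against decoding it and measuring directly.
decoded0 = np.concatenate([codebooks[m, codes[0, m]] for m in range(M)])
print(np.allclose(dists[0], ((decoded0 - query) ** 2).sum(), rtol=1e-3))
```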

🔧 Under the hood: better initialization, less luck

One of the quietest but most important differences from the original repository is that clostera treats initialization like a real engineering problem instead of a footnote.

  • PQEncoder uses deterministic PCA-quantile initialization per subspace, rather than hoping random point picks land in a good configuration.
  • PQKMeans uses deterministic farthest-first seeding in PQ code space for better initial coverage.
  • The default quality path refines an orthogonal rotation before final PQ training, which is where the large OPQ quality gains come from on correlated high-dimensional data.

That shows up as more stable training, fewer pathological runs, and better quality at the same code budget. The headline speedups are not coming from luckier random seeds.
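
Farthest-first seeding itself is easy to picture (a toy float-space sketch; clostera runs the equivalent deterministically in PQ code space): start from one point, then repeatedly pick the point farthest from all centers chosen so far.

```python
import numpy as np

def farthest_first(points, k, first=0):
    """Greedy farthest-first traversal; deterministic given the first index."""
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

pts = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
print(farthest_first(pts, 3))  # picks the extremes before the middle: [0, 4, 2]
```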

Installation

PyPI

pip install clostera

Optional extras:

pip install "clostera[benchmarks]"
pip install "clostera[notebook]"

Build from source

System BLAS/LAPACK build:

python -m pip install maturin
python -m maturin develop --release

Static OpenBLAS build:

python -m maturin develop --release --no-default-features --features openblas-static

More common workflows

Simple workflow

import numpy as np
import clostera

rng = np.random.default_rng(7)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

Known number-of-clusters (K) workflow

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Fastest throughput workflow with a known number of clusters (K)

clusterer = clostera.Clusterer(k=known_k, fastest=True)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Predict on new vectors

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
clusterer.fit(vectors)
labels = clusterer.transform(vectors[:1024])

Parquet workflow

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

Out-of-core raw-vector workflow

When the original float vectors do not fit in RAM, pass a parquet path or a numpy.memmap-backed matrix and set max_ram_bytes.

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(
    "vectors.parquet",
    max_ram_bytes=8 << 30,
)

With max_ram_bytes, clostera keeps the training sample bounded, streams raw vectors in batches during encoding, and automatically spills PQ codes to a temporary memmap when needed. The raw vector matrix no longer needs to fit in RAM all at once. If you already materialized the data as a normal in-memory ndarray, clostera can only bound its own additional working set; for truly out-of-core runs, use parquet or numpy.memmap.
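
Building a memmap-backed matrix for that workflow takes only NumPy; the sketch below prepares one on disk, with the final Clusterer call shown as a comment because it mirrors the parquet example above.

```python
import os
import tempfile

import numpy as np

# Materialize a float32 matrix on disk, then reopen it read-only as a memmap,
# so only the pages actually being read occupy RAM.
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
rows, dim = 100_000, 128
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(rows, dim))
mm[:] = np.random.default_rng(0).standard_normal((rows, dim), dtype=np.float32)
mm.flush()

vectors = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, dim))
# clusterer = clostera.Clusterer(k=None)
# labels = clusterer.fit_transform(vectors, max_ram_bytes=8 << 30)
print(vectors.shape, vectors.dtype)
```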

Advanced API

Most users should start with Clusterer. The lower-level building blocks are still available when you want to:

  • reuse encoded PQ codes across many clustering runs
  • fit encoders and clusterers separately
  • switch explicitly between plain PQ and OPQ
  • tune encoder-specific and clusterer-specific parameters independently

Use Clusterer(fastest=True) when you want the fastest high-level path. Use plain PQEncoder and PQKMeans when you need that same plain-PQ behavior with explicit control. Use OPQEncoder and OPQMeans when reconstruction fidelity matters more and the data has strong cross-subspace correlation.

If you omit num_subquantizers, clostera infers a sensible default from the input dimensionality. For typical embeddings that lands near sqrt(D) code bytes while keeping each subvector wide enough to stay stable.
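
One plausible way to picture that inference (a hypothetical heuristic sketch, not necessarily clostera's exact rule): choose the divisor of D closest to sqrt(D), so every subvector has equal width and the code is about sqrt(D) bytes.

```python
import math

def infer_num_subquantizers(dim: int) -> int:
    """Hypothetical sketch: pick the divisor of `dim` nearest to sqrt(dim)."""
    divisors = [m for m in range(1, dim + 1) if dim % m == 0]
    return min(divisors, key=lambda m: abs(m - math.sqrt(dim)))

print(infer_num_subquantizers(128))   # 8: divisors 8 and 16 straddle sqrt(128) ~ 11.3
```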

encoder = clostera.PQEncoder()
encoder.fit(vectors)
codes = encoder.transform(vectors)

clusterer = clostera.PQKMeans(encoder=encoder, k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(codes)

Showcase notebook

The repository includes a walkthrough notebook designed for readers who want the full visual story.

The committed notebook embeds its static figures directly, so the visuals render in GitHub and standalone notebook viewers without depending on external image paths.

It covers:

  • the high-level Clusterer workflow
  • automatic number-of-clusters (K) selection with k=None
  • parquet workflows
  • toy clustering visualization
  • plain PQ versus OPQ reconstruction quality
  • the advanced encoder/clusterer split when you need it
  • cross-dataset benchmark comparisons
  • the large-scale 10M x 2048 checkpoint
  • K (number of clusters) and N scaling sweeps

Parameter reference

In the API tables below, PathLike means a plain path string or a pathlib.Path object.

Clusterer

Clusterer is the default high-level API. It hides the encoder/clusterer split and gives the common workflow a simple fit, transform, fit_transform, fit_predict, and predict surface. By default it uses the quality-first OPQ path; pass fastest=True when you want the maximum-throughput plain-PQ path instead.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables automatic number-of-clusters selection. |
| fastest | bool | False | Turn off OPQ and use the maximum-throughput plain-PQ path. This usually lowers reconstruction quality but can reduce total fit time substantially on large runs. |
| num_subquantizers | int \| None | None | Optional PQ subspace count. When omitted, clostera infers a deterministic default from the input dimensionality. |
| codebook_size | int | 256 | Number of codewords per subspace. |
| iterations | int | 20 | Shared iteration budget for the simple high-level API. |
| seed | int | 0 | Deterministic seed. |
| opq_iterations | int | 3 | OPQ refinement steps used on the default quality-first path. When fastest=True, the current code always uses plain PQ and ignores this setting. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

Clusterer.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Raw float vectors as an array, parquet path, or numpy.memmap-backed matrix. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for bounded-memory raw-vector workflows. |

Advanced access after fitting:

  • encoder_: the fitted PQEncoder or OPQEncoder
  • clusterer_: the fitted PQKMeans or OPQMeans
  • labels_, cluster_centers_, inertia_history_, selected_k_, k_selection_

Advanced low-level API

The classes below expose the encoder/clusterer split directly. Reach for them when you want to reuse PQ codes, separate training phases, or tune encoder-specific and clusterer-specific parameters independently.

PQEncoder

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| num_subquantizers | int \| None | None | Number of PQ subspaces M. When omitted, clostera infers a deterministic default from the input dimensionality. Explicit values still require the dimensionality to be divisible by M. |
| codebook_size | int | 256 | Number of codewords per subspace Ks. Supported range is 2..=256. |
| iterations | int | 20 | Number of Lloyd iterations for subspace k-means training. |
| seed | int | 0 | Deterministic seed used for initialization fallback and reproducible training behavior. |
| opq_iterations | int | 0 | Number of OPQ refinement steps. 0 keeps plain PQ, >0 learns an orthogonal rotation before final PQ training. |

OPQEncoder

OPQEncoder has the same API and runtime methods as PQEncoder, but defaults opq_iterations to 3.

PQEncoder.fit(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Batch size for parquet streaming. |
| train_rows | int \| None | None | Number of deterministic training rows to sample. With in-memory arrays, omitting this uses the full matrix unless max_ram_bytes is set. |
| max_ram_bytes | int \| None | None | Optional RAM budget for the training sample plus OPQ workspace. When set, large parquet or memmap-backed inputs are trained from a bounded deterministic sample. |

PQEncoder.transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Dense vectors or parquet input. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix. |
| max_ram_bytes | int \| None | None | Optional RAM budget for batched encoding. Large raw-vector inputs are processed in chunks; if codes would not fit in RAM, provide output_path or call PQKMeans.fit(...) directly. |

PQEncoder.fit_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| train_rows | int \| None | None | Number of deterministic training rows to sample before encoding. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix produced by the transform phase. |
| max_ram_bytes | int \| None | None | Optional RAM budget applied to both training and encoding. |

PQEncoder.inverse_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| codes | np.ndarray | required | A 2D PQ code matrix with shape (rows, num_subquantizers). Returns decoded float32 vectors. |
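
Conceptually, decoding just concatenates the selected codeword from each subspace codebook; a toy NumPy sketch of what inverse_transform computes, with hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
M, Ks, sub_dim = 4, 256, 8          # 4 subspaces, 256 codewords, 8 dims each
codebooks = rng.standard_normal((M, Ks, sub_dim)).astype(np.float32)
codes = rng.integers(0, Ks, size=(10, M), dtype=np.uint8)

# For each row, look up codeword codes[i, m] in subspace m and concatenate.
decoded = np.concatenate([codebooks[m, codes[:, m]] for m in range(M)], axis=1)
print(decoded.shape)  # (10, 32): rows x (M * sub_dim)
```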

PQKMeans

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder | required | Trained encoder that defines the codebooks. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| seed | int | 0 | Deterministic seed for cluster-center initialization. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans

OPQMeans mirrors PQKMeans, but treats OPQ as the default rather than an extra knob. If you do not pass encoder=, it lazily creates and fits an OPQEncoder from the raw vectors or parquet source on first fit(...), fit_predict(...), or fit_transform(...). If you do pass encoder=, the current code requires it to have been trained with opq_iterations > 0.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder \| None | None | Optional pre-trained OPQ encoder. If omitted, OPQMeans builds one automatically. |
| num_subquantizers | int \| None | None | Optional encoder-side PQ subspace count when encoder is omitted. |
| codebook_size | int | 256 | Optional encoder-side codebook size when encoder is omitted. |
| encoder_iterations | int | 20 | Encoder training iterations used when encoder is omitted. |
| seed | int | 0 | Deterministic seed shared by the implicit encoder and the clusterer. |
| opq_iterations | int | 3 | OPQ refinement steps used by the implicit encoder. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans uses the same runtime method signatures as PQKMeans: fit(...), transform(...), fit_transform(...), fit_predict(...), and predict(...).

PQKMeans.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Either raw vectors or precomputed PQ codes. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for encoding raw vectors into PQ codes before clustering. When set and no codes_output_path is supplied, clostera creates a temporary memmap automatically. |

When k=None, fitting also populates:

  • selected_k_: the final chosen cluster count (K)
  • k_selection_: the full Rust-side selection report, including the tested candidate values and per-method scores

Advanced runtime knob

| Environment variable | Meaning |
| --- | --- |
| CLOSTERA_ROTATION_BATCH_MIB | Override the default OPQ rotation batch target in MiB for benchmarking or machine-specific tuning. |
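
Since the knob is read from the environment, the safest ordering is to set it before the clostera work starts in the process; a minimal sketch (the clostera import is commented out here):

```python
import os

# Set before clostera runs any OPQ rotation work in this process.
os.environ["CLOSTERA_ROTATION_BATCH_MIB"] = "64"
# import clostera  # subsequent OPQ fits target 64 MiB rotation batches
```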

Reproducing the benchmark artifacts

Generate a deterministic synthetic dataset

python scripts/generate_synthetic_dataset.py \
  --output-dir .artifacts/block-mixed-200k-2048 \
  --distribution block_mixed \
  --rows 200000 \
  --dim 2048 \
  --clusters 64 \
  --seed 11

Compare the original repo and clostera

python scripts/compare_impls.py \
  --dataset-dir .artifacts/block-mixed-200k-2048 \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --train-rows 32768 \
  --metric-sample-rows 32768 \
  --num-subquantizers 64 \
  --codebook-size 64 \
  --pq-iterations 6 \
  --cluster-k 64 \
  --cluster-iterations 4 \
  --opq-iterations 3 \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --output-json .artifacts/block-mixed-200k-2048/compare.json

Run the K (number of clusters) sweep

python scripts/benchmark_k_sweep.py \
  --dataset-dir .artifacts/k-sweep-block-mixed-200k-2048 \
  --output-json benchmarks/results/k-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the N sweep

python scripts/benchmark_n_sweep.py \
  --dataset-dir .artifacts/n-sweep-block-mixed-800k-2048 \
  --output-json benchmarks/results/n-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the full deterministic distribution suite

python scripts/benchmark_suite.py \
  --output-dir .artifacts/benchmark-suite \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --force

Run the automatic number-of-clusters (K) selection sweep

python scripts/evaluate_auto_k_methods.py \
  --output-json benchmarks/results/auto-k-methods.json \
  --force

Render the README and notebook figures

python scripts/render_benchmark_assets.py \
  --suite-json benchmarks/results/benchmark-suite.json \
  --large-json benchmarks/results/large-scale-10m.json \
  --k-sweep-json benchmarks/results/k-sweep.json \
  --n-sweep-json benchmarks/results/n-sweep.json \
  --auto-k-json benchmarks/results/auto-k-methods.json \
  --output-dir docs/assets

Packaging and release

The repository already includes publication artifacts for:

  • manylinux_2_28 wheels for x86_64 and aarch64
  • macOS wheels for x86_64 and arm64
  • CPython 3.10 through 3.13
  • source distributions

Relevant files:

  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • rust-toolchain.toml

The release workflow builds wheels with openblas-static enabled so binary installs are as self-contained as practical.

Releasing to PyPI

The PyPI project name is clostera.

Once the one-time PyPI Trusted Publisher setup is done for:

  • owner: BaseModelAI
  • repository: clostera
  • workflow: .github/workflows/release.yml
  • environment: pypi

the normal release path is:

python scripts/release.py 1.0.1 --commit --tag --push

That updates the version in the release metadata, creates the release commit, creates tag v1.0.1, and pushes both to origin. The tag push triggers the GitHub release workflow, which builds the wheels and publishes them to PyPI.
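The tag-name convention is the load-bearing part of that flow, since the v-prefixed tag is what triggers the release workflow. A minimal sketch of the mapping (`tag_for` is a hypothetical helper for illustration, not code from `scripts/release.py`):

```python
import re

def tag_for(version: str) -> str:
    # Releases are tagged v<version>, e.g. 1.0.1 -> v1.0.1.
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"not a plain X.Y.Z version: {version!r}")
    return f"v{version}"

print(tag_for("1.0.1"))  # → v1.0.1
```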

Original project and related work

Original implementation

Core papers behind this repo

Useful related reading

Verification

Current local verification commands:

python -m maturin develop --release
cargo test --release
pytest -q
cargo check --no-default-features --features openblas-static
cargo bench --bench core_bench

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide on installing packages.

Source Distribution

  • clostera-1.0.0.tar.gz (5.1 MB; Source)

Built Distributions

If you're not sure about the file name format, see the documentation on wheel file names.

  • clostera-1.0.0-cp313-cp313-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.13, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp313-cp313-manylinux_2_28_aarch64.whl (692.6 kB; CPython 3.13, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp313-cp313-macosx_11_0_arm64.whl (654.6 kB; CPython 3.13, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.12, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp312-cp312-manylinux_2_28_aarch64.whl (693.0 kB; CPython 3.12, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp312-cp312-macosx_11_0_arm64.whl (655.2 kB; CPython 3.12, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.11, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp311-cp311-manylinux_2_28_aarch64.whl (696.7 kB; CPython 3.11, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp311-cp311-macosx_11_0_arm64.whl (658.6 kB; CPython 3.11, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.10, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp310-cp310-manylinux_2_28_aarch64.whl (694.0 kB; CPython 3.10, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp310-cp310-macosx_11_0_arm64.whl (658.5 kB; CPython 3.10, macOS 11.0+, ARM64)

File details

Details for the file clostera-1.0.0.tar.gz.

File metadata

  • Download URL: clostera-1.0.0.tar.gz
  • Size: 5.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clostera-1.0.0.tar.gz

  • SHA256: b4885286169041cc0d3c8309930915398ddcdb9ca13b46d5a4b7c0ceebf9cd39
  • MD5: 84d43fa744c668512c4e02f968780a16
  • BLAKE2b-256: 16220e304b1589874b65ccbe1060f8b5d57537d29304b27ae33f147a053f7b3e

See the PyPI documentation for more details on using hashes.

Provenance

The following attestation bundles were made for clostera-1.0.0.tar.gz:

Publisher: release.yml on BaseModelAI/clostera

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
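Published digests like the ones above can be checked locally with nothing but the standard library. `sha256_hex` is an illustrative helper (not shipped with clostera); the expected value is the SHA256 digest listed for the sdist:

```python
import hashlib

def sha256_hex(path: str) -> str:
    # Stream in 1 MiB chunks so large artifacts never sit fully in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "b4885286169041cc0d3c8309930915398ddcdb9ca13b46d5a4b7c0ceebf9cd39"
# After downloading the sdist into the working directory:
# assert sha256_hex("clostera-1.0.0.tar.gz") == expected
```

pip can perform the same check automatically via hash-checking mode in a requirements file.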

Hashes for clostera-1.0.0-cp313-cp313-manylinux_2_28_x86_64.whl

  • SHA256: 094f27a143aae58309af0812646f9cc7b886258faa22501e8530731d721625e7
  • MD5: 1a7ebb77513c41bd8f8fdc3413d4472d
  • BLAKE2b-256: 5afe22f88b73d1272a700cf912a1407aef64cfd03333fe47062cbc07da6d8dc5

Hashes for clostera-1.0.0-cp313-cp313-manylinux_2_28_aarch64.whl

  • SHA256: ac0dc98b28a97276d18528a705603726d2a8fe497e4fdf251e773ec18266d33c
  • MD5: b92a6cec4d14837c4fcc25dc5b922644
  • BLAKE2b-256: 028f2d7a373a082bdf757a09ff98e28059d4494eb20d12701f3a43cc48860932

Hashes for clostera-1.0.0-cp313-cp313-macosx_11_0_arm64.whl

  • SHA256: 1dc3ff9a261ad15d25dc1df4a46e3dfae30d73ce3e902bf970548abc9138321a
  • MD5: c9e2063974a53ea7dac3917133f3331b
  • BLAKE2b-256: e3859008ea7bf81f557f56b638dfe9a4b22a895128bdf9c066f5e7b51019fbf5

Hashes for clostera-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl

  • SHA256: ff1240ffcd6856a4c8134b10318a69a2e52a6044eb561ac1cbbc20df3a811b15
  • MD5: e719da0160ad53043f20735a2631d51d
  • BLAKE2b-256: 700c8cdfb1aeab684c9111e41ff9ad8a71b32b7b88689194ce9cee28e21d68ee

Hashes for clostera-1.0.0-cp312-cp312-manylinux_2_28_aarch64.whl

  • SHA256: 8b86073ca57aa7011ba7fe19ebd7318d9c7e06876003bdfe6cbdbcf6297ba8bb
  • MD5: 66694fd1624ee2d2b2c0ef5f93e3b7de
  • BLAKE2b-256: d814b4101307c3894d5f7744e3371cebab458c354a96b14e7b066ddee4360b89

Hashes for clostera-1.0.0-cp312-cp312-macosx_11_0_arm64.whl

  • SHA256: 52251121ed33cfc3edd43967574f63f253b4922406b029607de7695108f6119c
  • MD5: fbd570b0c6fa842ea0a7b166fd03cc71
  • BLAKE2b-256: 43bbd8f6d6a8ec08b72f066f6dfa2d13e2e3af9c580d209b8a4d2c818f5d0a9c

Hashes for clostera-1.0.0-cp311-cp311-manylinux_2_28_x86_64.whl

  • SHA256: 4282f75e432827e6facf5cf79a3651f18eab7a2b6caa9fec48cad9aecdb11b02
  • MD5: 1ff76fb7584ee80ac41cbdd724390471
  • BLAKE2b-256: ee57a0e7096e159cc544fe3cf63692be9dfd8758e06e08cf021617ffa8bcd425

Hashes for clostera-1.0.0-cp311-cp311-manylinux_2_28_aarch64.whl

  • SHA256: 2c64c4095577e48aed9d520c29630a4a25d7864f08dadb600a4a65eca3474817
  • MD5: a88f5f48c0053b316c67e1e77a95ab86
  • BLAKE2b-256: f01403b86dc5346fbd2fd20c3301c6a962b5847c7b49fe1cdbfe6d5c8e515e5a

Hashes for clostera-1.0.0-cp311-cp311-macosx_11_0_arm64.whl

  • SHA256: bed2978f75668110fb944137dc181a095c5d2563285def96954790da94d1ad1d
  • MD5: f8c6bff4d5b6fd059c50d596d55c10d0
  • BLAKE2b-256: 7429d498ead088c14614c78690a5a91fe9166cb9fcf8ffccc9e69101c1ec2623

Hashes for clostera-1.0.0-cp310-cp310-manylinux_2_28_x86_64.whl

  • SHA256: 5540393c33950f32d8a28589d99be01ad8de2936d8ebd66a706a13e661e7042d
  • MD5: 59750712c5544baef88cc2b34d09fa05
  • BLAKE2b-256: 71774c4bc3c48a4a468aec4049c7f3be9e7b3e3cd87c6235bc18693810ff28a6

Hashes for clostera-1.0.0-cp310-cp310-manylinux_2_28_aarch64.whl

  • SHA256: 8a5c91b6c521e75be3955c866ee1eb0160f51c97d1c3109605b7eb3f4afbec8a
  • MD5: dafd181e7a9bf9d57c5d784b0d7731a3
  • BLAKE2b-256: 30ebdf9f727a0ed505d419c5675a1c30d3b8e0b18fbfa247b42baed4d7fc42ea

Hashes for clostera-1.0.0-cp310-cp310-macosx_11_0_arm64.whl

  • SHA256: 931f7fcaa348b1f6eadddce32f7c55f7b91ae176f5d1e0b58fc703b22a3215a1
  • MD5: c0979ed4eec0d1b3313ada1e15a40235
  • BLAKE2b-256: d6cd0ac78ffadc3023300234ea59b4dd6a6be41c791b7da25a15461ab7c719dc
