Modern Rust rewrite of the original pqkmeans project for large-scale clustering with numpy and parquet workflows

clostera: The Billion-Vector Resurrection

clostera benchmark summary

They told you that clustering massive high-dimensional vector collections on a single machine was a fool's errand. They said you needed a cluster, a distributed headache, and a cloud bill large enough to ruin your week. They were wrong.

clostera is a from-scratch Rust rebuild of the original pqkmeans repository, aimed at the workloads that made that project exciting in the first place: extremely large vector collections, high dimensionality, single-machine practicality, and performance that is measured rather than hoped for.

This is not a thin wrapper around old code. It is a modern rewrite with a new Rust core, a NumPy-first Python layer, parquet and out-of-core workflows, deterministic benchmarks, automatic number-of-clusters (K) selection, Apple Silicon support, and wheels that install like a normal Python package.

Rust core · Rayon · OpenBLAS/LAPACK · AVX2/SSE · Apple Silicon NEON · NumPy + parquet · manylinux + macOS wheels

pip install clostera

⚡️ Quick Start: It just works

The zero-tuning path

import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)
clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

That is the default story: one object, raw vectors in, labels out, OPQ-enabled quality path by default, and automatic number-of-clusters (K) selection when you do not know the answer up front.

The fastest path

clusterer = clostera.Clusterer(k=256, fastest=True)  # K = number of clusters
labels = clusterer.fit_transform(vectors)

fastest=True turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed assignment stage itself is already fast in both modes.

Out-of-core from parquet

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

If the original float vectors do not fit comfortably in RAM, add max_ram_bytes=.... If they do fit, you do not need to think about it.

⚡️ The Miracle of 30.8x: Bending Time

The original repository proved a powerful idea: by clustering in PQ code space instead of dense float space, single-machine clustering suddenly stops sounding ridiculous. That idea aged well. The surrounding implementation did not.

clostera asks the obvious follow-up question:

what happens if you rebuild the original pqkmeans project properly for modern hardware and modern Python workflows?

On the committed deterministic 10M x 2048 checkpoint, the answer is not subtle.

| Metric (10M x 2048) | original | clostera-fastest | clostera-quality |
| --- | --- | --- | --- |
| Encode time | 222.94 s | 7.24 s | 131.34 s |
| Cluster time | 80.19 s | 4.50 s | 4.39 s |
| Reconstruction MSE | 0.15160 | 0.12354 | 0.05494 |
| Purity | 0.6573 | 1.0000 | 1.0000 |

That means:

  • 30.8x faster encoding than the original implementation on the headline checkpoint.
  • 17.8x faster clustering on the same full-core run.
  • Better clustering quality even on the fastest path.
  • A quality-first OPQ mode that dramatically lowers reconstruction error when fidelity matters more than raw throughput.
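
The headline ratios follow directly from the committed table; as a quick arithmetic sanity check:

```python
# Numbers taken verbatim from the 10M x 2048 checkpoint table above.
orig_encode, fast_encode = 222.94, 7.24
orig_cluster, fast_cluster = 80.19, 4.50
fast_mse, quality_mse = 0.12354, 0.05494

encode_speedup = orig_encode / fast_encode      # ~30.8x
cluster_speedup = orig_cluster / fast_cluster   # ~17.8x
mse_ratio = fast_mse / quality_mse              # OPQ cuts reconstruction MSE ~2.25x

print(f"{encode_speedup:.1f}x encode, {cluster_speedup:.1f}x cluster, {mse_ratio:.2f}x MSE")
```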

10M by 2048 benchmark figure

💾 The Alchemy of Memory: Zero-RAM Scaling

At billion-vector scale, the algorithm is only half the story. Memory movement is usually the real bottleneck.

clostera is built around that reality:

  • raw numpy.ndarray input works out of the box
  • parquet is a first-class input format
  • fixed-size-list vector columns and plain numeric scalar columns are both supported
  • max_ram_bytes bounds the working set when the original float vectors do not fit
  • raw vectors can be streamed while PQ codes spill to disk automatically when needed
  • numpy.memmap fits naturally into the same workflow

This is the practical difference between a paper result and a pipeline you can actually operate.

A 2D example using k-means, clostera-quality, and clostera-fastest

2D comparison of k-means, clostera-quality, and clostera-fastest

Large-scale evaluation

Large-scale evaluation summary table

🧠 The Oracle of K: Automatic number of clusters without guesswork

Choosing K (the number of clusters) used to mean elbow plots, trial-and-error, and pretending you were more certain than you really were.

clostera lets you pass k=None to Clusterer, PQKMeans, or OPQMeans when you do not know the number of clusters in advance. The candidate analysis runs in Rust, reuses the already-trained encoder and the already-encoded PQ code matrix, and does not regenerate the expensive intermediate artifacts for each candidate number of clusters (K).

On the committed deterministic benchmark sweep, the default centroid_silhouette selector recovered the exact true cluster count in 20/20 cases.

  • centroid_silhouette: 20/20 exact matches, 0.00 mean absolute error
  • davies_bouldin: 18/20 exact matches, 0.90 mean absolute error
  • elbow: 18/20 exact matches, 1.60 mean absolute error
  • bic: 3/20 exact matches, 50.40 mean absolute error
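
To make the default selector concrete, here is a simplified toy sketch of a centroid-silhouette-style score (an illustration of the idea, not clostera's internal Rust implementation): each point is scored by how much closer it is to its own centroid than to the nearest other centroid, and the candidate K whose labeling earns the best mean score wins.

```python
import numpy as np

def centroid_silhouette(points, labels):
    """Silhouette-style score using centroids instead of full pairwise distances:
    a = distance to own centroid, b = distance to the nearest other centroid."""
    centroids = np.stack([points[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    a = d[np.arange(len(points)), labels]
    d_other = d.copy()
    d_other[np.arange(len(points)), labels] = np.inf
    b = d_other.min(axis=1)
    return float(np.mean((b - a) / np.maximum(a, b)))

# Three tight, well-separated clusters in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_labels = np.repeat(np.arange(3), 50)
points = centers[true_labels] + 0.1 * rng.standard_normal((150, 2))

good = centroid_silhouette(points, true_labels)               # correct K = 3 split
bad = centroid_silhouette(points, np.minimum(true_labels, 1)) # K = 2, two clusters merged
print(good, bad)  # the correct K scores higher
```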

Automatic number-of-clusters (K) selection benchmark figure

💎 The Obsidian Core: Engineered for modern silicon

clostera is built for people who care about practical speed, reproducibility, and a sane deployment story.

  • Clusterer is the simple default API for normal use.
  • fastest=True gives you the maximum-throughput plain-PQ path.
  • The default path keeps OPQ on and favors quality.
  • The advanced split into PQEncoder / PQKMeans and OPQEncoder / OPQMeans is still there when you need it.
  • The hot paths use full-core Rust + Rayon, BLAS/LAPACK-backed dense math, x86 SIMD, and Apple Silicon NEON kernels.
  • Wheels are built for manylinux_2_28 x86_64 and aarch64, plus macOS x86_64 and arm64.
  • Deterministic seeds, deterministic synthetic datasets, and committed benchmark artifacts make the claims inspectable.

End-to-end clustering pipeline time and quality tradeoff across deterministic benchmark families

🔁 From research repo to production rewrite

The original project matters because it proved the idea. clostera exists because that idea deserved a modern implementation.

| Area | Original pqkmeans | clostera |
| --- | --- | --- |
| Core implementation | Older Python/C++ reference stack | Rust core with PyO3 bindings and maturin packaging |
| PQ codebook initialization | Basic point-picked initialization | Deterministic PCA-quantile seeding with deterministic fallback |
| Cluster initialization | Random center picking in PQ code space | Deterministic farthest-first seeding in PQ code space |
| Quality modes | Plain PQ | Default OPQ-backed quality path plus an explicit fastest plain-PQ mode |
| Choosing K (number of clusters) | User supplies K | User supplies K or lets Rust-side auto-selection choose it with k=None |
| CPU path | OpenMP-era reference implementation | Rayon-parallel hot paths, BLAS/LAPACK-backed math, x86 SIMD, Apple Silicon NEON |
| Python workflows | NumPy-centric | NumPy arrays, parquet streaming, memmapped code output, RAM-bounded out-of-core workflows, deterministic synthetic datasets |
| Packaging | Source build expectations | manylinux_2_28 x86_64 and aarch64, macOS x86_64 and arm64, CPython 3.10 through 3.13 |
| Benchmarking | Research notebooks and limited comparison artifacts | Deterministic benchmark suite with throughput and clustering-quality metrics, plots, and a showcase notebook |

📊 The Benchmarks of Truth

The README carries committed, deterministic benchmarks because this project should win on numbers, not adjectives.

Large-scale checkpoint: 10,000,000 x 2048

This is the scale checkpoint the rewrite has to answer for: 64 clusters, one machine, and a dataset large enough that hand-waving stops being useful.

Thread settings used for the max-throughput configuration:

  • 24 BLAS threads
  • 24 OpenMP threads
  • 24 Rayon threads

| Variant | Encode s | Cluster s | Recon MSE | Purity |
| --- | --- | --- | --- | --- |
| original | 222.94 | 80.19 | 0.15160 | 0.6573 |
| clostera-fastest | 7.24 | 4.50 | 0.12354 | 1.0000 |
| clostera-quality | 131.34 | 4.39 | 0.05494 | 1.0000 |

How to read that table:

  • clostera-fastest is the throughput configuration. It is the answer when raw encode speed matters most.
  • clostera-quality is the quality configuration. It spends more time on rotation but cuts reconstruction MSE by 2.25x versus clostera-fastest and by 2.76x versus the original implementation.
  • Even before OPQ, the Rust rewrite already beats the original implementation on both throughput and cluster quality.

10M by 2048 benchmark figure

K sweep: how the number of clusters changes runtime

We also ran a deterministic K sweep on the same 200k x 2048 block-mixed family used in the benchmark suite. Here K means the number of clusters. This isolates the clustering stage: each implementation trains and encodes once, then we sweep K = 16, 32, 64, 128, 256 over the same PQ codes.

| K (number of clusters) | original cluster s | clostera-fastest cluster s | original / clostera-fastest speedup |
| --- | --- | --- | --- |
| 16 | 1.088 | 0.047 | 22.92x |
| 32 | 1.404 | 0.064 | 21.83x |
| 64 | 1.488 | 0.111 | 13.43x |
| 128 | 1.597 | 0.205 | 7.80x |
| 256 | 1.646 | 0.315 | 5.22x |

What this sweep says:

  • The original implementation slows steadily as K rises and stays well behind clostera-fastest at every point in the published sweep.
  • The important point is not just the ranking. It is that clostera-fastest keeps clustering comfortably sub-second through K = 256 clusters on 200k x 2048, while the original implementation stays well above the one-second mark.

Clustering time versus K (number of clusters) on deterministic block mixed data

N sweep: how runtime scales with dataset size

We also fixed the algorithm configuration at K = 64 clusters, M = 64, Ks = 64 and swept the deterministic 2048-dimensional block-mixed dataset from 50k to 800k rows. Each point below uses a 16,384-row warm-up and reports the median of 3 timing runs, so the curve reflects steady-state runtime rather than first-call overhead.

| N | original encode s | clostera-fastest encode s | Encode speedup | original cluster s | clostera-fastest cluster s | Cluster speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 50k | 0.680 | 0.037 | 18.39x | 0.295 | 0.032 | 9.11x |
| 100k | 1.925 | 0.073 | 26.41x | 0.602 | 0.057 | 10.64x |
| 200k | 3.697 | 0.145 | 25.47x | 1.258 | 0.109 | 11.58x |
| 400k | 6.921 | 0.298 | 23.25x | 2.851 | 0.185 | 15.41x |
| 800k | 12.873 | 0.641 | 20.09x | 5.680 | 0.372 | 15.28x |

What this sweep says:

  • Encode cost is close to linear in N for every implementation, but the slope is radically different: clostera-fastest holds roughly 1.25M to 1.38M vectors/s once the warm-up is out of the way, while the original implementation stays near 52k to 74k vectors/s.
  • At fixed K = 64 clusters, clustering also scales cleanly with dataset size. clostera-fastest stays about 9x to 15x faster than the original implementation across the full sweep.
  • The main point for capacity planning is that scaling by N looks predictable, not erratic. That matters when you are extrapolating from pilot runs to hundreds of millions or billions of vectors.
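
The throughput figures quoted above are just row count divided by encode time; deriving them from the committed table:

```python
# Encode times (seconds) for clostera-fastest from the N sweep table above.
encode_s = {50_000: 0.037, 100_000: 0.073, 200_000: 0.145, 400_000: 0.298, 800_000: 0.641}

throughput = {n: n / t for n, t in encode_s.items()}  # vectors per second
for n, v in sorted(throughput.items()):
    print(f"{n:>7} rows: {v / 1e6:.2f}M vectors/s")
```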

Encoding and clustering time versus dataset size on deterministic block mixed data

Distribution suite: speed and quality across different data families

We do not benchmark on one flattering Gaussian and declare victory. The committed suite now runs deterministic 10M-vector workloads for:

  • Gaussian data
  • anisotropic Gaussian data
  • Student-t heavy-tailed data
  • block-mixed 2048-dimensional data

For each scenario we track:

  • encode throughput
  • clustering throughput
  • reconstruction MSE
  • purity
  • adjusted Rand index
  • normalized mutual information
  • v-measure
  • assigned-center MSE
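
Purity, for instance, is simple to state (a reference sketch of the standard metric, not clostera's benchmark harness): each cluster is credited with its most common true label, and purity is the fraction of points covered by those majorities.

```python
import numpy as np

def purity(true_labels, pred_labels):
    """Fraction of points whose cluster's majority true label matches their own."""
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()  # count of the cluster's majority label
    return total / len(true_labels)

true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])  # one point of class 0 mis-clustered
print(purity(true, pred))  # 5 of 6 points sit under their cluster's majority label
```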

Across the suite:

  • clostera-fastest improves encode throughput over the original implementation by 25.35x to 32.72x.
  • clostera-quality reduces reconstruction error by 2.40x to 3.74x relative to clostera-fastest.
  • on end-to-end pipeline time, clostera-quality is faster than the original implementation on every committed 10M-vector suite scenario.
  • the original implementation is slower and has visibly worse clustering quality on every committed scenario.

Reconstruction error across deterministic datasets

Clustering purity across deterministic datasets

🍏 Apple Silicon is a first-class target

Modern ARM machines are not a side quest. clostera treats them like real production hardware.

  • aarch64 uses native NEON distance kernels for the common PQ subvector sizes 4, 8, 16, 32, and 64.
  • The PQ assignment path is no longer “build a buffer and scan it later”. It now uses a fused lookup-accumulate-and-select kernel plus SIMD-backed argmin, which matters on Apple Silicon because clustering on PQ codes is often dominated by assignment rather than raw distance evaluation.
  • The release workflow builds macOS arm64 wheels alongside x86_64 wheels.
  • The same wheel matrix also covers manylinux_2_28 x86_64 and aarch64.
  • The release configuration uses openblas-static so published wheels are as self-contained as practical.

If you are running on Apple Silicon, this is not a Rosetta fallback story. There is architecture-specific code in the hot path and packaging support in the release pipeline.
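
The lookup-table trick those kernels fuse is easy to state in plain NumPy (a conceptual sketch, not the NEON kernel itself): precompute, once per query, the squared distance from each query chunk to every codeword, and the distance to any encoded vector then costs only M table lookups.

```python
import numpy as np

rng = np.random.default_rng(1)
M, Ks, sub_dim = 8, 256, 16                             # hypothetical PQ geometry
codebooks = rng.standard_normal((M, Ks, sub_dim)).astype(np.float32)
codes = rng.integers(0, Ks, size=(1000, M))             # encoded database
query = rng.standard_normal(M * sub_dim).astype(np.float32)

# One table per subspace: squared distance from the query chunk to each codeword.
q = query.reshape(M, sub_dim)
lut = ((codebooks - q[:, None, :]) ** 2).sum(axis=2)    # shape (M, Ks)

# Asymmetric distance to every code = sum of M lookups, no per-vector float math.
dists = lut[np.arange(M), codes].sum(axis=1)            # shape (1000,)

# Cross-check one vector against decoding it and measuring directly.
decoded0 = np.concatenate([codebooks[m, codes[0, m]] for m in range(M)])
print(np.allclose(dists[0], ((decoded0 - query) ** 2).sum(), rtol=1e-3))
```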

🔧 Under the hood: better initialization, less luck

One of the quietest but most important differences from the original repository is that clostera treats initialization like a real engineering problem instead of a footnote.

  • PQEncoder uses deterministic PCA-quantile initialization per subspace, rather than hoping random point picks land in a good configuration.
  • PQKMeans uses deterministic farthest-first seeding in PQ code space for better initial coverage.
  • The default quality path refines an orthogonal rotation before final PQ training, which is where the large OPQ quality gains come from on correlated high-dimensional data.

That shows up as more stable training, fewer pathological runs, and better quality at the same code budget. The headline speedups are not coming from luckier random seeds.
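
Farthest-first seeding itself is easy to picture (a toy float-space sketch; clostera runs the equivalent deterministically in PQ code space): start from one point, then repeatedly pick the point farthest from all centers chosen so far.

```python
import numpy as np

def farthest_first(points, k, first=0):
    """Greedy farthest-first traversal; deterministic given the first index."""
    centers = [first]
    # Distance from every point to its nearest chosen center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

pts = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
print(farthest_first(pts, 3))  # picks the extremes before the middle: [0, 4, 2]
```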

Installation

PyPI

pip install clostera

Optional extras:

pip install "clostera[benchmarks]"
pip install "clostera[notebook]"

Build from source

System BLAS/LAPACK build:

python -m pip install maturin
python -m maturin develop --release

Static OpenBLAS build:

python -m maturin develop --release --no-default-features --features openblas-static

More common workflows

Simple workflow

import numpy as np
import clostera

rng = np.random.default_rng(7)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(vectors)

print(clusterer.selected_k_)  # selected K = selected number of clusters

Known number-of-clusters (K) workflow

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Fastest throughput workflow with a known number of clusters (K)

clusterer = clostera.Clusterer(k=known_k, fastest=True)  # known_k = desired number of clusters
labels = clusterer.fit_transform(vectors)

Predict on new vectors

clusterer = clostera.Clusterer(k=known_k)  # known_k = desired number of clusters
clusterer.fit(vectors)
labels = clusterer.transform(vectors[:1024])

Parquet workflow

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform("vectors.parquet")

Out-of-core raw-vector workflow

When the original float vectors do not fit in RAM, pass a parquet path or a numpy.memmap-backed matrix and set max_ram_bytes.

clusterer = clostera.Clusterer(k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(
    "vectors.parquet",
    max_ram_bytes=8 << 30,
)

With max_ram_bytes, clostera keeps the training sample bounded, streams raw vectors in batches during encoding, and automatically spills PQ codes to a temporary memmap when needed. The raw vector matrix no longer needs to fit in RAM all at once. If you already materialized the data as a normal in-memory ndarray, clostera can only bound its own additional working set; for truly out-of-core runs, use parquet or numpy.memmap.
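
Building a memmap-backed matrix for that workflow takes only NumPy; the sketch below prepares one on disk, with the final Clusterer call shown as a comment because it mirrors the parquet example above.

```python
import os
import tempfile

import numpy as np

# Materialize a float32 matrix on disk, then reopen it read-only as a memmap,
# so only the pages actually being read occupy RAM.
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
rows, dim = 100_000, 128
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(rows, dim))
mm[:] = np.random.default_rng(0).standard_normal((rows, dim), dtype=np.float32)
mm.flush()

vectors = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, dim))
# clusterer = clostera.Clusterer(k=None)
# labels = clusterer.fit_transform(vectors, max_ram_bytes=8 << 30)
print(vectors.shape, vectors.dtype)
```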

Advanced API

Most users should start with Clusterer. The lower-level building blocks are still available when you want to:

  • reuse encoded PQ codes across many clustering runs
  • fit encoders and clusterers separately
  • switch explicitly between plain PQ and OPQ
  • tune encoder-specific and clusterer-specific parameters independently

Use Clusterer(fastest=True) when you want the fastest high-level path. Use plain PQEncoder and PQKMeans when you need that same plain-PQ behavior with explicit control. Use OPQEncoder and OPQMeans when reconstruction fidelity matters more and the data has strong cross-subspace correlation.

If you omit num_subquantizers, clostera infers a sensible default from the input dimensionality. For typical embeddings that lands near sqrt(D) code bytes while keeping each subvector wide enough to stay stable.
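
One plausible way to picture that inference (a hypothetical heuristic sketch, not necessarily clostera's exact rule): choose the divisor of D closest to sqrt(D), so every subvector has equal width and the code is about sqrt(D) bytes.

```python
import math

def infer_num_subquantizers(dim: int) -> int:
    """Hypothetical sketch: pick the divisor of `dim` nearest to sqrt(dim)."""
    divisors = [m for m in range(1, dim + 1) if dim % m == 0]
    return min(divisors, key=lambda m: abs(m - math.sqrt(dim)))

print(infer_num_subquantizers(128))   # 8: divisors 8 and 16 straddle sqrt(128) ~ 11.3
```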

encoder = clostera.PQEncoder()
encoder.fit(vectors)
codes = encoder.transform(vectors)

clusterer = clostera.PQKMeans(encoder=encoder, k=None)  # choose the number of clusters (K) automatically
labels = clusterer.fit_transform(codes)

Showcase notebook

The repository includes a walkthrough notebook designed for readers who want the full visual story.

The committed notebook embeds its static figures directly, so the visuals render in GitHub and standalone notebook viewers without depending on external image paths.

It covers:

  • the high-level Clusterer workflow
  • automatic number-of-clusters (K) selection with k=None
  • parquet workflows
  • toy clustering visualization
  • plain PQ versus OPQ reconstruction quality
  • the advanced encoder/clusterer split when you need it
  • cross-dataset benchmark comparisons
  • the large-scale 10M x 2048 checkpoint
  • K (number of clusters) and N scaling sweeps

Parameter reference

In the API tables below, PathLike means a plain path string or a pathlib.Path object.

Clusterer

Clusterer is the default high-level API. It hides the encoder/clusterer split and gives the common workflow a simple fit, transform, fit_transform, fit_predict, and predict surface. By default it uses the quality-first OPQ path; pass fastest=True when you want the maximum-throughput plain-PQ path instead.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables automatic number-of-clusters selection. |
| fastest | bool | False | Turn off OPQ and use the maximum-throughput plain-PQ path. This usually lowers reconstruction quality but can reduce total fit time substantially on large runs. |
| num_subquantizers | int \| None | None | Optional PQ subspace count. When omitted, clostera infers a deterministic default from the input dimensionality. |
| codebook_size | int | 256 | Number of codewords per subspace. |
| iterations | int | 20 | Shared iteration budget for the simple high-level API. |
| seed | int | 0 | Deterministic seed. |
| opq_iterations | int | 3 | OPQ refinement steps used on the default quality-first path. When fastest=True, the current code always uses plain PQ and ignores this setting. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

Clusterer.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Raw float vectors as an array, parquet path, or numpy.memmap-backed matrix. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for bounded-memory raw-vector workflows. |

Advanced access after fitting:

  • encoder_: the fitted PQEncoder or OPQEncoder
  • clusterer_: the fitted PQKMeans or OPQMeans
  • labels_, cluster_centers_, inertia_history_, selected_k_, k_selection_

Advanced low-level API

The classes below expose the encoder/clusterer split directly. Reach for them when you want to reuse PQ codes, separate training phases, or tune encoder-specific and clusterer-specific parameters independently.

PQEncoder

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| num_subquantizers | int \| None | None | Number of PQ subspaces M. When omitted, clostera infers a deterministic default from the input dimensionality. Explicit values still require the dimensionality to be divisible by M. |
| codebook_size | int | 256 | Number of codewords per subspace Ks. Supported range is 2..=256. |
| iterations | int | 20 | Number of Lloyd iterations for subspace k-means training. |
| seed | int | 0 | Deterministic seed used for initialization fallback and reproducible training behavior. |
| opq_iterations | int | 0 | Number of OPQ refinement steps. 0 keeps plain PQ, >0 learns an orthogonal rotation before final PQ training. |

OPQEncoder

OPQEncoder has the same API and runtime methods as PQEncoder, but defaults opq_iterations to 3.

PQEncoder.fit(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Batch size for parquet streaming. |
| train_rows | int \| None | None | Number of deterministic training rows to sample. With in-memory arrays, omitting this uses the full matrix unless max_ram_bytes is set. |
| max_ram_bytes | int \| None | None | Optional RAM budget for the training sample plus OPQ workspace. When set, large parquet or memmap-backed inputs are trained from a bounded deterministic sample. |

PQEncoder.transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Dense vectors or parquet input. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix. |
| max_ram_bytes | int \| None | None | Optional RAM budget for batched encoding. Large raw-vector inputs are processed in chunks; if codes would not fit in RAM, provide output_path or call PQKMeans.fit(...) directly. |

PQEncoder.fit_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | A dense float32 matrix or a parquet path. |
| parquet_column | str \| None | None | Specific parquet column to treat as the vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| train_rows | int \| None | None | Number of deterministic training rows to sample before encoding. |
| output_path | PathLike \| None | None | Optional destination for a memory-mapped uint8 code matrix produced by the transform phase. |
| max_ram_bytes | int \| None | None | Optional RAM budget applied to both training and encoding. |

PQEncoder.inverse_transform(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| codes | np.ndarray | required | A 2D PQ code matrix with shape (rows, num_subquantizers). Returns decoded float32 vectors. |
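
Conceptually, decoding just concatenates the selected codeword from each subspace codebook; a toy NumPy sketch of what inverse_transform computes, with hypothetical shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
M, Ks, sub_dim = 4, 256, 8          # 4 subspaces, 256 codewords, 8 dims each
codebooks = rng.standard_normal((M, Ks, sub_dim)).astype(np.float32)
codes = rng.integers(0, Ks, size=(10, M), dtype=np.uint8)

# For each row, look up codeword codes[i, m] in subspace m and concatenate.
decoded = np.concatenate([codebooks[m, codes[:, m]] for m in range(M)], axis=1)
print(decoded.shape)  # (10, 32): rows x (M * sub_dim)
```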

PQKMeans

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder | required | Trained encoder that defines the codebooks. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| seed | int | 0 | Deterministic seed for cluster-center initialization. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans

OPQMeans mirrors PQKMeans, but treats OPQ as the default rather than an extra knob. If you do not pass encoder=, it lazily creates and fits an OPQEncoder from the raw vectors or parquet source on first fit(...), fit_predict(...), or fit_transform(...). If you do pass encoder=, the current code requires it to have been trained with opq_iterations > 0.

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| encoder | PQEncoder \| None | None | Optional pre-trained OPQ encoder. If omitted, OPQMeans builds one automatically. |
| num_subquantizers | int \| None | None | Optional encoder-side PQ subspace count when encoder is omitted. |
| codebook_size | int | 256 | Optional encoder-side codebook size when encoder is omitted. |
| encoder_iterations | int | 20 | Encoder training iterations used when encoder is omitted. |
| seed | int | 0 | Deterministic seed shared by the implicit encoder and the clusterer. |
| opq_iterations | int | 3 | OPQ refinement steps used by the implicit encoder. |
| k | int \| None | None | Number of target clusters. Here K means the number of clusters. None enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
| iterations | int | 20 | Number of clustering update rounds. |
| verbose | bool | False | Emit inertia diagnostics during fitting. |
| lookup_table_bytes | int | 1 << 30 | Memory budget for code-domain lookup tables. Larger budgets favor faster assignment. |
| auto_k_method | str | "centroid_silhouette" | Automatic-number-of-clusters (K) scoring rule. Supported values are "centroid_silhouette", "davies_bouldin", "elbow", and "bic". |
| auto_k_candidates | list[int] \| tuple[int, ...] \| np.ndarray \| None | None | Explicit candidate K values (candidate cluster counts) to test when k=None. If omitted, clostera builds a default candidate template automatically, including practical values such as 4, 6, 8, 12, 16, 24, and 32 when the dataset size supports them. |
| auto_k_min | int | 2 | Lower bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_max | int \| None | None | Upper bound for automatically generated candidate values when auto_k_candidates is omitted. |
| auto_k_step | int \| None | None | Optional arithmetic step for generated candidates. If omitted, clostera uses a baked-in candidate template. |
| auto_k_sample_rows | int | 16_384 | Number of PQ codes sampled for the Rust-side candidate analysis pass. |

OPQMeans uses the same runtime method signatures as PQKMeans: fit(...), transform(...), fit_transform(...), fit_predict(...), and predict(...).

PQKMeans.fit(...), transform(...), fit_transform(...), fit_predict(...), predict(...)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| data | np.ndarray \| PathLike | required | Either raw vectors or precomputed PQ codes. |
| parquet_column | str \| None | None | Specific parquet vector column. |
| batch_size | int | 65_536 | Parquet streaming batch size. |
| codes_output_path | PathLike \| None | None | Optional memmap destination when raw parquet input must be encoded first. |
| max_ram_bytes | int \| None | None | Optional RAM budget for encoding raw vectors into PQ codes before clustering. When set and no codes_output_path is supplied, clostera creates a temporary memmap automatically. |

When k=None, fitting also populates:

  • selected_k_: the final chosen cluster count (K)
  • k_selection_: the full Rust-side selection report, including the tested candidate values and per-method scores

Advanced runtime knob

| Environment variable | Meaning |
| --- | --- |
| CLOSTERA_ROTATION_BATCH_MIB | Override the default OPQ rotation batch target in MiB for benchmarking or machine-specific tuning. |
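
Since the knob is read from the environment, the safest ordering is to set it before the clostera work starts in the process; a minimal sketch (the clostera import is commented out here):

```python
import os

# Set before clostera runs any OPQ rotation work in this process.
os.environ["CLOSTERA_ROTATION_BATCH_MIB"] = "64"
# import clostera  # subsequent OPQ fits target 64 MiB rotation batches
```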

Reproducing the benchmark artifacts

Generate a deterministic synthetic dataset

python scripts/generate_synthetic_dataset.py \
  --output-dir .artifacts/block-mixed-200k-2048 \
  --distribution block_mixed \
  --rows 200000 \
  --dim 2048 \
  --clusters 64 \
  --seed 11

Compare the original repo and clostera

python scripts/compare_impls.py \
  --dataset-dir .artifacts/block-mixed-200k-2048 \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --train-rows 32768 \
  --metric-sample-rows 32768 \
  --num-subquantizers 64 \
  --codebook-size 64 \
  --pq-iterations 6 \
  --cluster-k 64 \
  --cluster-iterations 4 \
  --opq-iterations 3 \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --output-json .artifacts/block-mixed-200k-2048/compare.json

Run the K (number of clusters) sweep

python scripts/benchmark_k_sweep.py \
  --dataset-dir .artifacts/k-sweep-block-mixed-200k-2048 \
  --output-json benchmarks/results/k-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the N sweep

python scripts/benchmark_n_sweep.py \
  --dataset-dir .artifacts/n-sweep-block-mixed-800k-2048 \
  --output-json benchmarks/results/n-sweep.json \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --force

Run the full deterministic distribution suite

python scripts/benchmark_suite.py \
  --output-dir .artifacts/benchmark-suite \
  --original-python "$(which python)" \
  --enhanced-python "$(which python)" \
  --blas-threads 24 \
  --omp-threads 24 \
  --rayon-threads 24 \
  --rotation-batch-mib 32 \
  --force

Run the automatic number-of-clusters (K) selection sweep

python scripts/evaluate_auto_k_methods.py \
  --output-json benchmarks/results/auto-k-methods.json \
  --force

Render the README and notebook figures

python scripts/render_benchmark_assets.py \
  --suite-json benchmarks/results/benchmark-suite.json \
  --large-json benchmarks/results/large-scale-10m.json \
  --k-sweep-json benchmarks/results/k-sweep.json \
  --n-sweep-json benchmarks/results/n-sweep.json \
  --auto-k-json benchmarks/results/auto-k-methods.json \
  --output-dir docs/assets

Packaging and release

The repository already includes publication artifacts for:

  • manylinux_2_28 wheels for x86_64 and aarch64
  • macOS wheels for x86_64 and arm64
  • CPython 3.10 through 3.13
  • source distributions

Relevant files:

  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • rust-toolchain.toml

The release workflow builds wheels with openblas-static enabled so binary installs are as self-contained as practical.

Releasing to PyPI

The PyPI project name is clostera.

Once the one-time PyPI Trusted Publisher setup is done for:

  • owner: BaseModelAI
  • repository: clostera
  • workflow: .github/workflows/release.yml
  • environment: pypi

the normal release path is:

python scripts/release.py 1.0.1 --commit --tag --push

That updates the version in the release metadata, creates the release commit, creates tag v1.0.1, and pushes both to origin. The tag push triggers the GitHub release workflow, which builds the wheels and publishes them to PyPI.
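The tag-name convention is the load-bearing part of that flow, since the v-prefixed tag is what triggers the release workflow. A minimal sketch of the mapping (`tag_for` is a hypothetical helper for illustration, not code from `scripts/release.py`):

```python
import re

def tag_for(version: str) -> str:
    # Releases are tagged v<version>, e.g. 1.0.1 -> v1.0.1.
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"not a plain X.Y.Z version: {version!r}")
    return f"v{version}"

print(tag_for("1.0.1"))  # → v1.0.1
```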

Original project and related work

Original implementation

Core papers behind this repo

Useful related reading

Verification

Current local verification commands:

python -m maturin develop --release
cargo test --release
pytest -q
cargo check --no-default-features --features openblas-static
cargo bench --bench core_bench

Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide on installing packages.

Source Distribution

  • clostera-1.0.0.tar.gz (5.1 MB; Source)

Built Distributions

If you're not sure about the file name format, see the documentation on wheel file names.

  • clostera-1.0.0-cp313-cp313-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.13, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp313-cp313-manylinux_2_28_aarch64.whl (692.6 kB; CPython 3.13, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp313-cp313-macosx_11_0_arm64.whl (654.6 kB; CPython 3.13, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.12, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp312-cp312-manylinux_2_28_aarch64.whl (693.0 kB; CPython 3.12, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp312-cp312-macosx_11_0_arm64.whl (655.2 kB; CPython 3.12, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp311-cp311-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.11, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp311-cp311-manylinux_2_28_aarch64.whl (696.7 kB; CPython 3.11, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp311-cp311-macosx_11_0_arm64.whl (658.6 kB; CPython 3.11, macOS 11.0+, ARM64)
  • clostera-1.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (1.8 MB; CPython 3.10, manylinux: glibc 2.28+, x86-64)
  • clostera-1.0.0-cp310-cp310-manylinux_2_28_aarch64.whl (694.0 kB; CPython 3.10, manylinux: glibc 2.28+, ARM64)
  • clostera-1.0.0-cp310-cp310-macosx_11_0_arm64.whl (658.5 kB; CPython 3.10, macOS 11.0+, ARM64)

File details

Details for the file clostera-1.0.0.tar.gz.

File metadata

  • Download URL: clostera-1.0.0.tar.gz
  • Size: 5.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clostera-1.0.0.tar.gz

  • SHA256: b4885286169041cc0d3c8309930915398ddcdb9ca13b46d5a4b7c0ceebf9cd39
  • MD5: 84d43fa744c668512c4e02f968780a16
  • BLAKE2b-256: 16220e304b1589874b65ccbe1060f8b5d57537d29304b27ae33f147a053f7b3e

See the PyPI documentation for more details on using hashes.

Provenance

The following attestation bundles were made for clostera-1.0.0.tar.gz:

Publisher: release.yml on BaseModelAI/clostera

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
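Published digests like the ones above can be checked locally with nothing but the standard library. `sha256_hex` is an illustrative helper (not shipped with clostera); the expected value is the SHA256 digest listed for the sdist:

```python
import hashlib

def sha256_hex(path: str) -> str:
    # Stream in 1 MiB chunks so large artifacts never sit fully in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "b4885286169041cc0d3c8309930915398ddcdb9ca13b46d5a4b7c0ceebf9cd39"
# After downloading the sdist into the working directory:
# assert sha256_hex("clostera-1.0.0.tar.gz") == expected
```

pip can perform the same check automatically via hash-checking mode in a requirements file.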

Hashes for clostera-1.0.0-cp313-cp313-manylinux_2_28_x86_64.whl

  • SHA256: 094f27a143aae58309af0812646f9cc7b886258faa22501e8530731d721625e7
  • MD5: 1a7ebb77513c41bd8f8fdc3413d4472d
  • BLAKE2b-256: 5afe22f88b73d1272a700cf912a1407aef64cfd03333fe47062cbc07da6d8dc5

Hashes for clostera-1.0.0-cp313-cp313-manylinux_2_28_aarch64.whl

  • SHA256: ac0dc98b28a97276d18528a705603726d2a8fe497e4fdf251e773ec18266d33c
  • MD5: b92a6cec4d14837c4fcc25dc5b922644
  • BLAKE2b-256: 028f2d7a373a082bdf757a09ff98e28059d4494eb20d12701f3a43cc48860932

Hashes for clostera-1.0.0-cp313-cp313-macosx_11_0_arm64.whl

  • SHA256: 1dc3ff9a261ad15d25dc1df4a46e3dfae30d73ce3e902bf970548abc9138321a
  • MD5: c9e2063974a53ea7dac3917133f3331b
  • BLAKE2b-256: e3859008ea7bf81f557f56b638dfe9a4b22a895128bdf9c066f5e7b51019fbf5

Hashes for clostera-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl

  • SHA256: ff1240ffcd6856a4c8134b10318a69a2e52a6044eb561ac1cbbc20df3a811b15
  • MD5: e719da0160ad53043f20735a2631d51d
  • BLAKE2b-256: 700c8cdfb1aeab684c9111e41ff9ad8a71b32b7b88689194ce9cee28e21d68ee

Hashes for clostera-1.0.0-cp312-cp312-manylinux_2_28_aarch64.whl

  • SHA256: 8b86073ca57aa7011ba7fe19ebd7318d9c7e06876003bdfe6cbdbcf6297ba8bb
  • MD5: 66694fd1624ee2d2b2c0ef5f93e3b7de
  • BLAKE2b-256: d814b4101307c3894d5f7744e3371cebab458c354a96b14e7b066ddee4360b89

Hashes for clostera-1.0.0-cp312-cp312-macosx_11_0_arm64.whl

  • SHA256: 52251121ed33cfc3edd43967574f63f253b4922406b029607de7695108f6119c
  • MD5: fbd570b0c6fa842ea0a7b166fd03cc71
  • BLAKE2b-256: 43bbd8f6d6a8ec08b72f066f6dfa2d13e2e3af9c580d209b8a4d2c818f5d0a9c

Hashes for clostera-1.0.0-cp311-cp311-manylinux_2_28_x86_64.whl

  • SHA256: 4282f75e432827e6facf5cf79a3651f18eab7a2b6caa9fec48cad9aecdb11b02
  • MD5: 1ff76fb7584ee80ac41cbdd724390471
  • BLAKE2b-256: ee57a0e7096e159cc544fe3cf63692be9dfd8758e06e08cf021617ffa8bcd425

Hashes for clostera-1.0.0-cp311-cp311-manylinux_2_28_aarch64.whl

  • SHA256: 2c64c4095577e48aed9d520c29630a4a25d7864f08dadb600a4a65eca3474817
  • MD5: a88f5f48c0053b316c67e1e77a95ab86
  • BLAKE2b-256: f01403b86dc5346fbd2fd20c3301c6a962b5847c7b49fe1cdbfe6d5c8e515e5a

Hashes for clostera-1.0.0-cp311-cp311-macosx_11_0_arm64.whl

  • SHA256: bed2978f75668110fb944137dc181a095c5d2563285def96954790da94d1ad1d
  • MD5: f8c6bff4d5b6fd059c50d596d55c10d0
  • BLAKE2b-256: 7429d498ead088c14614c78690a5a91fe9166cb9fcf8ffccc9e69101c1ec2623

Hashes for clostera-1.0.0-cp310-cp310-manylinux_2_28_x86_64.whl

  • SHA256: 5540393c33950f32d8a28589d99be01ad8de2936d8ebd66a706a13e661e7042d
  • MD5: 59750712c5544baef88cc2b34d09fa05
  • BLAKE2b-256: 71774c4bc3c48a4a468aec4049c7f3be9e7b3e3cd87c6235bc18693810ff28a6

Hashes for clostera-1.0.0-cp310-cp310-manylinux_2_28_aarch64.whl

  • SHA256: 8a5c91b6c521e75be3955c866ee1eb0160f51c97d1c3109605b7eb3f4afbec8a
  • MD5: dafd181e7a9bf9d57c5d784b0d7731a3
  • BLAKE2b-256: 30ebdf9f727a0ed505d419c5675a1c30d3b8e0b18fbfa247b42baed4d7fc42ea

Hashes for clostera-1.0.0-cp310-cp310-macosx_11_0_arm64.whl

  • SHA256: 931f7fcaa348b1f6eadddce32f7c55f7b91ae176f5d1e0b58fc703b22a3215a1
  • MD5: c0979ed4eec0d1b3313ada1e15a40235
  • BLAKE2b-256: d6cd0ac78ffadc3023300234ea59b4dd6a6be41c791b7da25a15461ab7c719dc
