Copenhagen — Dynamic IVF index with quantum-inspired updates and adaptive cluster rebalancing

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

Copenhagen

A fully dynamic approximate nearest neighbor (ANN) index built on Inverted File Index (IVF). Handles continuous inserts and deletes without ever rebuilding.

Standard ANN indexes (FAISS IVF, HNSW) assume a static or slowly-changing dataset. When data drifts — new product SKUs, freshly-published articles, continuous message embeddings — their centroids no longer represent the full data manifold, clusters grow massively imbalanced, and recall collapses for new-data queries. Copenhagen solves this with three mechanisms that compose:

O(1) amortized insert — per-vector cost is constant in n (fixed clusters k, dimension d); stays flat from 5k to 100k+ vectors on real data while HNSW insert cost grows 30× over the same range
O(1) tombstone delete with lazy compaction at search time
Soft multi-cluster assignment (soft_k) for recall near Voronoi boundaries
Adaptive cluster splitting — adds new centroids only where drift has overloaded an existing cell, no offline rebuild

The design draws on the logarithmic-method bucket structure and tombstone correctness argument from arXiv:2604.00271 ("Engineering Fully Dynamic Convex Hulls", IT University of Copenhagen).

When to use Copenhagen

Use Copenhagen when insert throughput and live deletes matter more than explaining why the distribution shifted.

Scenario	Why Copenhagen
E-commerce catalog — 10k new SKUs/hour	0.4 µs/vector insert, never goes offline; adaptive splits recover recall after catalog expansions
News / real-time content index	O(1) tombstone delete (0.3 µs/delete, zero leaked results); FAISS IVF has no native delete primitive
Chat / document embedding cache	`soft_k=2` keeps recall high at Voronoi boundaries; storage overhead is exactly 2× standard IVF
Streaming with gradual drift	Recall stays stable as new-distribution vectors arrive in batches; FAISS degrades monotonically

When NOT to use Copenhagen — if you need to detect where or in what direction the distribution drifted. Copenhagen only tells you a cluster grew too large and splits it. For directional drift signals, use AMPI, which tracks per-cluster subspace drift via Oja sketches.

Quick start

Build from source

cd src && bash build.sh

Requires C++17 and Apple Accelerate (macOS) or OpenBLAS (Linux). Deposits copenhagen.so into python/core/.

Basic usage

import numpy as np
from python.core import CopenhagenIndex

idx = CopenhagenIndex(dim=128, n_clusters=32, nprobe=4, soft_k=2)

# Train + insert in one call — first add() trains on the batch, subsequent adds insert
train = np.random.randn(10_000, 128).astype(np.float32)
idx.add(train)

# Insert more (incremental, no rebuild)
new_vecs = np.random.randn(5_000, 128).astype(np.float32)
idx.add(new_vecs)

# Search
query = np.random.randn(128).astype(np.float32)
ids, dists = idx.search(query, k=10)

# Delete by ID (O(1) tombstone)
idx.delete(ids[0])

# Compact tombstones immediately (also auto-fires at 10% churn)
idx.compact()

# Save / load
idx.save("my_index/")
idx2 = CopenhagenIndex.load("my_index/")

Streaming insert

For large streams that shouldn't be materialised in memory:

def vector_generator():
    for chunk in fetch_from_kafka():
        yield chunk.astype(np.float32)   # shape (dim,) or (n, dim)

first_id, last_id = idx.insert_stream(vector_generator(), chunk_size=1000)

GPU acceleration

# Offloads centroid distance computation via torch.mm (cuBLAS / Metal MPS)
idx = CopenhagenIndex(dim=128, n_clusters=32, device="cuda")   # NVIDIA
idx = CopenhagenIndex(dim=128, n_clusters=32, device="mps")    # Apple M-series

Centroids are pinned on device once at train time; only (n, soft_k) argmin indices transfer back per batch. Search stays on CPU (single-query centroid ranking is too small to benefit from device offload).

Parameters

Parameter	Default	Effect
`n_clusters`	—	Number of IVF clusters (Voronoi cells). Rule of thumb: `sqrt(n_vectors)`
`nprobe`	1	Clusters scanned per query. Higher = better recall, slower search
`soft_k`	1	Clusters each vector is indexed in. `soft_k=2` matches FAISS rebuild recall on drift benchmarks
`split_threshold`	3.0	Split a cluster when `live_size > mean_size × threshold`
`max_split_iters`	10	Mini k-means iterations inside `split_cluster`
`use_pq`	False	Product Quantization for compressed storage and faster approximate search
`use_mmap`	False	Memory-map cluster storage for indexes larger than RAM

split_threshold and soft_k are writable on the index object at any time:

idx._index.split_threshold = 2.5   # more aggressive splits
idx._index.soft_k = 2

Monitoring

# Global stats
stats = idx.get_stats()
# {'n_vectors': 15000, 'n_clusters': 33, 'deleted_count': 12, ...}

# Per-cluster breakdown
for cs in idx.get_cluster_stats():
    print(cs['cluster_id'], cs['live_size'], cs['physical_size'], cs['last_split_round'])
    # last_split_round = -1 for training-time clusters
    # last_split_round = N for clusters created by the Nth insert_batch call

get_cluster_stats() returns a list of dicts with cluster_id, live_size, physical_size, centroid (numpy array), and last_split_round. Useful for surfacing which clusters are overloaded and which have recently split.

Benchmark results

Full results and run instructions: BENCHMARKS.md

Distribution drift — MNIST → Fashion-MNIST (784d, quick mode)

Train on MNIST (20k vectors), insert Fashion-MNIST (10k vectors) without retraining. Fashion vectors land in wrong Voronoi cells; clusters bloat 3–4×.

Method	Fashion recall@10	MNIST recall@10	Insert time
FAISS IVF add-only	0.953	0.973	21 ms
FAISS IVF full rebuild	0.989	0.974	119 ms
AMPI (nlist=212, fans=16, probes=16)	0.997	0.974	7,063 ms
Copenhagen baseline (soft_k=1, no splits)	0.962	0.980	29 ms
Copenhagen soft_k=2 (fixed)	0.989	0.994	65 ms
Copenhagen best (soft_k=2 + splits)	0.979	0.993	168 ms

Copenhagen soft_k=2 matches FAISS full rebuild on fashion recall (0.989 vs 0.989) at 1.8× faster insert (65 ms vs 119 ms), with +2.0pp MNIST recall. Copenhagen best is ~42× faster to insert than AMPI at a cost of 1.8pp fashion recall. AMPI is the right call when drift recall is the primary metric and insert latency is not a constraint.

Gradual streaming drift (500 vectors/batch × 10 batches)

Method	Recall at batch 1	Mid	Final
FAISS add-only	0.946	0.977	0.980
Copenhagen baseline	0.916	0.959	0.969
Copenhagen best (soft_k=2 + splits)	0.972	0.986	0.987

Copenhagen best leads FAISS from the very first batch (+2.6pp) and finishes +0.7pp ahead.

Insert scaling — O(1) vs O(log n) (SIFT-128, d=128)

n	CPH µs/vec	CPH R@10	FAISS IVF µs/vec	IVF R@10	HNSW µs/vec	HNSW R@10
5,000	1.16	0.993	0.79	0.951	280	1.000
25,000	1.12	0.997	0.64	0.972	1,092	1.000
100,000	1.08	0.996	0.67	0.985	8,419	0.997

CPH and FAISS IVF both show flat insert cost (O(1) in n). HNSW grows 30× over the same range. HNSW starts at 240× the CPH insert cost and reaches 7,800× at n=100k. CPH recall (0.993–0.998) exceeds FAISS IVF (0.951–0.985) at the same nprobe on this dataset.

Static recall vs HNSW (SIFT-100k, d=128)

On a static dataset HNSW dominates: at R@10 ≈ 0.97, HNSW M=32 ef=32 runs at 5,803 QPS vs CPH n=32 nprobe=4 at 3,661 QPS. CPH's advantage is insert cost (240–7,800× cheaper per vector at n=5k–100k, see insert scaling above) and O(1) delete — HNSW has neither.

Full recall/QPS table: BENCHMARKS.md §8.

Streaming churn (30% delete/round, n=50k, 10 rounds)

Method	Recall@10	Inserts/s	Deletes/s
HNSW + filter	0.47	—	—
HNSW + rebuild	0.95	~10k	—
Copenhagen	0.93–0.95	~900k	~1M

Full tables in BENCHMARKS.md.

Tombstone delete

0.3 µs per delete, zero leaked results. FAISS IVF has no native delete primitive; equivalent via rebuild costs ~11 ms for 10k vectors.

Repository layout

src/                  C++ extension (dynamic_ivf.cpp, build.sh)
python/core/          Python wrapper (CopenhagenIndex), __init__.py, copenhagen.so
benchmarks/           Benchmark scripts:
  benchmark_drift.py              MNIST → Fashion drift (one-shot + --per-cluster)
  benchmark_drift_streaming.py    Gradual streaming drift (batches over time)
  benchmark_hnsw_churn.py         30% churn vs HNSW
  benchmark_ivf_churn.py          30% churn vs FAISS IVF
  benchmark_vs_faiss.py           Static recall and throughput
tests/                smoke_test.py, stress_test.py, test_gpu.py, bench_gpu.py, bench_search.py
                      GPU and FAISS tests must run separately: pytest -m gpu / pytest -m faiss
data/                 Dataset storage (MNIST, Fashion-MNIST, SIFT)
results/              Benchmark output (JSON)

Credits

arXiv:2604.00271 — van der Hoog, Reinstädtler, Rotenberg (IT University of Copenhagen). The logarithmic-method bucket structure, quarter-full invariant, and tombstone correctness argument for ranked queries directly informed the Copenhagen design.
FAISS (Facebook AI Research) — IVF baseline for all benchmarks.
AMPI (github.com/sashakolpakov/ampi) — Adaptive multi-projection index with per-cluster Oja drift detection. Recommended when directional drift signals are needed rather than pure insert throughput; used as comparison baseline in drift benchmarks.

Project details

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Release history Release notifications | RSS feed

This version

0.1.0

Apr 9, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

copenhagen-0.1.0.tar.gz (29.4 kB view details)

Uploaded Apr 9, 2026 Source

File details

Details for the file copenhagen-0.1.0.tar.gz.

File metadata

Download URL: copenhagen-0.1.0.tar.gz
Upload date: Apr 9, 2026
Size: 29.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for copenhagen-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2e5a470c6af9faeea17bb093bd0b82fc195bc8296aa1995a8dd55eacc9095d9c`
MD5	`3dcc49ce0b4039d02ae1d01b81a9b5db`
BLAKE2b-256	`7b0241ca5c4e7846f74bf149ed545fdb3e0e74e95bff07259e30468d50c61d4c`

See more details on using hashes here.

copenhagen 0.1.0

Navigation

Verified details

Owner

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

Copenhagen

When to use Copenhagen

Quick start

Build from source

Basic usage

Streaming insert

GPU acceleration

Parameters

Monitoring

Benchmark results

Distribution drift — MNIST → Fashion-MNIST (784d, quick mode)

Gradual streaming drift (500 vectors/batch × 10 batches)

Insert scaling — O(1) vs O(log n) (SIFT-128, d=128)

Static recall vs HNSW (SIFT-100k, d=128)

Streaming churn (30% delete/round, n=50k, 10 rounds)

Tombstone delete

Repository layout

Credits

Project details

Verified details

Owner

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes