LSHRS
Redis-backed locality-sensitive hashing toolkit that stores bucket membership in Redis while keeping the heavy vector payloads in your primary datastore.
Table of Contents
- Overview
- Architecture Snapshot
- Key Features
- Installation
- Quick Start
- Ingestion Pipelines
- Querying Modes
- Persistence & Lifecycle
- Performance & Scaling Guidelines
- Troubleshooting
- API Surface Summary
- Development & Testing
- License
Overview
LSHRS orchestrates the full locality-sensitive hashing (LSH) workflow:
- Hash incoming vectors into stable banded signatures via random projections.
- Store only bucket membership in Redis for low-latency candidate enumeration.
- Optionally rerank candidates using cosine similarity with vectors fetched from your system of record.
The out-of-the-box configuration chooses bands/rows automatically, pipelines Redis operations, and exposes hooks for streaming data ingestion, persistence, and operational maintenance.
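To make the hashing step concrete, here is a minimal, self-contained sketch of banded random-projection signatures in plain NumPy. This illustrates the technique only; the `banded_signature` helper and the bucket-key format are invented for this example and are not the library's internal implementation.

```python
import numpy as np

def banded_signature(vec, projections, num_bands):
    # Random-projection (SimHash-style) bits: the sign of the dot product
    # of the vector with each random hyperplane.
    bits = (projections @ vec > 0).astype(np.uint8)   # shape: (num_perm,)
    rows = bits.reshape(num_bands, -1)                # split bits into bands
    # Each band's bit pattern becomes one bucket key, e.g. "band:3:0110".
    return [f"band:{b}:{''.join(map(str, row))}" for b, row in enumerate(rows)]

rng = np.random.default_rng(0)
projections = rng.standard_normal((16, 8))   # num_perm=16 hyperplanes for dim=8 vectors
keys = banded_signature(rng.standard_normal(8), projections, num_bands=4)
```

Two vectors collide in a bucket when an entire band of sign bits matches, which is what makes candidate enumeration a cheap set lookup in Redis.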
Architecture Snapshot
| Concern | Component | Description |
|---|---|---|
| Hashing | `LSHHasher` | Generates banded random-projection signatures. |
| Storage | `RedisStorage` | Persists bucket membership using Redis sets and pipelines for batch writes. |
| Ingestion | `LSHRS.create_signatures()` | Streams vectors from PostgreSQL or Parquet via pluggable loaders. |
| Reranking | `top_k_cosine()` | Computes cosine similarity for candidate reranking. |
| Configuration | `get_optimal_config()` | Picks band/row counts that match a target similarity threshold. |
Key Features
- Redis-native buckets: Uses Redis sets for O(1) membership updates and pipelined batch ingestion.
- Progressive indexing: Stream vectors from PostgreSQL (`iter_postgres_vectors()`) or Parquet (`iter_parquet_vectors()`) without exhausting memory.
- Dual retrieval modes: Choose fast top-k collision lookups or cosine-reranked top-p filtering through `LSHRS.query()`.
- Persistable hashing state: Save and reload projection matrices with `LSHRS.save_to_disk()` and `LSHRS.load_from_disk()`.
- Operational safety: Snapshot configuration with `LSHRS.stats()`, clear indices via `LSHRS.clear()`, and surgically delete members using `LSHRS.delete()`.
Installation
PyPI

```shell
pip install lshrs
```

Or, with PostgreSQL streaming support:

```shell
pip install 'lshrs[postgres]'
```

Or with Parquet ingestion support:

```shell
pip install 'lshrs[parquet]'
```
From source checkout

```shell
git clone https://github.com/mxngjxa/lshrs.git
cd lshrs
uv sync --dev
```
> [!NOTE]
> The project requires Python >= 3.10 as defined in `pyproject.toml`.
Optional extras
- PostgreSQL streaming requires `psycopg`. Install with `pip install 'lshrs[postgres]'`.
- Parquet ingestion requires `pyarrow`. Install with `pip install 'lshrs[parquet]'`.
Quick Start
```python
import numpy as np

from lshrs import LSHRS


def fetch_vectors(indices: list[int]) -> np.ndarray:
    # Replace with your vector store retrieval (PostgreSQL, disk, object store, etc.)
    embeddings = np.load("vectors.npy")
    return embeddings[indices]


lsh = LSHRS(
    dim=768,
    num_perm=256,
    redis_host="localhost",
    redis_prefix="demo",
    vector_fetch_fn=fetch_vectors,
)

# Stream index construction from PostgreSQL
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://user:pass@localhost/db",
    table="documents",
    index_column="doc_id",
    vector_column="embedding",
)

# Insert an ad-hoc document
lsh.ingest(42, np.random.randn(768).astype(np.float32))

# Retrieve candidates
query = np.random.randn(768).astype(np.float32)
top10 = lsh.get_top_k(query, topk=10)
reranked = lsh.get_above_p(query, p=0.2)
```
The code above exercises `LSHRS.create_signatures()`, `LSHRS.ingest()`, `LSHRS.get_top_k()`, and `LSHRS.get_above_p()`.
Ingestion Pipelines
Streaming from PostgreSQL
`iter_postgres_vectors()` yields `(indices, vectors)` batches using server-side cursors:

```python
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:secret@analytics.db/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    batch_size=5_000,
    where_clause="updated_at >= NOW() - INTERVAL '1 day'",
)
```
> [!TIP]
> Provide a custom `connection_factory` if you need pooled connections or TLS configuration.
Streaming from Parquet
`iter_parquet_vectors()` supports memory-friendly batch loads from Parquet files:

```python
for ids, batch in iter_parquet_vectors(
    "captures/2024-01-embeddings.parquet",
    index_column="document_id",
    vector_column="embedding",
    batch_size=8_192,
):
    lsh.index(ids, batch)
```
> [!IMPORTANT]
> Install `pyarrow` prior to using the Parquet loader; otherwise `iter_parquet_vectors()` raises `ImportError`.
Manual or Buffered Ingestion
- `LSHRS.index()` ingests vector batches you already hold in memory.
- `LSHRS.ingest()` is ideal for realtime single-document updates.
- Under the hood, `RedisStorage.batch_add()` leverages Redis pipelines for throughput.
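The buffering pattern behind batch ingestion can be sketched against an in-memory stand-in for Redis. With redis-py, the flush loop would instead queue `SADD` calls on a pipeline and call `execute()` once per batch. `BufferedBucketWriter` and its flush threshold are illustrative, not part of the lshrs API.

```python
from collections import defaultdict

class BufferedBucketWriter:
    """Accumulate (bucket_key, member) pairs and flush them in batches,
    mirroring how a Redis pipeline coalesces many SADD round-trips."""

    def __init__(self, flush_size=1000):
        self.flush_size = flush_size
        self.buffer = []
        self.store = defaultdict(set)   # in-memory stand-in for Redis sets

    def add(self, bucket_key, member):
        self.buffer.append((bucket_key, member))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # With redis-py this loop would issue pipe.sadd(key, member)
        # followed by a single pipe.execute().
        for key, member in self.buffer:
            self.store[key].add(member)
        self.buffer.clear()
```

The key point is that nothing is visible in the store until a flush, which is why unflushed buffers can produce empty query results (see Troubleshooting).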
Querying Modes
`LSHRS.query()` provides two complementary retrieval patterns:

| Mode | When to use | Result |
|---|---|---|
| Top-k (`top_p=None`) | Latency-critical scenarios that only require coarse candidates. | Returns `List[int]` ordered by band collisions. |
| Top-p (`top_p=0.0–1.0`) | Precision-sensitive flows that can rerank using original vectors. | Returns `List[Tuple[int, float]]` of `(index, cosine_similarity)` pairs. |
> [!CAUTION]
> Reranking requires configuring `vector_fetch_fn` when instantiating `LSHRS`; otherwise top-p queries raise `RuntimeError`.
Supporting helpers:
- `LSHRS.get_top_k()` wraps `query` for pure top-k retrieval.
- `LSHRS.get_above_p()` wraps `query` with a similarity-mass cutoff.
- Cosine scoring is provided by `cosine_similarity()` and `top_k_cosine()`.
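For intuition, cosine reranking over a candidate set amounts to the following plain-NumPy sketch. `cosine_rerank` is a stand-in for the library's `top_k_cosine()` path, not its actual code, and the toy `fetch_fn` below ignores its argument for brevity.

```python
import numpy as np

def cosine_rerank(query, candidate_ids, fetch_fn, top_k=10):
    # Fetch candidate vectors from the system of record, then score by cosine.
    vectors = fetch_fn(candidate_ids)                              # shape: (n, dim)
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                                                 # cosine per candidate
    order = np.argsort(scores)[::-1][:top_k]
    return [(candidate_ids[i], float(scores[i])) for i in order]

# Toy data: the fake fetch_fn returns all candidate vectors at once.
ids = [10, 20, 30]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
ranked = cosine_rerank(np.array([1.0, 0.0]), ids, lambda _: vecs, top_k=2)
```

This is why zero-length vectors must be filtered before reranking: the row normalization divides by each candidate's L2 norm.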
Persistence & Lifecycle
| Operation | Purpose | Reference |
|---|---|---|
| Snapshot configuration | Inspect runtime parameters and Redis namespace. | `LSHRS.stats()` |
| Flush & clear | Remove all Redis buckets for the configured prefix. | `LSHRS.clear()` |
| Hard delete members | Remove specific indices across all buckets. | `LSHRS.delete()` |
| Persist projections | Save configuration and projection matrices to disk. | `LSHRS.save_to_disk()` |
| Restore projections | Rebuild an instance using saved matrices. | `LSHRS.load_from_disk()` |
> [!WARNING]
> `LSHRS.clear()` is irreversible: every key with the configured prefix is deleted. Back up state with `LSHRS.save_to_disk()` beforehand if you need to rebuild.
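The persistence round-trip can be approximated with NumPy and JSON. The file layout below is an assumption made for illustration; `LSHRS.save_to_disk()` defines the actual on-disk format, and `save_state`/`load_state` are hypothetical helpers.

```python
import json
import os
import tempfile

import numpy as np

def save_state(path, projections, config):
    # Persist the projection matrix and the config side by side
    # (illustrative layout, not the real LSHRS.save_to_disk() format).
    np.save(path + ".npy", projections)
    with open(path + ".json", "w") as f:
        json.dump(config, f)

def load_state(path):
    # Reload both pieces; an index rebuilt from these projections will
    # hash vectors into the same buckets as before.
    projections = np.load(path + ".npy")
    with open(path + ".json") as f:
        config = json.load(f)
    return projections, config
```

The essential invariant is that the same projection matrix must be reused after a restore, otherwise previously stored bucket keys no longer match newly hashed queries.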
Performance & Scaling Guidelines
- Choose sensible hash parameters: `get_optimal_config()` finds bands/rows that approximate your target similarity threshold. Inspect S-curve behavior with `compute_collision_probability()`.
- Normalize inputs: Pre-normalize vectors or rely on `l2_norm()` for consistent cosine scores.
- Batch ingestion: When indexing large volumes, route operations through `LSHRS.index()` to let `RedisStorage.batch_add()` coalesce writes.
- Monitor bucket sizes: Large buckets indicate low selectivity. Adjust `num_perm`, `num_bands`, or the similarity threshold to trade precision vs. recall.
- Pipeline warmup: Flush outstanding operations with `LSHRS._flush_buffer()` (called indirectly) before measuring latency or persisting state.
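The S-curve being tuned here follows the standard LSH banding formula: with `b` bands of `r` rows each, two items whose signatures agree on a fraction `s` of bits collide in at least one band with probability 1 - (1 - s^r)^b. A minimal sketch (not the library's `compute_collision_probability()` itself):

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    # Probability that two items with per-bit agreement s share at least one band:
    # a band matches with probability s**rows, so at least one of the `bands`
    # bands matches with probability 1 - (1 - s**rows)**bands.
    return 1.0 - (1.0 - s ** rows) ** bands

# More rows per band steepens the S-curve: near-duplicates still collide,
# while mid-similarity pairs get filtered out.
curve = [round(collision_probability(s / 10, bands=32, rows=8), 4) for s in range(11)]
```

Increasing `rows` sharpens the cutoff (fewer false positives), while increasing `bands` shifts it left (fewer false negatives), which is exactly the trade-off `get_optimal_config()` balances.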
Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| `ImportError: psycopg is required` | PostgreSQL loader invoked without optional dependency. | Install `psycopg[binary]` or avoid `format="postgres"`. |
| `ValueError: Vectors must have shape (n, dim)` | Supplied batch dimension mismatched the configured `dim`. | Ensure all vectors match the `dim` passed to `LSHRS.__init__()`. |
| `ValueError: Cannot normalize zero vector` | Zero-length vectors were passed to cosine scoring utilities. | Filter zero vectors before reranking or normalize upstream. |
| Empty search results | Buckets never flushed to Redis. | Call `LSHRS.index()` (auto-flushes) or explicitly invoke `LSHRS._flush_buffer()` before querying. |
| Extremely large buckets | Similarity threshold too low / insufficient hash bits. | Increase `num_perm` or tweak the target threshold via `get_optimal_config()`. |
> [!TIP]
> Use Redis `SCAN` commands (e.g., `SCAN 0 MATCH lsh:*`) to inspect bucket distribution during tuning.
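When `ValueError: Cannot normalize zero vector` appears during reranking, a pre-filter along these lines keeps degenerate rows out of the cosine utilities. `drop_zero_vectors` is an illustrative helper, not part of the lshrs API.

```python
import numpy as np

def drop_zero_vectors(ids, vectors, eps=1e-12):
    # Remove rows whose L2 norm is (near) zero before cosine scoring,
    # keeping ids and vectors aligned.
    norms = np.linalg.norm(vectors, axis=1)
    mask = norms > eps
    return [i for i, keep in zip(ids, mask) if keep], vectors[mask]
```

Running this inside your `vector_fetch_fn` ensures reranking never sees an all-zero embedding.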
API Surface Summary
| Area | Description | Primary Entry Point |
|---|---|---|
| Ingestion orchestration | Bulk streaming with source-aware loaders. | `LSHRS.create_signatures()` |
| Batch ingestion | Hash and store vectors already in memory. | `LSHRS.index()` |
| Single ingestion | Add or update one vector id on the fly. | `LSHRS.ingest()` |
| Candidate enumeration | General-purpose search with optional reranking. | `LSHRS.query()` |
| Hash persistence | Save and restore LSH projection matrices. | `LSHRS.save_to_disk()` / `LSHRS.load_from_disk()` |
| Redis maintenance | Prefix-aware key deletion and batch removal. | `RedisStorage.clear()` / `RedisStorage.remove_indices()` |
| Probability utilities | Analyze band/row trade-offs and false rates. | `compute_collision_probability()` / `compute_false_rates()` |
Development & Testing
- Clone and install development dependencies:

  ```shell
  git clone https://github.com/mxngjxa/lshrs.git
  cd lshrs
  uv sync --dev
  ```

- Run the test suite:

  ```shell
  uv run pytest
  ```

- Lint and format check:

  ```shell
  uv run ruff check .
  uv run ruff format --check .
  ```
> [!NOTE]
> Example snippets in this README are intended to be run under Python >= 3.10 with NumPy >= 1.24 and Redis >= 7 as specified in `pyproject.toml`.
License
Licensed under the terms described in the LICENSE file.