LSHRS


Redis-backed locality-sensitive hashing toolkit that stores bucket membership in Redis while keeping the heavy vector payloads in your primary datastore.

Overview

LSHRS orchestrates the full locality-sensitive hashing (LSH) workflow:

  1. Hash incoming vectors into stable banded signatures via random projections.
  2. Store only bucket membership in Redis for low-latency candidate enumeration.
  3. Optionally rerank candidates using cosine similarity with vectors fetched from your system of record.

The out-of-the-box configuration chooses bands/rows automatically, pipelines Redis operations, and exposes hooks for streaming data ingestion, persistence, and operational maintenance.
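The signature step can be sketched in plain NumPy (a minimal illustration of banded random-projection hashing; LSHHasher's actual implementation may differ):

```python
import numpy as np

def banded_signature(vec, projections, num_bands):
    """Hash a vector into per-band bucket keys via random hyperplanes.

    `projections` is a (num_perm, dim) matrix; the sign of each projection
    contributes one bit, and the bits are split into `num_bands` bands.
    """
    bits = (projections @ vec > 0).astype(np.uint8)  # num_perm sign bits
    bands = bits.reshape(num_bands, -1)              # num_bands x rows_per_band
    # Each band's bit pattern becomes a hashable bucket key.
    return [hash(band.tobytes()) for band in bands]

rng = np.random.default_rng(0)
proj = rng.standard_normal((16, 8))  # 16 hyperplanes for dim-8 vectors
v = rng.standard_normal(8)
sig = banded_signature(v, proj, num_bands=4)
# A near-duplicate vector will usually collide with `sig` in most bands:
sig_close = banded_signature(v + 1e-6, proj, num_bands=4)
```

Two vectors are candidates for each other whenever any one of their band keys matches.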

Architecture Snapshot

| Concern | Component | Description |
| --- | --- | --- |
| Hashing | LSHHasher | Generates banded random-projection signatures. |
| Storage | RedisStorage | Persists bucket membership using Redis sets and pipelines for batch writes. |
| Ingestion | LSHRS.create_signatures() | Streams vectors from PostgreSQL or Parquet via pluggable loaders. |
| Reranking | top_k_cosine() | Computes cosine similarity for candidate reranking. |
| Configuration | get_optimal_config() | Picks band/row counts that match a target similarity threshold. |
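As a rough illustration of the storage pattern (the real RedisStorage key layout is internal and may differ), each band signature can become a namespaced Redis set key, and candidate lookup is then a union over the query's band buckets:

```python
# Hypothetical key scheme -- illustrative only, not lshrs's actual layout.
def bucket_key(prefix: str, band: int, band_hash: int) -> str:
    # Mask to 32 bits so keys stay short and stable across platforms.
    return f"{prefix}:band:{band}:{band_hash & 0xFFFFFFFF:08x}"

# A vector whose band signatures hash to these values would be SADD-ed to:
keys = [bucket_key("demo", b, h) for b, h in enumerate([0x1A2B3C4D, 0xDEADBEEF])]
# At query time, candidates come from the SUNION of the query's own band keys.
```

Keeping only ids in these sets is what keeps the Redis footprint small; the full vectors never leave the primary datastore.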

Installation

PyPI

pip install lshrs

Or, with PostgreSQL support:

pip install 'lshrs[postgres]'

Or with Parquet ingestion support:

pip install 'lshrs[parquet]'

From source checkout

git clone https://github.com/mxngjxa/lshrs.git
cd lshrs
uv sync --dev

[!NOTE] The project requires Python >= 3.10 as defined in pyproject.toml.

Optional extras

  • PostgreSQL streaming requires psycopg. Install with pip install 'lshrs[postgres]'.
  • Parquet ingestion requires pyarrow. Install with pip install 'lshrs[parquet]'.

Quick Start

import numpy as np
from lshrs import LSHRS

def fetch_vectors(indices: list[int]) -> np.ndarray:
    # Replace with your vector store retrieval (PostgreSQL, disk, object store, etc.)
    embeddings = np.load("vectors.npy")
    return embeddings[indices]

lsh = LSHRS(
    dim=768,
    num_perm=256,
    redis_host="localhost",
    redis_prefix="demo",
    vector_fetch_fn=fetch_vectors,
)

# Stream index construction from PostgreSQL
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://user:pass@localhost/db",
    table="documents",
    index_column="doc_id",
    vector_column="embedding",
)

# Insert an ad-hoc document
lsh.ingest(42, np.random.randn(768).astype(np.float32))

# Retrieve candidates
query = np.random.randn(768).astype(np.float32)
top10 = lsh.get_top_k(query, topk=10)
reranked = lsh.get_above_p(query, p=0.2)

The code above exercises LSHRS.create_signatures(), LSHRS.ingest(), LSHRS.get_top_k(), and LSHRS.get_above_p().

Ingestion Pipelines

Streaming from PostgreSQL

iter_postgres_vectors() yields (indices, vectors) batches using server-side cursors:

lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:secret@analytics.db/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    batch_size=5_000,
    where_clause="updated_at >= NOW() - INTERVAL '1 day'",
)

[!TIP] Provide a custom connection_factory if you need pooled connections or TLS configuration.

Streaming from Parquet

iter_parquet_vectors() supports memory-friendly batch loads from Parquet files:

for ids, batch in iter_parquet_vectors(
    "captures/2024-01-embeddings.parquet",
    index_column="document_id",
    vector_column="embedding",
    batch_size=8_192,
):
    lsh.index(ids, batch)

[!IMPORTANT] Install pyarrow prior to using the Parquet loader; otherwise iter_parquet_vectors() raises ImportError.

Manual or Buffered Ingestion

Querying Modes

LSHRS.query() provides two complementary retrieval patterns:

| Mode | When to use | Result |
| --- | --- | --- |
| Top-k (top_p=None) | Latency-critical scenarios that only require coarse candidates. | List[int] ordered by band collisions. |
| Top-p (top_p=0.0–1.0) | Precision-sensitive flows that can rerank using original vectors. | List[Tuple[int, float]] of (index, cosine_similarity) pairs. |

[!CAUTION] Reranking requires configuring vector_fetch_fn when instantiating LSHRS; otherwise top-p queries raise RuntimeError.
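The reranking path can be mimicked with a small NumPy sketch; `rerank_cosine` and the toy fetch function below are illustrative stand-ins for the library's reranking step, with `fetch_fn` playing the role of vector_fetch_fn:

```python
import numpy as np

def rerank_cosine(query, candidate_ids, fetch_fn, top_p=0.0):
    """Rerank LSH candidates by cosine similarity, keeping scores >= top_p."""
    vecs = fetch_fn(candidate_ids)                               # (n, dim)
    q = query / np.linalg.norm(query)
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = v @ q
    pairs = [(i, float(s)) for i, s in zip(candidate_ids, sims) if s >= top_p]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Toy in-memory "system of record":
store = {1: np.array([1.0, 0.0]), 2: np.array([0.0, 1.0]), 3: np.array([1.0, 1.0])}
fetch = lambda ids: np.stack([store[i] for i in ids])
result = rerank_cosine(np.array([1.0, 0.0]), [1, 2, 3], fetch, top_p=0.2)
# Candidate 2 (orthogonal to the query) is filtered out by the threshold.
```

This is why top-p queries need vector_fetch_fn: the full vectors must be retrievable to compute exact cosine scores.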

Persistence & Lifecycle

| Operation | Purpose | Reference |
| --- | --- | --- |
| Snapshot configuration | Inspect runtime parameters and Redis namespace. | LSHRS.stats() |
| Flush & clear | Remove all Redis buckets for the configured prefix. | LSHRS.clear() |
| Hard delete members | Remove specific indices across all buckets. | LSHRS.delete() |
| Persist projections | Save configuration and projection matrices to disk. | LSHRS.save_to_disk() |
| Restore projections | Rebuild an instance using saved matrices. | LSHRS.load_from_disk() |

[!WARNING] LSHRS.clear() is irreversible—every key with the configured prefix is deleted. Back up state with LSHRS.save_to_disk() beforehand if you need to rebuild.
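Conceptually, persisting an index boils down to saving the configuration plus the projection matrix; the actual on-disk format is defined by save_to_disk()/load_from_disk(), so the sketch below (save_state/load_state are hypothetical names) only illustrates the idea:

```python
import json
import tempfile
from pathlib import Path
import numpy as np

def save_state(path, config, projections):
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    (path / "config.json").write_text(json.dumps(config))   # runtime parameters
    np.save(path / "projections.npy", projections)          # random hyperplanes

def load_state(path):
    path = Path(path)
    config = json.loads((path / "config.json").read_text())
    projections = np.load(path / "projections.npy")
    return config, projections

with tempfile.TemporaryDirectory() as d:
    proj = np.random.default_rng(1).standard_normal((256, 768))
    save_state(d, {"dim": 768, "num_perm": 256}, proj)
    cfg, restored = load_state(d)
```

The projections must round-trip exactly: an instance rebuilt with different hyperplanes would hash the same vectors into different buckets.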

Performance & Scaling Guidelines

  • Choose sensible hash parameters: get_optimal_config() finds bands/rows that approximate your target similarity threshold. Inspect S-curve behavior with compute_collision_probability().
  • Normalize inputs: Pre-normalize vectors or rely on l2_norm() for consistent cosine scores.
  • Batch ingestion: When indexing large volumes, route operations through LSHRS.index() to let RedisStorage.batch_add() coalesce writes.
  • Monitor bucket sizes: Large buckets indicate low selectivity. Adjust num_perm, num_bands, or the similarity threshold to trade precision vs. recall.
  • Pipeline warmup: Flush outstanding operations before measuring latency or persisting state; LSHRS._flush_buffer() is invoked internally by the indexing paths.
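The band/row trade-off follows the standard LSH S-curve, where s is the per-row collision probability (for random hyperplanes, 1 − θ/π for vectors at angle θ). A pure-Python sketch of the relationship that get_optimal_config() and compute_collision_probability() reason about:

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two items with per-row collision probability s
    share at least one of `bands` bands of `rows` rows each."""
    return 1.0 - (1.0 - s ** rows) ** bands

# With num_perm = 256 split into 32 bands x 8 rows, the curve has a sharp
# threshold near s ~ (1/bands)**(1/rows) ~ 0.65: dissimilar pairs almost
# never collide, similar pairs almost always do.
p_low = collision_probability(0.4, bands=32, rows=8)    # well below threshold
p_high = collision_probability(0.9, bands=32, rows=8)   # well above threshold
```

Raising rows steepens the curve (fewer false positives); raising bands shifts it left (fewer false negatives) at the cost of more Redis keys per item.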

Troubleshooting

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| ImportError: psycopg is required | PostgreSQL loader invoked without the optional dependency. | Install psycopg[binary] or avoid format="postgres". |
| ValueError: Vectors must have shape (n, dim) | Supplied batch dimension mismatched the configured dim. | Ensure all vectors match the dim passed to LSHRS.__init__(). |
| ValueError: Cannot normalize zero vector | Zero-length vectors were passed to cosine scoring utilities. | Filter zero vectors before reranking or normalize upstream. |
| Empty search results | Buckets never flushed to Redis. | Call LSHRS.index() (auto-flushes) or explicitly invoke LSHRS._flush_buffer() before querying. |
| Extremely large buckets | Similarity threshold too low / insufficient hash bits. | Increase num_perm or tweak the target threshold via get_optimal_config(). |

[!TIP] Use Redis SCAN commands (e.g., SCAN 0 MATCH lsh:*) to inspect bucket distribution during tuning.

API Surface Summary

| Area | Description | Primary Entry Point |
| --- | --- | --- |
| Ingestion orchestration | Bulk streaming with source-aware loaders. | LSHRS.create_signatures() |
| Batch ingestion | Hash and store vectors already in memory. | LSHRS.index() |
| Single ingestion | Add or update one vector id on the fly. | LSHRS.ingest() |
| Candidate enumeration | General-purpose search with optional reranking. | LSHRS.query() |
| Hash persistence | Save and restore LSH projection matrices. | LSHRS.save_to_disk() / LSHRS.load_from_disk() |
| Redis maintenance | Prefix-aware key deletion and batch removal. | RedisStorage.clear() / RedisStorage.remove_indices() |
| Probability utilities | Analyze band/row trade-offs and false rates. | compute_collision_probability() / compute_false_rates() |

Development & Testing

  1. Clone and install development dependencies:

    git clone https://github.com/mxngjxa/lshrs.git
    cd lshrs
    uv sync --dev
    
  2. Run the test suite:

    uv run pytest
    
  3. Lint and format check:

    uv run ruff check .
    uv run ruff format --check .
    

[!NOTE] Example snippets in this README are intended to be run under Python >= 3.10 with NumPy >= 1.24 and Redis >= 7 as specified in pyproject.toml.

License

Licensed under the MIT License; see LICENSE for the full terms.
