Johnson–Lindenstrauss Projection Toolkit for dimensionality reduction

License: MIT

JLProj — Johnson–Lindenstrauss Projection Toolkit

JLProj is a Python toolkit for dimensionality reduction based on the Johnson–Lindenstrauss (JL) lemma. It uses random linear projections that approximately preserve pairwise distances between high-dimensional vectors, with probabilistic guarantees on the distortion.

The toolkit provides:

  • CLI for projection, search, and file inspection
  • Python API for batch and single-vector projection
  • FAISS-compatible search in compressed space
  • Matrix serialization for reproducibility

Capabilities

  • JL Projection: Fast and distance-preserving linear projection
  • Search: Find nearest vectors in compressed space (via FAISS)
  • Serialization: Save/load projection matrices
  • CLI: Use as a command-line tool for embedding pipelines
  • Reconstruction: Approximate inverse projection supported (via saved matrix)

Use Cases

  • Reducing the size of vector databases in semantic search systems (RAG pipelines)
  • Enabling fast similarity search under memory constraints
  • Embedding compression for offline applications or edge inference
  • Experimental comparison of PCA vs random projection

Why Johnson–Lindenstrauss?

In many NLP and ML systems, embeddings such as BERT or SentenceTransformer outputs have high dimensionality (e.g., 384 or 768 dimensions). These embeddings are powerful, but storing and searching over millions of such vectors becomes computationally expensive.

The Johnson–Lindenstrauss Lemma (1984) provides a mathematical guarantee that such high-dimensional vectors can be projected into a significantly lower-dimensional space (e.g., 64 or 32 dimensions) without significantly distorting the distances between them.

This is crucial in applications like:

  • approximate nearest neighbor search
  • semantic retrieval (RAG pipelines)
  • memory-constrained vector storage

What the JL Lemma says

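Given a set of n vectors in ℝᵈ and a small distortion level ε in (0, 1), the JL Lemma states that there is a linear map f: ℝᵈ → ℝᵏ such that, for every pair x, y in the set:

(1 - ε) * ||x - y||² ≤ ||f(x) - f(y)||² ≤ (1 + ε) * ||x - y||²

Where:

  • x, y ∈ ℝᵈ are your original vectors
  • f is a linear projection; a suitably scaled random matrix (e.g., Gaussian) satisfies the bound with high probability
  • f(x), f(y) ∈ ℝᵏ live in the lower-dimensional space

And the required dimension k scales as:

k = O(log(n) / ε²)

Here n is the number of vectors. The bound does not depend on the original dimension d.

This means you can project thousands of vectors into a space of dimension 32–128 and keep pairwise distances within a modest average error (about 5–10% mean relative error in the benchmark below), although the worst-case guarantee for small ε calls for a larger k. Unlike PCA, a random projection is data-independent and offers explicit probabilistic guarantees on distance preservation.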
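The O(·) notation hides the constant, so it can help to plug numbers into a concrete bound. The sketch below uses one commonly cited sufficient condition (due to Dasgupta and Gupta), k ≥ 4 ln(n) / (ε²/2 − ε³/3). The function name jl_min_dim is illustrative and is not part of JLProj.

```python
import math


def jl_min_dim(n_points: int, eps: float) -> int:
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if n_points < 2:
        raise ValueError("need at least two points")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    denom = eps ** 2 / 2.0 - eps ** 3 / 3.0
    return math.ceil(4.0 * math.log(n_points) / denom)


# Tabulate the bound for a few dataset sizes and distortion levels.
for n in (1_000, 1_000_000):
    for eps in (0.1, 0.3, 0.5):
        print(f"n={n:>9,}  eps={eps:.1f}  k >= {jl_min_dim(n, eps):,}")
```

This is a worst-case guarantee that holds for every pair at once. Smaller values of k are often acceptable when a bounded average error is enough, which is the regime the benchmark in this README measures.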

Limitations

  • Does not preserve angular similarity (cosine) as well as PCA in some cases
  • Inverse projection is approximate; exact recovery is not possible
  • FAISS search operates on projected vectors, not original ones
  • Requires a projection matrix to be saved for decompression
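To illustrate why inverse projection can only be approximate, here is a minimal NumPy sketch that reconstructs vectors with the Moore-Penrose pseudo-inverse of a Gaussian projection matrix. Whether JLProj's decompress step works exactly this way is an assumption; the matrix R and all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_out, n = 256, 64, 100

# Gaussian projection matrix, scaled so squared norms are preserved in expectation.
R = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_out)

X = rng.standard_normal((n, dim_in))
X_proj = X @ R                       # compress: (n, dim_in) -> (n, dim_out)

# Approximate inverse via the pseudo-inverse of the projection matrix.
R_pinv = np.linalg.pinv(R)           # shape (dim_out, dim_in)
X_rec = X_proj @ R_pinv              # minimum-norm vectors that project to X_proj

# Relative reconstruction error over the whole batch (Frobenius norm).
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")

# Re-projecting the reconstruction reproduces the compressed vectors.
print("re-projection matches:", np.allclose(X_rec @ R, X_proj))
```

The reconstruction keeps only the component of each vector that lies in the subspace spanned by the projection matrix. For unstructured random data like this, most of the original vector is lost, even though the compressed representation is reproduced exactly.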

📈 Distance Preservation

The JL projection preserves pairwise distances with low distortion:

[Figure: Distance and error metrics]

  • Mean relative error: 6.94%
  • Max error: 33.57%
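This page does not state how the two figures above were computed. One plausible definition, sketched here with NumPy, is the mean and maximum of |d_proj − d_orig| / d_orig over all pairs of vectors. The data, the dimensions (768 → 64), and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Euclidean distances for all unordered pairs of rows (i < j)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)          # clip tiny negatives from round-off
    iu = np.triu_indices(X.shape[0], k=1)
    return np.sqrt(d2[iu])


def distortion_stats(X: np.ndarray, X_proj: np.ndarray) -> tuple:
    """Mean and max relative error of pairwise distances (rows must be distinct)."""
    d_orig = pairwise_distances(X)
    d_proj = pairwise_distances(X_proj)
    rel = np.abs(d_proj - d_orig) / d_orig
    return float(rel.mean()), float(rel.max())


rng = np.random.default_rng(42)
X = rng.standard_normal((200, 768))
R = rng.standard_normal((768, 64)) / np.sqrt(64)   # Gaussian JL matrix
mean_err, max_err = distortion_stats(X, X @ R)
print(f"mean relative error: {mean_err:.2%}, max error: {max_err:.2%}")
```

The maximum is taken over every pair, so it grows with the number of vectors even when the mean stays stable.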

🧭 Embedding Visualization

Comparison of projected and original embeddings using PCA and UMAP:

[Figure: Embedding visualization]

🧭 Embedding Structure Comparison

To visually assess the geometric consistency of the JL projection, we compared 2D reductions of the original and projected vectors using PCA and UMAP.

[Figure: Embedding structure]

The structure and distribution of clusters are largely preserved, suggesting that semantic neighborhood relations are maintained under JL compression.

📊 Benchmark: JL vs PCA vs UMAP

This benchmark compares the distance preservation accuracy of Johnson–Lindenstrauss projection (JL), PCA, and UMAP on synthetic high-dimensional vectors.

  • Input vectors were sampled from a standard normal distribution, N(0, 1)
  • Projections were made from 128, 768, and 2048 dimensions down to 32–384 dimensions
  • For each method, we compute the mean relative error of pairwise distances
  • Lower error means better preservation of the original vector space

JL vs PCA and UMAP (mean relative error)

From → To     JL Error   PCA Error   UMAP Error   JL vs PCA   JL vs UMAP
128 → 32      0.1010     0.3517      0.9008       +248.1%     +791.5%
128 → 64      0.0718     0.1606      0.9006       +123.7%     +1154.2%
768 → 32      0.0978     0.5826      0.9606       +495.9%     +882.6%
768 → 64      0.0715     0.4392      0.9607       +514.4%     +1244.0%
768 → 128     0.0500     0.2620      0.9607       +423.7%     +1820.4%
768 → 256     0.0363     0.0880      0.9607       +142.4%     +2547.0%
768 → 384     0.0296     0.0186      0.9607       -37.1%      +3151.0%
2048 → 32     0.0990     0.6481      0.9757       +554.5%     +885.4%
2048 → 64     0.0739     0.5219      0.9758       +606.2%     +1220.4%
2048 → 128    0.0502     0.3523      0.9758       +601.4%     +1842.7%
2048 → 256    0.0345     0.1581      0.9758       +357.6%     +2724.9%
2048 → 384    0.0292     0.0517      0.9757       +77.2%      +3240.9%

The last two columns show how much larger the PCA and UMAP errors are than the JL error. A negative value means that method had the lower error.
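A rough way to reproduce the JL and PCA columns is sketched below with NumPy only. The sample count (300), the seed, and the SVD-based PCA are assumptions, since the benchmark setup is not fully specified here, so the numbers will not match the table exactly. UMAP is left out because it needs an extra dependency.

```python
import numpy as np


def pairwise_dist(X: np.ndarray) -> np.ndarray:
    """Euclidean distance for every unordered pair of rows."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.sqrt(d2[np.triu_indices(len(X), k=1)])


def mean_rel_error(X: np.ndarray, Z: np.ndarray) -> float:
    """Mean relative error of pairwise distances after mapping X to Z."""
    d_x, d_z = pairwise_dist(X), pairwise_dist(Z)
    return float(np.mean(np.abs(d_z - d_x) / d_x))


def jl_project(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Gaussian random projection, scaled by 1/sqrt(k)."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R


def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    """Project centered data onto its top-k principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
X = rng.standard_normal((300, 768))      # sample count is an assumption
results = {}
for k in (32, 64, 128):
    results[k] = (mean_rel_error(X, jl_project(X, k, rng)),
                  mean_rel_error(X, pca_project(X, k)))
    print(f"768 -> {k:>3}: JL {results[k][0]:.4f}  PCA {results[k][1]:.4f}")
```

Isotropic Gaussian data has no dominant directions, which is the setting least favorable to PCA. On real embeddings with strong low-rank structure, the gap may look different.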

Installation

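Install the released package from PyPI:

pip install jlproj

Or, from a local checkout of the source tree, install in editable mode:

pip install -e .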

Install dependencies:

pip install numpy faiss-cpu sentence-transformers scikit-learn

Usage

In Python

import numpy as np
from jlproj.projector import JLProjector

# Load high-dimensional vectors
X = np.load("embeddings.npy")  # shape: (n_samples, dim_in)

# Initialize projector and fit
projector = JLProjector(dim_out=64)
projector.fit(dim_in=X.shape[1])

# Project the full matrix
X_proj = projector.transform(X)
np.save("compressed.npy", X_proj)

# Save projection matrix
projector.save("projection_matrix.npz")

# Later: load the matrix and project a single query vector
projector2 = JLProjector()
projector2.load("projection_matrix.npz")

query = np.random.randn(X.shape[1])
query_proj = projector2.transform_query(query)
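Once vectors and queries share the same projection, nearest-neighbor search can run entirely in the compressed space. The sketch below swaps FAISS for a brute-force NumPy search so that it runs without extra dependencies. It does not use the JLProj API, and the clustered toy data and helper name top_k are assumptions for illustration.

```python
import numpy as np


def top_k(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows of `index` closest to `query` (exact L2 search)."""
    d2 = np.sum((index - query) ** 2, axis=1)
    return np.argsort(d2)[:k]


rng = np.random.default_rng(7)
n, dim_in, dim_out, k = 2000, 384, 64, 5

# Toy data: 20 clusters, so that "neighbors" are meaningful.
centers = rng.standard_normal((20, dim_in))
labels = rng.integers(0, 20, size=n)
X = centers[labels] + 0.3 * rng.standard_normal((n, dim_in))

# One shared projection matrix for both the index and the query.
R = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_out)
X_proj = X @ R

# The query is a slightly perturbed copy of vector 0.
query = X[0] + 0.05 * rng.standard_normal(dim_in)

exact = top_k(X, query, k)             # search in the original space
approx = top_k(X_proj, query @ R, k)   # search in the compressed space

print("nearest (original):  ", exact[0])
print("nearest (compressed):", approx[0])
print("compressed hits in the query's cluster:",
      int(np.sum(labels[approx] == labels[0])), "of", k)
```

The query must be projected with the same matrix as the indexed vectors, which is why the projection matrix has to be saved alongside the compressed data.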

From the Command Line (CLI)

All commands are accessible via:

python -m jlproj.cli <command> [args]

Compress embeddings

python -m jlproj.cli compress embeddings.npy --dim 64 --out compressed.npy --save-matrix

Search nearest neighbors

python -m jlproj.cli search --index compressed.npy --query query.npy --k 5

Inspect file shape

python -m jlproj.cli info compressed.npy
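The same kind of check can be done from Python. This is a small sketch using NumPy's memory-mapped loading so that large files are not read into RAM. The fields reported here are an assumption and may differ from what the info command prints.

```python
import os
import tempfile

import numpy as np


def describe_npy(path: str) -> dict:
    """Return shape, dtype and in-memory size of a .npy file without loading it fully."""
    arr = np.load(path, mmap_mode="r")   # memory-mapped, data stays on disk
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "size_mb": arr.nbytes / 1e6,
    }


# Demo on a temporary file standing in for compressed.npy.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "compressed.npy")
    np.save(demo_path, np.zeros((1000, 64), dtype=np.float32))
    demo_info = describe_npy(demo_path)

print(demo_info)
```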

Decompress using saved projection matrix

python -m jlproj.cli decompress --input compressed.npy --matrix compressed_matrix.npz --out restored.npy

API Overview

  • transform(X) — project a full batch of vectors (e.g. 10,000 × 768 → 10,000 × 64)
  • transform_query(x) — fast projection for a single input (e.g. real-time search)
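To make the relationship between the two methods concrete, here is a hypothetical stand-in class. TinyJLProjector is not the JLProj implementation; it only mirrors the method names above and assumes a seeded Gaussian matrix. It shows that a batch projection and a single-query projection are the same matrix product applied to different shapes.

```python
import numpy as np


class TinyJLProjector:
    """Illustrative stand-in with the same method names as JLProjector."""

    def __init__(self, dim_out: int = 64, seed: int = 0):
        self.dim_out = dim_out
        self.seed = seed
        self.matrix = None

    def fit(self, dim_in: int) -> "TinyJLProjector":
        # A fixed seed makes the matrix reproducible across runs.
        rng = np.random.default_rng(self.seed)
        self.matrix = rng.standard_normal((dim_in, self.dim_out)) / np.sqrt(self.dim_out)
        return self

    def _check_fitted(self) -> None:
        if self.matrix is None:
            raise RuntimeError("call fit() before projecting")

    def transform(self, X) -> np.ndarray:
        """Project a batch: (n_samples, dim_in) -> (n_samples, dim_out)."""
        self._check_fitted()
        X = np.asarray(X, dtype=float)
        if X.ndim != 2:
            raise ValueError("transform expects a 2-D array")
        return X @ self.matrix

    def transform_query(self, x) -> np.ndarray:
        """Project one vector: (dim_in,) -> (dim_out,)."""
        self._check_fitted()
        x = np.asarray(x, dtype=float)
        if x.ndim != 1:
            raise ValueError("transform_query expects a 1-D array")
        return x @ self.matrix


projector = TinyJLProjector(dim_out=64).fit(dim_in=768)
batch = np.random.default_rng(1).standard_normal((10, 768))
batch_proj = projector.transform(batch)
single_proj = projector.transform_query(batch[3])
print(batch_proj.shape, single_proj.shape)
print("row 3 agrees:", np.allclose(batch_proj[3], single_proj))
```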

Tests

This project includes unit tests for the core projection functionality (JLProjector):

  • shape validation (projected vectors have expected shape)
  • distance preservation (relative error stays bounded)
  • serialization / deserialization of projection matrix
  • single-query transformation (for search scenarios)

Run tests with:

pytest

Tests are located in tests/test_projector.py.

Example output:

============================= test session starts ==============================
test_projector.py::test_projection_shape PASSED
test_projector.py::test_distance_preservation PASSED
test_projector.py::test_transform_query_shape PASSED
test_projector.py::test_save_and_load PASSED
============================== 4 passed in 0.61s ===============================

Project Structure

jlproj/
├── projector.py         # JLProjector core class
├── cli.py               # Command-line interface (compress/search/info/decompress)
├── faiss_wrapper.py     # (optional, for future use)
└── __init__.py
