Johnson–Lindenstrauss Projection Toolkit for dimensionality reduction

JLProj — Johnson–Lindenstrauss Projection Toolkit

JLProj is a Python toolkit for dimensionality reduction based on random projection and the Johnson–Lindenstrauss (JL) lemma. It approximately preserves pairwise Euclidean distances between high-dimensional vectors, with probabilistic guarantees on the distortion.

The toolkit provides:

CLI for projection, search, and file inspection
Python API for batch and single-vector projection
FAISS-compatible search in compressed space
Matrix serialization for reproducibility

Capabilities

JL Projection: Fast and distance-preserving linear projection
Search: Find nearest vectors in compressed space (via FAISS)
Serialization: Save/load projection matrices
CLI: Use as a command-line tool for embedding pipelines
Reconstruction: Approximate inverse projection supported (via saved matrix)

Use Cases

Reducing the size of vector databases in semantic search systems (RAG pipelines)
Enabling fast similarity search under memory constraints
Embedding compression for offline applications or edge inference
Experimental comparison of PCA vs random projection

Why Johnson–Lindenstrauss?

In many NLP and ML systems, embeddings such as BERT or SentenceTransformer outputs have high dimensionality (e.g., 384 or 768 dimensions). These embeddings are powerful, but storing and searching over millions of such vectors becomes computationally expensive.

The Johnson–Lindenstrauss Lemma (1984) guarantees that a set of high-dimensional vectors can be projected into a much lower-dimensional space (e.g., 64 or 32 dimensions) while changing the distances between them by only a bounded relative amount. How low the dimension can go depends on the number of vectors and on the distortion you are willing to accept.

This is crucial in applications like:

approximate nearest neighbor search
semantic retrieval (RAG pipelines)
memory-constrained vector storage

What the JL Lemma says

Given a distortion level ε in (0, 1) and a set of n vectors, the JL Lemma states that there is a linear map f such that, for every pair x, y in the set:

(1 - ε) * ||x - y||² ≤ ||f(x) - f(y)||² ≤ (1 + ε) * ||x - y||²

Where:

x, y ∈ ℝᵈ are your original vectors
f is a suitably scaled random linear projection (e.g., multiplying by a Gaussian matrix), which satisfies the inequality with high probability
f(x), f(y) ∈ ℝᵏ live in the lower-dimensional space

The required dimension k depends only on the number of vectors n and on ε, not on the original dimension d. It scales as:

k = O(log(n) / ε²)

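This means you can project thousands of vectors into a space of dimension 32–128 and still keep the pairwise geometry approximately intact; the benchmark below reports mean relative errors of roughly 5–10% in that range, with the error growing as k shrinks. Unlike PCA, random projection comes with explicit probabilistic guarantees on distance preservation.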
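To make the scaling concrete, one commonly cited explicit form of the bound (due to Dasgupta and Gupta) is k ≥ 4 ln(n) / (ε²/2 − ε³/3). The sketch below simply evaluates that expression. The function name `jl_min_dim` is hypothetical and not part of JLProj. Because the bound must hold for every pair at once, it is usually far more conservative than the mean error seen in practice.

```python
import math


def jl_min_dim(n_samples: int, eps: float) -> int:
    """Smallest integer k with k >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3).

    Hypothetical helper for illustration; not part of JLProj.
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    denominator = eps**2 / 2.0 - eps**3 / 3.0
    return math.ceil(4.0 * math.log(n_samples) / denominator)


# Tabulate the worst-case bound for a few dataset sizes and distortion levels.
for n in (1_000, 100_000):
    for eps in (0.1, 0.3, 0.5):
        print(f"n={n:>7}  eps={eps:.1f}  k >= {jl_min_dim(n, eps)}")
```

Note that the original dimension d does not appear anywhere in the function: the bound is the same whether the input vectors have 384 or 2048 dimensions.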

Limitations

Does not preserve angular similarity (cosine) as well as PCA in some cases
Inverse projection is approximate; exact recovery is not possible
FAISS search operates on projected vectors, not original ones
Requires a projection matrix to be saved for decompression
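Exact recovery is impossible because a projection from d to k < d dimensions discards d − k directions. One standard way to approximate the inverse is the Moore–Penrose pseudoinverse of the projection matrix. The NumPy sketch below uses that approach on synthetic data. Whether JLProj's decompress step works this way is not stated here, so the matrix `R`, its scaling, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 256, 64, 100

# Assumed projection: Gaussian matrix scaled by 1/sqrt(k) so that squared
# norms are preserved in expectation.
R = rng.standard_normal((d, k)) / np.sqrt(k)

X = rng.standard_normal((n, d))   # original vectors, one per row
X_proj = X @ R                    # compressed vectors, shape (n, k)

# Approximate inverse via the pseudoinverse of R, shape (k, d).
R_pinv = np.linalg.pinv(R)
X_rec = X_proj @ R_pinv           # reconstruction, shape (n, d)

# The reconstruction is consistent with the compressed data...
assert np.allclose(X_rec @ R, X_proj)

# ...but it is not the original: everything outside the k-dimensional
# column space of R has been lost.
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

For unstructured data like this, most of each vector lies in the discarded directions, so the reconstruction error is large. Data that happens to concentrate in few directions may fare better, but that depends on the data rather than on the projection.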

📈 Distance Preservation

The JL projection preserves pairwise distances with low distortion:

Distance and error metrics

Mean relative error: 6.94%
Max error: 33.57%
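The setup behind these two figures is not given here. One plausible definition, sketched below with NumPy, takes the relative error |d_proj − d_orig| / d_orig over all pairs of vectors and reports its mean and maximum. The synthetic data, the dimensions, and the helper name `pairwise_distances` are illustrative assumptions, so the printed numbers should not be expected to match those above.

```python
import numpy as np


def pairwise_distances(A: np.ndarray) -> np.ndarray:
    """Euclidean distances for all unordered pairs (i < j) of rows of A."""
    sq_norms = np.sum(A**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    upper = np.triu_indices(A.shape[0], k=1)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(sq_dists[upper], 0.0))


rng = np.random.default_rng(0)
n, d, k = 200, 768, 64

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # assumed Gaussian projection
X_proj = X @ R

d_orig = pairwise_distances(X)
d_proj = pairwise_distances(X_proj)
rel_err = np.abs(d_proj - d_orig) / d_orig

mean_err = float(rel_err.mean())
max_err = float(rel_err.max())
print(f"mean relative error: {mean_err:.2%}")
print(f"max relative error:  {max_err:.2%}")
```

The maximum is driven by the single worst pair, which is why it can sit several times above the mean.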

🧭 Embedding Visualization

Comparison of projected and original embeddings using PCA and UMAP:

Figure: Embedding visualization

🧭 Embedding Structure Comparison

To visually assess the geometric consistency of the JL projection, we compared 2D reductions of the original and projected vectors using PCA and UMAP.

Figure: Embedding structure

The structure and distribution of clusters are largely preserved, which suggests that semantic neighborhood relations are maintained under JL compression.

📊 Benchmark: JL vs PCA vs UMAP

This benchmark compares how accurately Johnson–Lindenstrauss projection (JL), PCA, and UMAP preserve pairwise distances on synthetic high-dimensional vectors.
• Input vectors were sampled from a standard normal distribution, N(0, 1)
• Vectors were projected from 128, 768, and 2048 dimensions down to between 32 and 384 dimensions
• For each method, we compute the mean relative error of pairwise distances
• Lower error means better preservation of the original vector space

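JL vs PCA and UMAP (mean relative error)

The last two columns show how much larger the PCA or UMAP error is than the JL error, as a percentage of the JL error. A negative value means the other method had the lower error.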

From → To	JL Error	PCA Error	UMAP Error	JL vs PCA	JL vs UMAP
128 → 32	0.1010	0.3517	0.9008	+248.1%	+791.5%
128 → 64	0.0718	0.1606	0.9006	+123.7%	+1154.2%
768 → 32	0.0978	0.5826	0.9606	+495.9%	+882.6%
768 → 64	0.0715	0.4392	0.9607	+514.4%	+1244.0%
768 → 128	0.0500	0.2620	0.9607	+423.7%	+1820.4%
768 → 256	0.0363	0.0880	0.9607	+142.4%	+2547.0%
768 → 384	0.0296	0.0186	0.9607	-37.1%	+3151.0%
2048 → 32	0.0990	0.6481	0.9757	+554.5%	+885.4%
2048 → 64	0.0739	0.5219	0.9758	+606.2%	+1220.4%
2048 → 128	0.0502	0.3523	0.9758	+601.4%	+1842.7%
2048 → 256	0.0345	0.1581	0.9758	+357.6%	+2724.9%
2048 → 384	0.0292	0.0517	0.9757	+77.2%	+3240.9%
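A reduced version of this comparison can be sketched with NumPy alone. The sketch below covers only JL and PCA for a single 128 → 32 setting, computes PCA through an SVD rather than scikit-learn, and leaves UMAP out. The sample size, seed, and helper names are assumptions, so the numbers it prints will differ from the table.

```python
import numpy as np


def pairwise_distances(A: np.ndarray) -> np.ndarray:
    """Euclidean distances for all unordered pairs (i < j) of rows of A."""
    sq_norms = np.sum(A**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    upper = np.triu_indices(A.shape[0], k=1)
    return np.sqrt(np.maximum(sq_dists[upper], 0.0))


def mean_relative_error(X: np.ndarray, Y: np.ndarray) -> float:
    """Mean of |d_Y - d_X| / d_X over all pairs of rows."""
    d_x = pairwise_distances(X)
    d_y = pairwise_distances(Y)
    return float(np.mean(np.abs(d_y - d_x) / d_x))


def jl_project(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Gaussian random projection scaled by 1/sqrt(k)."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R


def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    """Project centered data onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 128))   # synthetic N(0, 1) input

jl_err = mean_relative_error(X, jl_project(X, 32, rng))
pca_err = mean_relative_error(X, pca_project(X, 32))
print(f"128 -> 32  JL error: {jl_err:.4f}  PCA error: {pca_err:.4f}")
```

On isotropic Gaussian input no direction carries much more variance than any other, so PCA has little structure to exploit and its unscaled projection shrinks every distance. Results on real embeddings, which usually do have dominant directions, may look different.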

Installation

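Install the released package from PyPI:

pip install jlproj

Or install from a local clone of the source in editable mode:

pip install -e .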

Install dependencies:

pip install numpy faiss-cpu sentence-transformers scikit-learn

Usage

In Python

import numpy as np
from jlproj.projector import JLProjector

# Load high-dimensional vectors
X = np.load("embeddings.npy")  # shape: (n_samples, dim_in)

# Initialize projector and fit
projector = JLProjector(dim_out=64)
projector.fit(dim_in=X.shape[1])

# Project the full matrix
X_proj = projector.transform(X)
np.save("compressed.npy", X_proj)

# Save projection matrix
projector.save("projection_matrix.npz")

# Later: load the matrix and project a single query vector
projector2 = JLProjector()
projector2.load("projection_matrix.npz")

query = np.random.randn(X.shape[1])
query_proj = projector2.transform_query(query)

From the Command Line (CLI)

All commands are accessible via:

python -m jlproj.cli <command> [args]

Compress embeddings

python -m jlproj.cli compress embeddings.npy --dim 64 --out compressed.npy --save-matrix

Search nearest neighbors

python -m jlproj.cli search --index compressed.npy --query query.npy --k 5
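Conceptually, searching in compressed space means projecting the query with the same matrix as the index and ranking the stored compressed vectors by distance. The brute-force NumPy sketch below illustrates only that idea. It does not use FAISS and is not JLProj's implementation; `knn_search` and the synthetic data are hypothetical.

```python
import numpy as np


def knn_search(index_vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Return (indices, distances) of the k rows closest to query (Euclidean)."""
    dists = np.linalg.norm(index_vectors - query, axis=1)
    k = min(k, dists.shape[0])
    nearest = np.argsort(dists)[:k]
    return nearest, dists[nearest]


rng = np.random.default_rng(0)
n, d, k_dim = 1000, 384, 64

X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k_dim)) / np.sqrt(k_dim)   # assumed projection
X_proj = X @ R                                         # the compressed "index"

# Query: a slightly perturbed copy of stored vector 42, so the true nearest
# neighbor in the original space is known in advance.
query = X[42] + 0.01 * rng.standard_normal(d)
query_proj = query @ R   # the query must go through the same matrix

idx, dist = knn_search(X_proj, query_proj, k=5)
print("nearest indices:", idx)
print("distances:", np.round(dist, 3))
```

The key point is that the query and the index share one projection matrix, which is why the matrix has to be saved alongside the compressed vectors.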

Inspect file shape

python -m jlproj.cli info compressed.npy

Decompress using saved projection matrix

python -m jlproj.cli decompress --input compressed.npy --matrix compressed_matrix.npz --out restored.npy

API Overview

transform(X) — project a full batch of vectors (e.g. 10,000 × 768 → 10,000 × 64)
transform_query(x) — fast projection for a single input (e.g. real-time search)
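The difference between the two methods can be illustrated with a hypothetical stand-in class. `SketchProjector` below is not JLProj's `JLProjector`. It only mirrors the method names used in this README, and its internals (Gaussian matrix, 1/sqrt(k) scaling, `seed` parameter, `.npz` layout with a `matrix` key) are assumptions made for the sketch.

```python
import os
import tempfile

import numpy as np


class SketchProjector:
    """Hypothetical stand-in for illustration; not the JLProj implementation."""

    def __init__(self, dim_out=None, seed=0):
        self.dim_out = dim_out
        self.seed = seed
        self.matrix = None   # shape (dim_in, dim_out) once fitted or loaded

    def fit(self, dim_in):
        if self.dim_out is None:
            raise ValueError("dim_out must be set before calling fit")
        rng = np.random.default_rng(self.seed)
        self.matrix = rng.standard_normal((dim_in, self.dim_out)) / np.sqrt(self.dim_out)
        return self

    def transform(self, X):
        """Batch projection: (n, dim_in) -> (n, dim_out)."""
        X = np.asarray(X)
        if X.ndim != 2:
            raise ValueError("transform expects a 2-D array")
        return X @ self.matrix

    def transform_query(self, x):
        """Single-vector projection: (dim_in,) -> (dim_out,)."""
        x = np.asarray(x)
        if x.ndim != 1:
            raise ValueError("transform_query expects a 1-D array")
        return x @ self.matrix

    def save(self, path):
        np.savez(path, matrix=self.matrix)

    def load(self, path):
        with np.load(path) as data:
            self.matrix = data["matrix"]
        self.dim_out = self.matrix.shape[1]
        return self


proj = SketchProjector(dim_out=64).fit(dim_in=768)
X = np.random.default_rng(1).standard_normal((10, 768))

batch = proj.transform(X)              # shape (10, 64)
single = proj.transform_query(X[0])    # shape (64,)

# Round-trip the matrix through a temporary .npz file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "projection_matrix.npz")
    proj.save(path)
    reloaded = SketchProjector().load(path)

print(batch.shape, single.shape, np.array_equal(reloaded.matrix, proj.matrix))
```

In this sketch both methods apply the same matrix, so projecting a row through `transform_query` gives the same result as the matching row of `transform`. The single-vector method only spares the caller from reshaping a query into a one-row batch.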

Tests

This project includes unit tests for the core projection functionality (JLProjector):

shape validation (projected vectors have expected shape)
distance preservation (relative error stays bounded)
serialization / deserialization of projection matrix
single-query transformation (for search scenarios)

Run tests with:

pytest

Tests are located in tests/test_projector.py.

Example output:

============================= test session starts ==============================
test_projector.py::test_projection_shape PASSED
test_projector.py::test_distance_preservation PASSED
test_projector.py::test_transform_query_shape PASSED
test_projector.py::test_save_and_load PASSED
============================== 4 passed in 0.61s ===============================

Project Structure

jlproj/
├── projector.py         # JLProjector core class
├── cli.py               # Command-line interface (compress/search/info/decompress)
├── faiss_wrapper.py     # (optional for future)
└── __init__.py

Project details

Release history

0.1.8 (this version): May 21, 2025
0.1.7: May 21, 2025
0.1.6: May 21, 2025
0.1.4: May 21, 2025
0.1.1: May 21, 2025
0.1.0: May 21, 2025

Download files

Source Distribution

jlproj-0.1.8.tar.gz (251.2 kB), uploaded May 21, 2025

Built Distribution

jlproj-0.1.8-py3-none-any.whl (8.2 kB), uploaded May 21, 2025, Python 3

File details

Details for the file jlproj-0.1.8.tar.gz.

File metadata

Download URL: jlproj-0.1.8.tar.gz
Upload date: May 21, 2025
Size: 251.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.0

File hashes

Hashes for jlproj-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`707b18d0b997ac88f4cb2ec52eaa273e998ec62f9ac6f44f0f31d72a5883c24d`
MD5	`92737b8f19f12a690bc0c0167247687a`
BLAKE2b-256	`bc774a06e1224f49a180dcb462b631f34cf1e180a5990985ff634f25d3634a91`

File details

Details for the file jlproj-0.1.8-py3-none-any.whl.

File metadata

Download URL: jlproj-0.1.8-py3-none-any.whl
Upload date: May 21, 2025
Size: 8.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.0

File hashes

Hashes for jlproj-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c25c2fda773b483cd8d47ebcf2441bd7d651e2297b6f32c5aa3a6b6ddbea2d7`
MD5	`857281cdc2c0f87ed3b5b603da22d086`
BLAKE2b-256	`1e8bc1a9e33eaf097c6a84b75647ede58dff63fbaf7481a66e962fc10809448c`
