Johnson–Lindenstrauss Projection Toolkit for dimensionality reduction
JLProj — Johnson–Lindenstrauss Projection Toolkit
JLProj is a Python toolkit for dimensionality reduction based on the Johnson–Lindenstrauss (JL) lemma. It uses random linear projections that approximately preserve pairwise distances between high-dimensional vectors, with probabilistic guarantees on the distortion.
The toolkit provides:
- CLI for projection, search, and file inspection
- Python API for batch and single-vector projection
- FAISS-compatible search in compressed space
- Matrix serialization for reproducibility
Capabilities
- JL Projection: Fast and distance-preserving linear projection
- Search: Find nearest vectors in compressed space (via FAISS)
- Serialization: Save/load projection matrices
- CLI: Use as a command-line tool for embedding pipelines
- Reconstruction: Approximate inverse projection supported (via saved matrix)
Use Cases
- Reducing the size of vector databases in semantic search systems (RAG pipelines)
- Enabling fast similarity search under memory constraints
- Embedding compression for offline applications or edge inference
- Experimental comparison of PCA vs random projection
Why Johnson–Lindenstrauss?
In many NLP and ML systems, embeddings such as BERT or SentenceTransformer outputs have high dimensionality (e.g., 384 or 768 dimensions). These embeddings are powerful, but storing and searching over millions of such vectors becomes computationally expensive.
The Johnson–Lindenstrauss Lemma (1984) guarantees that any finite set of high-dimensional vectors can be projected into a much lower-dimensional space (e.g., 64 or 32 dimensions) while distorting the pairwise distances between them by at most a small, controllable factor.
This is crucial in applications like:
- approximate nearest neighbor search
- semantic retrieval (RAG pipelines)
- memory-constrained vector storage
What the JL Lemma says
Given a distortion level ε in (0, 1) and a set of n vectors, the JL Lemma states that there is a linear map f such that, for every pair x, y in the set:
(1 - ε) * ||x - y||² ≤ ||f(x) - f(y)||² ≤ (1 + ε) * ||x - y||²
Where:
- x, y ∈ ℝᵈ are your original vectors
- f is a random linear projection (e.g., multiplying by a Gaussian matrix)
- f(x), f(y) ∈ ℝᵏ live in the lower-dimensional space
The required dimension k scales with the number of vectors n as:
k = O(log(n) / ε²)
Because k grows only logarithmically with the number of vectors and does not depend on the original dimension d, you can project thousands of vectors into a space of dimension 32–128 and still keep the pairwise geometry with modest distortion. Unlike PCA, this method offers explicit probabilistic guarantees on distance preservation.
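To make the scaling concrete, here is a minimal sketch in plain NumPy. It is not JLProj's implementation. `jl_min_dim` uses one commonly cited explicit form of the bound, k ≥ 4 ln(n) / (ε²/2 − ε³/3), and `gaussian_projection` is a hypothetical helper that multiplies by a scaled Gaussian matrix. Both names and the constants are assumptions made for illustration.

```python
import math
import numpy as np

def jl_min_dim(n_points: int, eps: float) -> int:
    """Smallest integer k with k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return math.ceil(4.0 * math.log(n_points) / (eps**2 / 2.0 - eps**3 / 3.0))

def gaussian_projection(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Project the rows of X from d to k dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Dividing by sqrt(k) keeps squared norms unchanged in expectation.
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

# Worst-case bound for 1000 points at eps = 0.5 (hypothetical example values).
k_bound = jl_min_dim(n_points=1000, eps=0.5)

# Empirical check: ratio of squared norms after and before projection.
X = np.random.default_rng(1).standard_normal((200, 768))
X_proj = gaussian_projection(X, k=64)
ratio = np.sum(X_proj**2, axis=1) / np.sum(X**2, axis=1)
print(k_bound, X_proj.shape, round(float(ratio.mean()), 3))
```

The explicit bound is a worst-case guarantee over all pairs, so it is usually more conservative than the distortion observed on a particular dataset.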
Limitations
- Does not preserve angular similarity (cosine) as well as PCA in some cases
- Inverse projection is approximate; exact recovery is not possible
- FAISS search operates on projected vectors, not original ones
- Requires a projection matrix to be saved for decompression
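The reconstruction limitation can be illustrated with a pseudo-inverse. The sketch below assumes a Gaussian projection matrix `R` and uses `numpy.linalg.pinv` to map compressed vectors back to the original dimension. Whether JLProj's `decompress` command works this way is not stated here, so treat this as one plausible approach rather than the toolkit's method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 32
R = rng.standard_normal((d, k)) / np.sqrt(k)   # hypothetical projection matrix
X = rng.standard_normal((50, d))

X_proj = X @ R                        # compress: (50, 128) -> (50, 32)
X_hat = X_proj @ np.linalg.pinv(R)    # minimum-norm reconstruction: (50, 32) -> (50, 128)

# Re-projecting the reconstruction reproduces the compressed vectors...
reproj_error = np.linalg.norm(X_hat @ R - X_proj)
# ...but the original vectors are not recovered: only the component lying in
# the k-dimensional column space of R survives the round trip.
recon_error = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(f"re-projection error: {reproj_error:.2e}, relative reconstruction error: {recon_error:.3f}")
```

Since d − k directions are discarded by the projection, no choice of inverse can restore them, which is why the saved matrix gives only an approximate decompression.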
📈 Distance Preservation
The JL projection preserves pairwise distances with low distortion:
- Mean relative error: 6.94%
- Max error: 33.57%
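The exact evaluation setup behind these figures is not given here. As a rough sketch of how mean and maximum relative error of pairwise distances might be computed, the following uses plain NumPy on synthetic data; the helper names, sample size, and dimensions are assumptions.

```python
import numpy as np

def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Euclidean distances for all unordered pairs of rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    iu = np.triu_indices(X.shape[0], k=1)       # each pair counted once
    return np.sqrt(np.maximum(d2[iu], 0.0))     # clip tiny negatives from rounding

def relative_errors(X: np.ndarray, X_proj: np.ndarray) -> np.ndarray:
    """|d_proj - d_orig| / d_orig for every pair of rows."""
    d_orig = pairwise_distances(X)
    d_proj = pairwise_distances(X_proj)
    return np.abs(d_proj - d_orig) / d_orig

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 768))                  # synthetic stand-in for embeddings
R = rng.standard_normal((768, 64)) / np.sqrt(64)     # hypothetical Gaussian projection
err = relative_errors(X, X @ R)
print(f"mean relative error: {err.mean():.4f}, max: {err.max():.4f}")
```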
🧭 Embedding Visualization
Comparison of projected and original embeddings using PCA and UMAP.
🧭 Embedding Structure Comparison
To visually assess the geometric consistency of the JL projection, we compared 2D reductions of the original and projected vectors using PCA and UMAP.
The structure and distribution of clusters is largely preserved, confirming that semantic neighborhood relations are maintained under JL compression.
📊 Benchmark: JL vs PCA vs UMAP
This benchmark compares the distance preservation accuracy of Johnson–Lindenstrauss projection (JL), PCA, and UMAP on synthetic high-dimensional vectors.
- Input vectors were sampled from a standard normal distribution, N(0, 1)
- Projections were made from 128, 768, and 2048 input dimensions down to 32–384 output dimensions
- For each method, we compute the mean relative error of pairwise distances
- Lower error means better preservation of the original vector space
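JL vs PCA and UMAP (mean relative error)
The last two columns show how much larger the PCA or UMAP error is relative to the JL error; a negative value means the other method had the lower error.
| From → To | JL error | PCA error | UMAP error | PCA error vs JL | UMAP error vs JL |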
|---|---|---|---|---|---|
| 128 → 32 | 0.1010 | 0.3517 | 0.9008 | +248.1% | +791.5% |
| 128 → 64 | 0.0718 | 0.1606 | 0.9006 | +123.7% | +1154.2% |
| 768 → 32 | 0.0978 | 0.5826 | 0.9606 | +495.9% | +882.6% |
| 768 → 64 | 0.0715 | 0.4392 | 0.9607 | +514.4% | +1244.0% |
| 768 → 128 | 0.0500 | 0.2620 | 0.9607 | +423.7% | +1820.4% |
| 768 → 256 | 0.0363 | 0.0880 | 0.9607 | +142.4% | +2547.0% |
| 768 → 384 | 0.0296 | 0.0186 | 0.9607 | –37.1% | +3151.0% |
| 2048 → 32 | 0.0990 | 0.6481 | 0.9757 | +554.5% | +885.4% |
| 2048 → 64 | 0.0739 | 0.5219 | 0.9758 | +606.2% | +1220.4% |
| 2048 → 128 | 0.0502 | 0.3523 | 0.9758 | +601.4% | +1842.7% |
| 2048 → 256 | 0.0345 | 0.1581 | 0.9758 | +357.6% | +2724.9% |
| 2048 → 384 | 0.0292 | 0.0517 | 0.9757 | +77.2% | +3240.9% |
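A comparison of this kind can be approximated with a short script. The sketch below is not the benchmark code used for the table. It assumes 500 sample vectors, implements PCA with an SVD in NumPy, and leaves UMAP out to avoid an extra dependency. All function names are hypothetical.

```python
import numpy as np

def mean_relative_error(X: np.ndarray, Y: np.ndarray) -> float:
    """Mean relative error of pairwise Euclidean distances, rows of Y vs rows of X."""
    def dists(A):
        sq = np.sum(A**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
        return np.sqrt(np.maximum(d2[np.triu_indices(len(A), k=1)], 0.0))
    d_x, d_y = dists(X), dists(Y)
    return float(np.mean(np.abs(d_y - d_x) / d_x))

def jl_project(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Random Gaussian projection scaled by 1/sqrt(k)."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    """Projection onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 128))   # synthetic N(0, 1) vectors; sample size is assumed
jl_err = mean_relative_error(X, jl_project(X, 32, rng))
pca_err = mean_relative_error(X, pca_project(X, 32))
print(f"128 -> 32: JL {jl_err:.4f}, PCA {pca_err:.4f}")
```

On isotropic Gaussian data the variance is spread almost evenly across directions, so keeping only the top components discards much of every distance. That is the setting in which a random projection tends to compare well against PCA.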
Installation
pip install -e .
Install dependencies:
pip install numpy faiss-cpu sentence-transformers scikit-learn
Usage
In Python
import numpy as np
from jlproj.projector import JLProjector
# Load high-dimensional vectors
X = np.load("embeddings.npy") # shape: (n_samples, dim_in)
# Initialize projector and fit
projector = JLProjector(dim_out=64)
projector.fit(dim_in=X.shape[1])
# Project the full matrix
X_proj = projector.transform(X)
np.save("compressed.npy", X_proj)
# Save projection matrix
projector.save("projection_matrix.npz")
# Later: load the matrix and project a single query vector
projector2 = JLProjector()
projector2.load("projection_matrix.npz")
query = np.random.randn(X.shape[1])
query_proj = projector2.transform_query(query)
From the Command Line (CLI)
All commands are accessible via:
python -m jlproj.cli <command> [args]
Compress embeddings
python -m jlproj.cli compress embeddings.npy --dim 64 --out compressed.npy --save-matrix
Search nearest neighbors
python -m jlproj.cli search --index compressed.npy --query query.npy --k 5
Inspect file shape
python -m jlproj.cli info compressed.npy
Decompress using saved projection matrix
python -m jlproj.cli decompress --input compressed.npy --matrix compressed_matrix.npz --out restored.npy
API Overview
- transform(X): project a full batch of vectors (e.g. 10,000 × 768 → 10,000 × 64)
- transform_query(x): fast projection for a single input (e.g. real-time search)
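The batch and single-query paths fit together in a search workflow roughly as sketched below. This is an illustration in plain NumPy, not JLProj or FAISS code: the matrix `R` stands in for a saved projection matrix, and the brute-force `knn_l2` helper is a hypothetical substitute for a FAISS index.

```python
import numpy as np

def knn_l2(index_vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k rows nearest to `query` by Euclidean distance."""
    d2 = np.sum((index_vectors - query) ** 2, axis=1)
    return np.argsort(d2)[:k]

rng = np.random.default_rng(0)
d, k_dim = 768, 64
X = rng.standard_normal((2000, d))                       # synthetic corpus vectors
R = rng.standard_normal((d, k_dim)) / np.sqrt(k_dim)     # stand-in for the saved matrix

X_proj = X @ R                                  # batch projection, done once offline
query = X[0] + 0.1 * rng.standard_normal(d)     # a query planted close to item 0
query_proj = query @ R                          # single-vector projection at query time

exact = knn_l2(X, query, 5)                     # search in the original space
approx = knn_l2(X_proj, query_proj, 5)          # search in the compressed space
print("top hit (original):", exact[0], "top hit (compressed):", approx[0])
```

The query must be projected with the same matrix as the indexed vectors, which is why the matrix is saved alongside the compressed file. How many of the remaining neighbors agree between the two spaces depends on how well separated the data is.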
Tests
This project includes unit tests for the core projection functionality (JLProjector):
- shape validation (projected vectors have expected shape)
- distance preservation (relative error stays bounded)
- serialization / deserialization of projection matrix
- single-query transformation (for search scenarios)
Run tests with:
pytest
Tests are located in tests/test_projector.py.
Example output:
============================= test session starts ==============================
test_projector.py::test_projection_shape PASSED
test_projector.py::test_distance_preservation PASSED
test_projector.py::test_transform_query_shape PASSED
test_projector.py::test_save_and_load PASSED
============================== 4 passed in 0.61s ===============================
Project Structure
jlproj/
├── projector.py # JLProjector core class
├── cli.py # Command-line interface (compress/search/info/decompress)
├── faiss_wrapper.py # (optional for future)
├── __init__.py