Proxiss: Accelerating nearest-neighbor search for high-dimensional data!
Proxiss: Fast Vector Similarity Search
Proxiss is a high-performance C++ library with Python bindings, designed for fast vector similarity search in high-dimensional data. It provides efficient nearest-neighbor search capabilities for applications like semantic search, recommendation systems, and machine learning, currently optimized for Linux environments.
Key Features
- High Performance: Optimized C++ implementation with OpenMP parallelization for fast k-NN searches
- Multiple Distance Metrics: Supports common distance functions:
- Euclidean (L2)
- Manhattan (L1)
- Cosine Similarity
- Three Search Modes:
- ProxiFlat: Vector-only indexing for pure similarity search
- ProxiKNN: Classification-focused search with label storage
- ProxiPCA: Dimensionality reduction combined with similarity search
- Batched Operations: Efficient batch processing for multiple queries
- Python Integration: Python bindings (built with pybind11) over the C++ core
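For reference, the three metrics listed above have the standard definitions sketched below in NumPy. This is an illustration only, not Proxiss's C++ implementation; the function names are hypothetical, and whether Proxiss ranks by cosine similarity or by a derived cosine distance is not stated here.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def l1_distance(a, b):
    """Manhattan (L1) distance: sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0], dtype=np.float32)
b = np.array([0.0, 1.0], dtype=np.float32)
print(l2_distance(a, b))        # approximately 1.4142
print(l1_distance(a, b))        # 2.0
print(cosine_similarity(a, b))  # 0.0
```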
Why Proxiss?
Vector similarity search is fundamental to many modern applications, but traditional methods can be slow and resource-intensive. Proxiss addresses this by:
- Providing optimized C++ implementations with parallel processing
- Offering clean, simple APIs that hide implementation complexity
- Focusing on core functionality without unnecessary overhead
- Supporting pure vector search, classification, and dimensionality reduction use cases
Installation
Proxiss can be built from source with automatic dependency management, or installed from PyPI: https://pypi.org/project/proxiss/
Prerequisites
- Linux environment (Ubuntu, Debian, CentOS, etc.)
- Python 3.10 or higher
- CMake 3.16 or higher
- UV package manager
Note: The build system automatically installs clang++, OpenMP and pybind11 if not found.
Building from Source
1. Clone the repository:
git clone https://github.com/BiradarSiddhant02/Proxiss.git
cd Proxiss
2. Install UV (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
3. Create a virtual environment and install:
uv venv
source .venv/bin/activate
uv pip install . -v
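After installing, a quick way to confirm that the package is visible to the active interpreter is to look it up without importing it. The helper below uses only the standard library; `is_installed` is a hypothetical name used for illustration.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("proxiss"):
    print("proxiss is importable from this environment")
else:
    print("proxiss not found; check that the virtual environment is active")
```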
Quick Start
ProxiFlat: Vector Similarity Search
from proxiss import ProxiFlat
import numpy as np
# Sample data
embeddings = np.array([
[0.0, 0.0],
[1.0, 1.0],
[2.0, 2.0],
[3.0, 3.0]
], dtype=np.float32)
# Initialize ProxiFlat
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")
# Index your vectors
px.index_data(embeddings)
# Query for nearest neighbors
query = np.array([1.5, 1.5], dtype=np.float32)
indices = px.find_indices(query)
print(f"Nearest neighbor indices: {indices}")
# Batch queries
queries = np.array([[0.5, 0.5], [2.5, 2.5]], dtype=np.float32)
batch_indices = px.find_indices_batched(queries)
print(f"Batch results: {batch_indices}")
# Save and load index
px.save_state("index.bin")
px_loaded = ProxiFlat(k=2, num_threads=2, objective_function="l2")
px_loaded.load_state("index.bin")
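When validating results, it can help to compare against an exhaustive search. The sketch below is a plain NumPy brute-force k-NN (L2 only) and makes no claim about how Proxiss searches internally; `brute_force_knn` is a hypothetical helper, not part of the Proxiss API.

```python
import numpy as np

def brute_force_knn(data, queries, k):
    """Return the indices of the k nearest rows of `data` (L2) for each query row."""
    data = np.asarray(data, dtype=np.float32)
    queries = np.atleast_2d(np.asarray(queries, dtype=np.float32))
    # Pairwise squared distances via broadcasting: shape (n_queries, n_data)
    diffs = queries[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # A stable sort keeps the lower index first when distances tie
    order = np.argsort(sq_dists, axis=1, kind="stable")
    return order[:, :k]

embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], dtype=np.float32)
queries = np.array([[0.5, 0.5], [2.5, 2.5]], dtype=np.float32)
reference = brute_force_knn(embeddings, queries, k=2)
print(reference)
```

Each sample query here is equidistant from its two nearest points, so the order within a result row depends on tie-breaking. When comparing against another index, comparing each row as a set of indices avoids false mismatches.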
ProxiKNN: Classification Search
from proxiss import ProxiKNN
import numpy as np
# Sample data with labels
features = np.array([
[0.0, 0.0], [1.0, 1.0],
[5.0, 5.0], [6.0, 6.0]
], dtype=np.float32)
labels = np.array([0, 0, 1, 1], dtype=np.float32)
# Initialize and train
knn = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn.fit(features, labels)
# Predict
query = np.array([0.5, 0.5], dtype=np.float32)
prediction = knn.predict([query])
print(f"Predicted class: {prediction}")
# Save and load model
knn.save_state("model_dir")
knn_loaded = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn_loaded.load_state("model_dir")
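Conceptually, k-NN classification finds the k closest training points and returns the most common label among them. The pure-Python sketch below shows that idea on the same sample data; it is not ProxiKNN's implementation, and the tie-breaking rule shown is an assumption.

```python
import math
from collections import Counter

def knn_predict(features, labels, query, k):
    """Predict a label by majority vote among the k nearest training points (L2)."""
    # Pair each distance with its row index; sorting orders by distance, then index
    ranked = sorted((math.dist(row, query), idx) for idx, row in enumerate(features))
    votes = Counter(labels[idx] for _, idx in ranked[:k])
    # On a vote tie, most_common returns the label counted first (the nearer neighbor)
    return votes.most_common(1)[0][0]

features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
labels = [0, 0, 1, 1]
print(knn_predict(features, labels, [0.5, 0.5], k=2))  # 0
print(knn_predict(features, labels, [5.5, 5.5], k=2))  # 1
```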
ProxiPCA: Dimensionality Reduction + Search
from proxiss import ProxiPCA
import numpy as np
# High-dimensional sample data (e.g., 768-dimensional embeddings)
embeddings = np.random.randn(1000, 768).astype(np.float32)
# Initialize ProxiPCA with dimensionality reduction
# n_components is a fraction of the original dimensionality: 0.065 keeps 6.5% of the dimensions
# For 768D data: 768 * 0.065 = 49.92, i.e. about 50 dimensions
pca = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)
# Fit PCA, transform data, and index in one step
pca.fit_transform_index(embeddings)
print(f"Original dimensions: {embeddings.shape[1]}")
print(f"Reduced dimensions: {pca.get_n_components()}")
# Query for nearest neighbors (query is automatically transformed)
query = np.random.randn(768).astype(np.float32)
indices = pca.find_indices(query)
print(f"Nearest neighbor indices: {indices}")
# Batch queries
queries = np.random.randn(10, 768).astype(np.float32)
batch_indices = pca.find_indices_batched(queries)
print(f"Batch results shape: {batch_indices.shape}")
# Insert new data (automatically transformed)
new_data = np.random.randn(100, 768).astype(np.float32)
pca.insert_data(new_data)
# Save and load (saves both PCA transformation and index)
pca.save_state("pca_index.bin")
pca_loaded = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)
pca_loaded.load_state("pca_index.bin")
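To make the reduction step concrete, here is a minimal NumPy sketch of PCA by singular value decomposition with a fractional `n_components`. It is not ProxiPCA's implementation: `pca_fit` and `pca_transform` are hypothetical helpers, and rounding the component count to the nearest integer is an assumption (use `get_n_components()` to see what Proxiss actually chose).

```python
import numpy as np

def pca_fit(data, fraction):
    """Return the mean and the top principal directions of `data`."""
    n_features = data.shape[1]
    # Assumption: round to the nearest integer and keep at least one component
    n_components = max(1, round(n_features * fraction))
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(data, mean, components):
    """Center the data and project it onto the principal directions."""
    return (data - mean) @ components.T

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 768)).astype(np.float32)
mean, components = pca_fit(data, 0.065)
reduced = pca_transform(data, mean, components)
print(reduced.shape)  # (200, 50)
```

A query has to go through the same centering and projection before it is compared with the reduced vectors, which is the step ProxiPCA describes as automatic.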
Benchmarking
Proxiss includes benchmarking scripts to evaluate performance.
1. Generate Test Data
Create synthetic datasets for benchmarking:
python scripts/make_data.py --N 10000 --D 128 --X_path scripts/X.npy
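If you want a comparable dataset without the script, the sketch below writes an N x D float32 array to a .npy file. It assumes standard-normal values and a fixed seed; what scripts/make_data.py actually generates may differ, and `make_data` here is a hypothetical stand-in.

```python
import os
import tempfile
import numpy as np

def make_data(n, d, path, seed=0):
    """Write an (n, d) float32 array of standard-normal samples to `path`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d)).astype(np.float32)
    np.save(path, x)
    return x

# Write to a temporary directory so the example leaves the repository untouched
path = os.path.join(tempfile.mkdtemp(), "X.npy")
make_data(1000, 128, path)
loaded = np.load(path)
print(loaded.shape, loaded.dtype)  # (1000, 128) float32
```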
2. Benchmark ProxiFlat
Test vector similarity search performance:
python scripts/bench_proxiss_flat.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2
3. Benchmark ProxiKNN
Test classification performance:
python scripts/bench_proxiss_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2
4. Benchmark ProxiPCA
Test dimensionality reduction + similarity search performance:
# -c flag specifies n_components as a fraction of the original dimensions (0.0-1.0)
# Example: -c 0.065 keeps 6.5% of the original dimensions
python scripts/bench_proxiss_pca.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2 -c 0.065
5. Compare with FAISS
Install FAISS and compare performance:
uv pip install faiss-cpu
python scripts/bench_faiss.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2
6. Compare with scikit-learn
Install scikit-learn and compare KNN classification performance:
uv pip install scikit-learn
python scripts/bench_sklearn_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2
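For ad hoc comparisons outside these scripts, a small timing helper is often enough. The sketch below is not taken from the benchmark scripts; `time_queries` is a hypothetical helper that accepts any search callable, and `dummy_search` is a stand-in you would replace with a real call such as an index's single-query method.

```python
import time

def time_queries(search_fn, queries, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` passes through all queries."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q)
        best = min(best, time.perf_counter() - start)
    return best

def dummy_search(q):
    """Stand-in for a real search call: returns the two smallest values."""
    return sorted(q)[:2]

queries = [[3, 1, 2], [9, 8, 7]]
elapsed = time_queries(dummy_search, queries)
# Guard against a zero reading on coarse timers
qps = len(queries) / elapsed if elapsed > 0 else float("inf")
print(f"{qps:.0f} queries/s")
```

Taking the best of several runs reduces noise from other processes; use the same thread count and the same k for every library being compared.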
Example Usage
Interactive Inference
The examples/inference.py script demonstrates similarity search on real embeddings:
python examples/inference.py --embeddings examples/embeddings.npy --words examples/words.npy -k 5
This script loads pre-computed embeddings and allows interactive similarity search.
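The core of such a lookup is mapping neighbor indices back to words. The sketch below uses made-up toy arrays in place of examples/embeddings.npy and examples/words.npy and a brute-force search in place of Proxiss; it assumes row i of the embeddings belongs to word i, which may not match the script exactly.

```python
import numpy as np

# Toy stand-ins for the word and embedding files (values are made up)
words = np.array(["cat", "dog", "car", "truck"])
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]], dtype=np.float32)

def nearest_words(word, k):
    """Return the k words whose embeddings are closest (L2) to `word`, excluding itself."""
    matches = np.flatnonzero(words == word)
    if matches.size == 0:
        raise KeyError(f"unknown word: {word}")
    query_idx = int(matches[0])
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    # Sort by distance and drop the query word itself
    order = [int(i) for i in np.argsort(dists, kind="stable") if i != query_idx]
    return words[order[:k]].tolist()

print(nearest_words("cat", 2))  # ['dog', 'truck']
```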
Development
Project Structure
- Core C++ Implementation:
  - src/proxi_flat.cc, include/proxi_flat.h - Vector similarity search
  - src/proxi_knn.cc, include/proxi_knn.h - KNN classification
  - src/pca.cc, include/pca.h - PCA dimensionality reduction
  - src/proxi_pca.cc, include/proxi_pca.h - PCA + similarity search wrapper
  - src/priority_queue.cc, include/priority_queue.h - Custom priority queue
  - include/distance.hpp - Distance function implementations
- Python Bindings:
  - bindings/proxi_flat_binding.cc - ProxiFlat Python interface
  - bindings/proxi_knn_binding.cc - ProxiKNN Python interface
  - bindings/proxi_pca_binding.cc - ProxiPCA Python interface
  - proxiss/ProxiFlat.py - Python wrapper for ProxiFlat
  - proxiss/ProxiKNN.py - Python wrapper for ProxiKNN
  - proxiss/ProxiPCA.py - Python wrapper for ProxiPCA
- Build System:
  - CMakeLists.txt - C++ build configuration with automatic dependencies
  - pyproject.toml - Python package configuration
Running Tests
# Install test dependencies
uv pip install pytest
# Run all tests
python -m pytest tests/ -v
# Run specific tests
python -m pytest tests/test_proxi_flat.py -v
python -m pytest tests/test_proxi_knn.py -v
python -m pytest tests/test_proxi_pca.py -v
Building for Development
# Set up development environment
uv venv
source .venv/bin/activate
# Install development dependencies
uv pip install -r requirements.txt
# Reinstall after C++ changes
uv pip install -e . --force-reinstall --no-deps
API Reference
ProxiFlat Methods
- __init__(k, num_threads, objective_function) - Initialize index
- index_data(embeddings) - Index vector data
- find_indices(query) - Find nearest neighbor indices
- find_indices_batched(queries) - Batch query processing
- save_state(filepath) - Save index to file
- load_state(filepath) - Load index from file
ProxiKNN Methods
- __init__(n_neighbours, n_jobs, distance_function) - Initialize classifier
- fit(features, labels) - Train on labeled data
- predict(features) - Predict class labels
- save_state(directory) - Save model to directory
- load_state(directory) - Load model from directory
ProxiPCA Methods
- __init__(k, num_threads, objective_function, n_components) - Initialize with PCA reduction
- fit_transform_index(embeddings) - Fit PCA, transform data, and index
- find_indices(query) - Find nearest neighbors (query auto-transformed)
- find_indices_batched(queries) - Batch query processing
- insert_data(embeddings) - Insert new data (auto-transformed)
- get_n_components() - Get actual number of PCA components used
- get_components() - Get PCA component vectors
- get_mean() - Get PCA mean vector
- get_explained_variance() - Get variance explained by each component
- save_state(filepath) - Save PCA transformation and index
- load_state(filepath) - Load PCA transformation and index
License
Proxiss is licensed under the Apache License, Version 2.0. See LICENSE.txt for details.
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.