Proxiss: Accelerating nearest-neighbor search for high-dimensional data!

Proxiss: Fast Vector Similarity Search

Proxiss is a high-performance C++ library with Python bindings, designed for fast vector similarity search in high-dimensional data. It provides efficient nearest-neighbor search capabilities for applications like semantic search, recommendation systems, and machine learning, currently optimized for Linux environments.

Key Features

  • High Performance: Optimized C++ implementation with OpenMP parallelization for fast k-NN searches
  • Multiple Distance Metrics: Supports common distance functions:
    • Euclidean (L2)
    • Manhattan (L1)
    • Cosine Similarity
  • Three Search Modes:
    • ProxiFlat: Vector-only indexing for pure similarity search
    • ProxiKNN: Classification-focused search with label storage
    • ProxiPCA: Dimensionality reduction combined with similarity search
  • Batched Operations: Efficient batch processing for multiple queries
  • Python Integration: Python bindings over the C++ core
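
For readers who want the three metrics listed above spelled out, here is a minimal pure-Python sketch of their textbook definitions. It is an illustration only, not Proxiss's C++ implementation, and the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    """Manhattan (L1) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a, b = [0.0, 3.0], [4.0, 0.0]
print(l2_distance(a, b))        # 5.0
print(l1_distance(a, b))        # 7.0
print(cosine_similarity(a, b))  # 0.0 (the vectors are orthogonal)
```

Note that L1 and L2 are distances (smaller means closer), while cosine similarity grows as vectors align, so a search over it ranks in the opposite direction.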

Why Proxiss?

Vector similarity search is fundamental to many modern applications, but traditional methods can be slow and resource-intensive. Proxiss addresses this by:

  • Providing optimized C++ implementations with parallel processing
  • Offering clean, simple APIs that hide implementation complexity
  • Focusing on core functionality without unnecessary overhead
  • Supporting pure vector search, classification, and dimensionality reduction use cases

Installation

Proxiss can be installed from PyPI (https://pypi.org/project/proxiss/) or built from source with automatic dependency management.

Prerequisites

  • Linux environment (Ubuntu, Debian, CentOS, etc.)
  • Python 3.10 or higher
  • CMake 3.16 or higher
  • UV package manager

Note: The build system automatically installs clang++, OpenMP and pybind11 if not found.

Building from Source

  1. Clone the repository:

    git clone https://github.com/BiradarSiddhant02/Proxiss.git
    cd Proxiss
    
  2. Install UV (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  3. Create virtual environment and install:

    uv venv
    source .venv/bin/activate
    uv pip install . -v
    

Quick Start

ProxiFlat: Vector Similarity Search

from proxiss import ProxiFlat
import numpy as np

# Sample data
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0], 
    [2.0, 2.0],
    [3.0, 3.0]
], dtype=np.float32)

# Initialize ProxiFlat
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")

# Index your vectors
px.index_data(embeddings)

# Query for nearest neighbors
query = np.array([1.5, 1.5], dtype=np.float32)
indices = px.find_indices(query)
print(f"Nearest neighbor indices: {indices}")

# Batch queries
queries = np.array([[0.5, 0.5], [2.5, 2.5]], dtype=np.float32)
batch_indices = px.find_indices_batched(queries)
print(f"Batch results: {batch_indices}")

# Save and load index
px.save_state("index.bin")
px_loaded = ProxiFlat(k=2, num_threads=2, objective_function="l2")
px_loaded.load_state("index.bin")
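
To make the results of the example above easier to check by hand, the following pure-Python sketch performs an exhaustive ("flat") k-nearest-neighbor scan over the same four vectors. It is a conceptual reference, not Proxiss's implementation; the helper names are hypothetical, and the order in which Proxiss returns tied neighbors may differ.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance; it ranks neighbors the same way as L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_knn(vectors, query, k):
    """Return the indices of the k stored vectors closest to `query`.

    Every stored vector is compared against the query. Ties are resolved
    in favor of the lower index.
    """
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: squared_l2(vectors[i], query)
    )

def brute_force_knn_batched(vectors, queries, k):
    """Answer several queries by running the single-query scan on each."""
    return [brute_force_knn(vectors, q, k) for q in queries]

vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(brute_force_knn(vectors, [1.5, 1.5], k=2))  # [1, 2]
print(brute_force_knn_batched(vectors, [[0.5, 0.5], [2.5, 2.5]], k=2))  # [[0, 1], [2, 3]]
```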

ProxiKNN: Classification Search

from proxiss import ProxiKNN
import numpy as np

# Sample data with labels
features = np.array([
    [0.0, 0.0], [1.0, 1.0],
    [5.0, 5.0], [6.0, 6.0]
], dtype=np.float32)
labels = np.array([0, 0, 1, 1], dtype=np.float32)

# Initialize and train
knn = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn.fit(features, labels)

# Predict
query = np.array([0.5, 0.5], dtype=np.float32)
prediction = knn.predict([query])
print(f"Predicted class: {prediction}")

# Save and load model
knn.save_state("model_dir")
knn_loaded = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn_loaded.load_state("model_dir")
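
As a rough illustration of what k-NN classification means for the data above, this pure-Python sketch finds the k nearest training points and takes a majority vote over their labels. The majority-vote rule is the standard textbook approach and is an assumption here; the source does not state how ProxiKNN combines neighbor labels. The function name is hypothetical.

```python
import heapq
from collections import Counter

def knn_predict(features, labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest points."""
    nearest = heapq.nsmallest(
        k,
        range(len(features)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(features[i], query)),
    )
    votes = Counter(labels[i] for i in nearest)
    # On a tied vote, most_common returns the label seen first, i.e. the
    # label of the closer neighbor.
    return votes.most_common(1)[0][0]

features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
labels = [0, 0, 1, 1]
print(knn_predict(features, labels, [0.5, 0.5], k=2))  # 0
print(knn_predict(features, labels, [5.5, 5.5], k=2))  # 1
```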

ProxiPCA: Dimensionality Reduction + Search

from proxiss import ProxiPCA
import numpy as np

# High-dimensional sample data (e.g., 768-dimensional embeddings)
embeddings = np.random.randn(1000, 768).astype(np.float32)

# Initialize ProxiPCA with dimensionality reduction
# n_components is a fraction of the original dimensionality: 0.065 keeps 6.5% of the dimensions
# For 768D data: 768 * 0.065 ≈ 50 dimensions
pca = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)

# Fit PCA, transform data, and index in one step
pca.fit_transform_index(embeddings)

print(f"Original dimensions: {embeddings.shape[1]}")
print(f"Reduced dimensions: {pca.get_n_components()}")

# Query for nearest neighbors (query is automatically transformed)
query = np.random.randn(768).astype(np.float32)
indices = pca.find_indices(query)
print(f"Nearest neighbor indices: {indices}")

# Batch queries
queries = np.random.randn(10, 768).astype(np.float32)
batch_indices = pca.find_indices_batched(queries)
print(f"Batch results shape: {batch_indices.shape}")

# Insert new data (automatically transformed)
new_data = np.random.randn(100, 768).astype(np.float32)
pca.insert_data(new_data)

# Save and load (saves both PCA transformation and index)
pca.save_state("pca_index.bin")
pca_loaded = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)
pca_loaded.load_state("pca_index.bin")
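
The arithmetic behind the "≈ 50 dimensions" comment above can be checked with a few lines of plain Python. The source does not say whether the fractional result is rounded, truncated, or rounded up, so this sketch simply lists the candidates; `get_n_components()` reports the value actually used. The helper name is hypothetical.

```python
import math

def candidate_component_counts(original_dim, fraction):
    """Show what a fractional n_components could resolve to.

    The rounding rule Proxiss applies is not stated in the source, so
    the floor, nearest-integer, and ceiling values are all reported.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0.0, 1.0]")
    exact = original_dim * fraction
    return {
        "exact": exact,
        "floor": math.floor(exact),
        "round": round(exact),
        "ceil": math.ceil(exact),
    }

counts = candidate_component_counts(768, 0.065)
print(counts["floor"], counts["round"], counts["ceil"])  # 49 50 50
```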

Benchmarking

Proxiss includes benchmarking scripts to evaluate performance.

1. Generate Test Data

Create synthetic datasets for benchmarking:

python scripts/make_data.py --N 10000 --D 128 --X_path scripts/X.npy

2. Benchmark ProxiFlat

Test vector similarity search performance:

python scripts/bench_proxiss_flat.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

3. Benchmark ProxiKNN

Test classification performance:

python scripts/bench_proxiss_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

4. Benchmark ProxiPCA

Test dimensionality reduction + similarity search performance:

# -c flag specifies n_components as a fraction (0.0-1.0)
# Example: -c 0.065 means reduce to 6.5% of original dimensions
python scripts/bench_proxiss_pca.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2 -c 0.065

5. Compare with FAISS

Install FAISS and compare performance:

uv pip install faiss-cpu
python scripts/bench_faiss.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

6. Compare with scikit-learn

Install scikit-learn and compare KNN classification performance:

uv pip install scikit-learn
python scripts/bench_sklearn_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2
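
The benchmark scripts are not reproduced here. For readers who want to time their own queries, the sketch below shows one common way to measure throughput with the standard library. It is a generic harness under stated assumptions, not a description of what the Proxiss scripts do; `time_queries` and `dummy_search` are hypothetical names, and any callable that takes a single query (for example a bound `find_indices` method) could be passed in instead.

```python
import time

def time_queries(search_fn, queries, repeats=3):
    """Run `search_fn` on every query and report the best of `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q)
        best = min(best, time.perf_counter() - start)
    qps = len(queries) / best if best > 0 else float("inf")
    return {"n_queries": len(queries), "best_seconds": best, "queries_per_second": qps}

def dummy_search(query):
    """Stand-in workload; replace it with a real search call."""
    return sorted(query)[:2]

queries = [[float((i * j) % 7) for j in range(16)] for i in range(100)]
result = time_queries(dummy_search, queries)
print(f"{result['n_queries']} queries, best pass {result['best_seconds']:.6f} s")
```

Taking the best of several passes reduces noise from one-off effects such as cold caches. Comparisons are only meaningful when the dataset, k, thread count, and metric match across libraries.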

Example Usage

Interactive Inference

The examples/inference.py script demonstrates similarity search on real embeddings:

python examples/inference.py --embeddings examples/embeddings.npy --words examples/words.npy -k 5

This script loads pre-computed embeddings and allows interactive similarity search.

Development

Project Structure

  • Core C++ Implementation:

    • src/proxi_flat.cc, include/proxi_flat.h - Vector similarity search
    • src/proxi_knn.cc, include/proxi_knn.h - KNN classification
    • src/pca.cc, include/pca.h - PCA dimensionality reduction
    • src/proxi_pca.cc, include/proxi_pca.h - PCA + similarity search wrapper
    • src/priority_queue.cc, include/priority_queue.h - Custom priority queue
    • include/distance.hpp - Distance function implementations
  • Python Bindings:

    • bindings/proxi_flat_binding.cc - ProxiFlat Python interface
    • bindings/proxi_knn_binding.cc - ProxiKNN Python interface
    • bindings/proxi_pca_binding.cc - ProxiPCA Python interface
    • proxiss/ProxiFlat.py - Python wrapper for ProxiFlat
    • proxiss/ProxiKNN.py - Python wrapper for ProxiKNN
    • proxiss/ProxiPCA.py - Python wrapper for ProxiPCA
  • Build System:

    • CMakeLists.txt - C++ build configuration with automatic dependencies
    • pyproject.toml - Python package configuration

Running Tests

# Install test dependencies
uv pip install pytest

# Run all tests
python -m pytest tests/ -v

# Run specific tests
python -m pytest tests/test_proxi_flat.py -v
python -m pytest tests/test_proxi_knn.py -v
python -m pytest tests/test_proxi_pca.py -v

Building for Development

# Set up development environment
uv venv
source .venv/bin/activate

# Install development dependencies
uv pip install -r requirements.txt

# Reinstall after C++ changes
uv pip install -e . --force-reinstall --no-deps

API Reference

ProxiFlat Methods

  • __init__(k, num_threads, objective_function) - Initialize index
  • index_data(embeddings) - Index vector data
  • find_indices(query) - Find nearest neighbor indices
  • find_indices_batched(queries) - Batch query processing
  • save_state(filepath) - Save index to file
  • load_state(filepath) - Load index from file

ProxiKNN Methods

  • __init__(n_neighbours, n_jobs, distance_function) - Initialize classifier
  • fit(features, labels) - Train on labeled data
  • predict(features) - Predict class labels
  • save_state(directory) - Save model to directory
  • load_state(directory) - Load model from directory

ProxiPCA Methods

  • __init__(k, num_threads, objective_function, n_components) - Initialize with PCA reduction
  • fit_transform_index(embeddings) - Fit PCA, transform data, and index
  • find_indices(query) - Find nearest neighbors (query auto-transformed)
  • find_indices_batched(queries) - Batch query processing
  • insert_data(embeddings) - Insert new data (auto-transformed)
  • get_n_components() - Get actual number of PCA components used
  • get_components() - Get PCA component vectors
  • get_mean() - Get PCA mean vector
  • get_explained_variance() - Get variance explained by each component
  • save_state(filepath) - Save PCA transformation and index
  • load_state(filepath) - Load PCA transformation and index

License

Proxiss is licensed under the Apache License, Version 2.0. See LICENSE.txt for details.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

