Proxiss: Accelerating nearest-neighbor search for high-dimensional data!

Proxiss: Fast Vector Similarity Search

Proxiss is a high-performance C++ library with Python bindings for fast vector similarity search in high-dimensional data. It provides nearest-neighbor search for applications such as semantic search, recommendation systems, and machine learning. It is currently optimized for Linux environments.

Key Features

  • High Performance: Optimized C++ implementation with OpenMP parallelization for fast k-NN searches
  • Multiple Distance Metrics: Supports common distance functions:
    • Euclidean (L2)
    • Manhattan (L1)
    • Cosine Similarity
  • Three Search Modes:
    • ProxiFlat: Vector-only indexing for pure similarity search
    • ProxiKNN: Classification-focused search with label storage
    • ProxiPCA: Dimensionality reduction combined with similarity search
  • Python Integration: Clean Python API powered by pybind11
  • Batched Operations: Efficient batch processing for multiple queries
  • Automatic Dependencies: CMake automatically downloads and configures required dependencies
  • Lightweight Design: Focused on core vector search functionality
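
For readers who want the three metrics spelled out, here is a minimal NumPy sketch of what each one computes. It is an illustration only, not the code Proxiss runs internally. The function names are hypothetical, and expressing cosine as a distance (1 minus the similarity) is an assumption about how a ranking could be derived from it.

```python
import numpy as np


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def l1_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan (L1) distance: sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)


a = np.array([0.0, 3.0], dtype=np.float32)
b = np.array([4.0, 0.0], dtype=np.float32)

print(l2_distance(a, b))      # 5.0
print(l1_distance(a, b))      # 7.0
print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
```

L1 and L2 depend on vector magnitude, while cosine depends only on direction, which is why cosine is a common choice for text embeddings.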

Why Proxiss?

Vector similarity search is fundamental to many modern applications, but traditional methods can be slow and resource-intensive. Proxiss addresses this by:

  • Providing optimized C++ implementations with parallel processing
  • Offering clean, simple APIs that hide implementation complexity
  • Focusing on core functionality without unnecessary overhead
  • Supporting pure vector search, classification, and dimensionality reduction use cases

Installation

Proxiss builds from source with automatic dependency management.

Prerequisites

  • Linux environment (Ubuntu, Debian, CentOS, etc.)
  • Python 3.10 or higher
  • CMake 3.16 or higher
  • UV package manager

Note: The build system automatically installs clang++, OpenMP, Eigen3, and pybind11 if not found.

Building from Source

  1. Clone the repository:

    git clone https://github.com/BiradarSiddhant02/Proxiss.git
    cd Proxiss
    
  2. Install UV (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  3. Create a virtual environment and install Proxiss:

    uv venv
    source .venv/bin/activate
    uv pip install . -v
    

Quick Start

ProxiFlat: Vector Similarity Search

from proxiss import ProxiFlat
import numpy as np

# Sample data
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0]
], dtype=np.float32)

# Initialize ProxiFlat
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")

# Index your vectors
px.index_data(embeddings)

# Query for nearest neighbors
query = np.array([1.5, 1.5], dtype=np.float32)
indices = px.find_indices(query)
print(f"Nearest neighbor indices: {indices}")

# Batch queries
queries = np.array([[0.5, 0.5], [2.5, 2.5]], dtype=np.float32)
batch_indices = px.find_indices_batched(queries)
print(f"Batch results: {batch_indices}")

# Save and load index
px.save_state("index.bin")
px_loaded = ProxiFlat(k=2, num_threads=2, objective_function="l2")
px_loaded.load_state("index.bin")
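
To sanity-check the indices returned for the sample data above, an exhaustive NumPy search can serve as a reference. This is a minimal sketch that does not use Proxiss at all. `brute_force_knn` is a hypothetical helper, and the tie-breaking order shown here (lower index first) is an assumption that Proxiss may not share.

```python
import numpy as np


def brute_force_knn(data: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the row indices of the k rows of `data` closest to `query` (L2)."""
    distances = np.linalg.norm(data - query, axis=1)
    # A stable sort keeps the lower index first when distances tie.
    return np.argsort(distances, kind="stable")[:k]


embeddings = np.array(
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], dtype=np.float32
)
query = np.array([1.5, 1.5], dtype=np.float32)

reference = brute_force_knn(embeddings, query, k=2)
print(reference)  # rows 1 and 2 are equally close to the query
```

Because rows 1 and 2 are equidistant from the query, compare results as sets rather than as ordered lists.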

ProxiKNN: Classification Search

from proxiss import ProxiKNN
import numpy as np

# Sample data with labels
features = np.array([
    [0.0, 0.0], [1.0, 1.0],
    [5.0, 5.0], [6.0, 6.0]
], dtype=np.float32)
labels = np.array([0, 0, 1, 1], dtype=np.float32)

# Initialize and train
knn = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn.fit(features, labels)

# Predict
query = np.array([0.5, 0.5], dtype=np.float32)
prediction = knn.predict([query])
print(f"Predicted class: {prediction}")

# Save and load model
knn.save_state("model_dir")
knn_loaded = ProxiKNN(n_neighbours=2, n_jobs=2, distance_function="l2")
knn_loaded.load_state("model_dir")
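
Conceptually, k-NN classification finds the k closest training points and returns the most common label among them. The pure-Python sketch below illustrates that idea on the same sample data. `knn_predict` is a hypothetical helper, and how ProxiKNN resolves tied votes is not covered here.

```python
import math
from collections import Counter


def knn_predict(features, labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(range(len(features)), key=lambda i: math.dist(features[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]


features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
labels = [0, 0, 1, 1]

print(knn_predict(features, labels, [0.5, 0.5], k=2))  # 0
print(knn_predict(features, labels, [5.5, 5.5], k=2))  # 1
```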

ProxiPCA: Dimensionality Reduction + Search

from proxiss import ProxiPCA
import numpy as np

# High-dimensional sample data (e.g., 768-dimensional embeddings)
embeddings = np.random.randn(1000, 768).astype(np.float32)

# Initialize ProxiPCA with dimensionality reduction
# n_components is a fraction of the original dimensionality (0.0-1.0):
# 0.065 keeps 6.5% of the dimensions, so 768-D data is reduced to 768 * 0.065 ≈ 50 dimensions
pca = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)

# Fit PCA, transform data, and index in one step
pca.fit_transform_index(embeddings)

print(f"Original dimensions: {embeddings.shape[1]}")
print(f"Reduced dimensions: {pca.get_n_components()}")

# Query for nearest neighbors (query is automatically transformed)
query = np.random.randn(768).astype(np.float32)
indices = pca.find_indices(query)
print(f"Nearest neighbor indices: {indices}")

# Batch queries
queries = np.random.randn(10, 768).astype(np.float32)
batch_indices = pca.find_indices_batched(queries)
print(f"Batch results shape: {batch_indices.shape}")

# Insert new data (automatically transformed)
new_data = np.random.randn(100, 768).astype(np.float32)
pca.insert_data(new_data)

# Save and load (saves both PCA transformation and index)
pca.save_state("pca_index.bin")
pca_loaded = ProxiPCA(k=5, num_threads=4, objective_function="l2", n_components=0.065)
pca_loaded.load_state("pca_index.bin")
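
The fit, transform, and query steps above can be sketched in plain NumPy to show what happens to the data before it is searched. This is a generic PCA via singular value decomposition, not Proxiss's own implementation. Converting the fraction to a component count by rounding is an assumption, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 768)).astype(np.float32)

# Assumption: the fraction is turned into a count by rounding.
n_components = round(data.shape[1] * 0.065)  # 768 * 0.065 = 49.92 -> 50

# Fit: center the data, then take the top right-singular vectors as components.
mean = data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(data - mean, full_matrices=False)
components = vt[:n_components]  # shape (50, 768)

# Transform: project the centered vectors onto the components.
reduced = (data - mean) @ components.T  # shape (200, 50)

# Queries go through the same centering and projection before the search.
query = rng.standard_normal(768).astype(np.float32)
reduced_query = (query - mean) @ components.T  # shape (50,)

print(reduced.shape, reduced_query.shape)  # (200, 50) (50,)
```

Searching in 50 dimensions instead of 768 cuts the cost of every distance computation, at the price of neighbors that may differ from an exact search in the original space.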

Benchmarking

Proxiss includes benchmarking scripts to evaluate performance.

1. Generate Test Data

Create synthetic datasets for benchmarking:

python scripts/make_data.py --N 10000 --D 128 --X_path scripts/X.npy

2. Benchmark ProxiFlat

Test vector similarity search performance:

python scripts/bench_proxiss_flat.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

3. Benchmark ProxiKNN

Test classification performance:

python scripts/bench_proxiss_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

4. Benchmark ProxiPCA

Test dimensionality reduction + similarity search performance:

# The -c flag sets n_components as a fraction of the original dimensions (0.0-1.0)
# Example: -c 0.065 keeps 6.5% of the original dimensions
python scripts/bench_proxiss_pca.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2 -c 0.065
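
Since ProxiPCA searches in a reduced space, speed is only half the picture: its neighbors can differ from those of an exact search. One common way to quantify this is recall@k. The sketch below is not part of the benchmark scripts. `recall_at_k` is a hypothetical helper and the index lists are made-up values. In practice the two inputs could be the outputs of `find_indices_batched` from ProxiFlat and ProxiPCA on the same queries.

```python
def recall_at_k(exact, approximate):
    """Average fraction of the exact neighbors that the approximate search also returned."""
    total = 0.0
    for exact_row, approx_row in zip(exact, approximate):
        total += len(set(exact_row) & set(approx_row)) / len(exact_row)
    return total / len(exact)


# Hypothetical results for two queries with k=3.
exact = [[4, 9, 1], [7, 2, 5]]
approximate = [[4, 1, 8], [7, 2, 5]]

print(recall_at_k(exact, approximate))  # (2/3 + 3/3) / 2 = 0.8333...
```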

5. Compare with FAISS

Install FAISS and compare performance:

uv pip install faiss-cpu
python scripts/bench_faiss.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

6. Compare with scikit-learn

Install scikit-learn and compare KNN classification performance:

uv pip install scikit-learn
python scripts/bench_sklearn_knn.py --X_path scripts/X.npy -k 5 --threads 4 --objective l2

Example Usage

Interactive Inference

The examples/inference.py script demonstrates similarity search on real embeddings:

python examples/inference.py --embeddings examples/embeddings.npy --words examples/words.npy -k 5

This script loads pre-computed embeddings and allows interactive similarity search.
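
Turning search results back into words is a simple lookup, assuming the word list is aligned row by row with the embeddings and that the search returns row positions. The sketch below uses made-up values and does not reproduce the script itself.

```python
# Hypothetical vocabulary; position i corresponds to row i of the embeddings.
words = ["king", "queen", "apple", "banana", "prince"]

# Hypothetical search result: row indices of the nearest neighbors.
indices = [1, 4]

neighbors = [words[i] for i in indices]
print(neighbors)  # ['queen', 'prince']
```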

Development

Project Structure

  • Core C++ Implementation:

    • src/proxi_flat.cc, include/proxi_flat.h - Vector similarity search
    • src/proxi_knn.cc, include/proxi_knn.h - KNN classification
    • src/pca.cc, include/pca.h - PCA dimensionality reduction
    • src/proxi_pca.cc, include/proxi_pca.h - PCA + similarity search wrapper
    • src/priority_queue.cc, include/priority_queue.h - Custom priority queue
    • include/distance.hpp - Distance function implementations
  • Python Bindings:

    • bindings/proxi_flat_binding.cc - ProxiFlat Python interface
    • bindings/proxi_knn_binding.cc - ProxiKNN Python interface
    • bindings/proxi_pca_binding.cc - ProxiPCA Python interface
    • proxiss/ProxiFlat.py - Python wrapper for ProxiFlat
    • proxiss/ProxiKNN.py - Python wrapper for ProxiKNN
    • proxiss/ProxiPCA.py - Python wrapper for ProxiPCA
  • Build System:

    • CMakeLists.txt - C++ build configuration with automatic dependencies
    • pyproject.toml - Python package configuration

Running Tests

# Install test dependencies
uv pip install pytest

# Run all tests
python -m pytest tests/ -v

# Run specific tests
python -m pytest tests/test_proxi_flat.py -v
python -m pytest tests/test_proxi_knn.py -v
python -m pytest tests/test_proxi_pca.py -v

Building for Development

# Set up development environment
uv venv
source .venv/bin/activate

# Install development dependencies
uv pip install -r requirements.txt

# Reinstall after C++ changes
uv pip install -e . --force-reinstall --no-deps

API Reference

ProxiFlat Methods

  • __init__(k, num_threads, objective_function) - Initialize index
  • index_data(embeddings) - Index vector data
  • find_indices(query) - Find nearest neighbor indices
  • find_indices_batched(queries) - Batch query processing
  • save_state(filepath) - Save index to file
  • load_state(filepath) - Load index from file

ProxiKNN Methods

  • __init__(n_neighbours, n_jobs, distance_function) - Initialize classifier
  • fit(features, labels) - Train on labeled data
  • predict(features) - Predict class labels
  • save_state(directory) - Save model to directory
  • load_state(directory) - Load model from directory

ProxiPCA Methods

  • __init__(k, num_threads, objective_function, n_components) - Initialize with PCA reduction
  • fit_transform_index(embeddings) - Fit PCA, transform data, and index
  • find_indices(query) - Find nearest neighbors (query auto-transformed)
  • find_indices_batched(queries) - Batch query processing
  • insert_data(embeddings) - Insert new data (auto-transformed)
  • get_n_components() - Get actual number of PCA components used
  • get_components() - Get PCA component vectors
  • get_mean() - Get PCA mean vector
  • get_explained_variance() - Get variance explained by each component
  • save_state(filepath) - Save PCA transformation and index
  • load_state(filepath) - Load PCA transformation and index

License

Proxiss is licensed under the Apache License, Version 2.0. See LICENSE.txt for details.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

