Proxi: Accelerating nearest-neighbor search for high-dimensional data!

Proxi: Fast Nearest Neighbor Search

Proxi is a high-performance C++ library with Python bindings for nearest-neighbor search over high-dimensional data. It suits semantic search, recommendation systems, anomaly detection, and other applications that need fast similarity search. Proxi is currently optimized for Linux environments.

Key Features

  • Fast Performance: Leverages C++ for core computations and OpenMP for parallel processing to deliver high-speed k-NN searches.
  • Multiple Distance Metrics: Supports common distance and similarity functions, selected with the objective_function argument ("l2", "l1", or "cos"):
    • Euclidean (L2)
    • Manhattan (L1)
    • Cosine Similarity
  • Python-Friendly API: Easy-to-use Python bindings powered by pybind11 make it simple to integrate Proxi into Python projects. The main indexing and search functionality is available through the ProxiFlat class.
  • Batched Operations: Efficiently process multiple queries at once with batched search methods.
  • Simple Indexing: Straightforward data indexing process.
  • Lightweight: Minimal dependencies, focused on delivering core k-NN functionality efficiently.
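
For reference, the three measures listed above can be written out in a few lines of plain Python. This is a sketch of the textbook definitions only, not Proxi's C++ implementation; the function names are made up for illustration, and the README does not say whether the "cos" objective ranks by similarity or by a distance derived from it.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan (L1) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(l2_distance([0.0, 0.0], [3.0, 4.0]))        # 5.0
print(l1_distance([0.0, 0.0], [3.0, 4.0]))        # 7.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that smaller values mean "closer" for L2 and L1, while larger values mean "closer" for cosine similarity.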

Why Proxi?

Searching for similar items in large, high-dimensional datasets is a common challenge. A naive scan compares the query against every stored vector, which becomes slow and computationally expensive as the dataset and its dimensionality grow. Proxi tackles this by:

  • Providing optimized C++ implementations of search algorithms.
  • Utilizing parallel processing to speed up computations on multi-core processors.
  • Offering a simple API that doesn't require deep expertise in low-level programming.
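
To make the cost concrete, here is a minimal pure-Python sketch of an exhaustive k-NN scan under the L2 metric. It is a baseline for comparison, not Proxi's code: the function name brute_force_knn is hypothetical, and the README does not describe which algorithm ProxiFlat uses internally. Every query touches all N vectors of dimension D, which is the work that a compiled, multi-threaded implementation can speed up.

```python
import heapq
import math
from typing import List, Sequence


def brute_force_knn(
    data: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[int]:
    """Return the indices of the k rows of `data` closest to `query` (L2)."""

    def distance_to_query(i: int) -> float:
        # math.dist computes the Euclidean distance (Python 3.8+).
        return math.dist(data[i], query)

    # nsmallest keeps only k candidates at a time instead of sorting all N.
    return heapq.nsmallest(k, range(len(data)), key=distance_to_query)


data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(brute_force_knn(data, [1.4, 1.4], k=2))  # [1, 2]
```

A scan like this is also handy as a ground-truth check when validating results from a faster index on a small sample.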

Installation

Proxi is built from source. Ensure you are in a Linux environment for optimal compatibility.

Prerequisites

  • A C++ compiler supporting C++20 (e.g., GCC, Clang)
  • CMake (version 3.15 or higher)
  • Python (version 3.8 or higher)
  • OpenMP (usually included with GCC; may require separate installation for Clang, e.g., sudo apt-get install libomp-dev on Debian/Ubuntu)
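
Before building, it can help to confirm that the tools above are visible on your PATH. The following is a rough sketch using only the standard library; check_prerequisites is a hypothetical helper, and it only tests for presence. It does not verify the CMake version, C++20 support, or an OpenMP installation.

```python
import shutil
import sys
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report which of the listed build prerequisites can be found."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        # Any one of these common compiler driver names is enough.
        "c++ compiler": any(
            shutil.which(name) is not None for name in ("g++", "clang++", "c++")
        ),
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```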

Building from Source

  1. Clone the repository:

    git clone https://github.com/BiradarSiddhant02/Proxi.git
    cd Proxi
    
  2. Set up a Python virtual environment (recommended):

    python3 -m venv .venv
    source .venv/bin/activate 
    
  3. Install build dependencies:

    pip install -r requirements.txt 
    

    (Check that requirements.txt lists scikit-build-core and pybind11.)

  4. Build and install Proxi:

    pip install .
    

    For development, you can use an editable install:

    pip install -e .
    

    Either command invokes scikit-build-core, which uses CMake to compile the C++ core and build the Python extension.

Quick Start

Here's a simple example of how to use Proxi in Python with the ProxiFlat class:

from proxi import ProxiFlat
import numpy as np

# 1. Sample data
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0]
], dtype=np.float32)
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]

# 2. Initialize ProxiFlat
# Parameters: k (number of neighbors), num_threads, objective_function ("l2", "l1", or "cos")
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")

# 3. Index your data
px.index_data(embeddings.tolist(), doc_ids) # ProxiFlat expects lists of lists for embeddings

# 4. Prepare a query vector
query_vector = np.array([1.5, 1.5], dtype=np.float32)

# 5. Find nearest neighbor indices
indices = px.find_indices(query_vector.tolist())
print(f"Indices of nearest neighbors: {indices}")
# Example output: Indices of nearest neighbors: [1, 2] (or [2, 1]; both points are equidistant from the query)

# 6. Find nearest neighbor documents
docs = px.find_docs(query_vector.tolist())
print(f"Nearest documents: {docs}")
# Example output: Nearest documents: ['doc_b', 'doc_c'] (or ['doc_c', 'doc_b'], since the two are equidistant)

# 7. Batched queries (for multiple queries at once)
query_batch = np.array([
    [0.5, 0.5],
    [2.5, 2.5]
], dtype=np.float32)

batch_indices = px.find_indices_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch indices: {batch_indices}")
# Example output: Batch indices: [[0, 1], [2, 3]] (order within each pair may vary; each pair is equidistant from its query)

batch_docs = px.find_docs_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch documents: {batch_docs}")
# Example output: Batch documents: [['doc_a', 'doc_b'], ['doc_c', 'doc_d']] (order within each pair may vary)
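
Because the example passes embeddings as lists of lists alongside a separate list of document IDs, a small pre-flight check can catch shape mistakes before indexing. The helper below is a hypothetical sketch, not part of Proxi; whether Proxi performs similar validation itself is not stated in this README.

```python
from typing import Iterable, List, Tuple


def prepare_index_inputs(
    embeddings: Iterable[Iterable[float]], doc_ids: Iterable[object]
) -> Tuple[List[List[float]], List[str]]:
    """Convert inputs to plain lists and check that their shapes agree."""
    rows = [[float(x) for x in row] for row in embeddings]
    ids = [str(d) for d in doc_ids]
    if not rows:
        raise ValueError("embeddings must contain at least one vector")
    dim = len(rows[0])
    if any(len(row) != dim for row in rows):
        raise ValueError("all embeddings must have the same dimension")
    if len(rows) != len(ids):
        raise ValueError(f"got {len(rows)} embeddings but {len(ids)} document ids")
    return rows, ids


rows, ids = prepare_index_inputs([[0.0, 0.0], [1.0, 1.0]], ["doc_a", "doc_b"])
print(rows, ids)  # [[0.0, 0.0], [1.0, 1.0]] ['doc_a', 'doc_b']
```

The helper also accepts a NumPy array, since it only iterates over rows and values. The returned lists could then be passed to px.index_data(rows, ids) as in step 3 above.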

Benchmarking

Proxi includes scripts to benchmark its performance and to generate sample data.

1. Generate Mock Data

Use the scripts/make_data.py script to create synthetic datasets for benchmarking:

python scripts/make_data.py --N 10000 --D 128 --X_path X_data.npy --docs_path docs_data.npy

This will generate X_data.npy (feature vectors) and docs_data.npy (document identifiers).
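
If you want to see what such files might contain, the sketch below writes a comparable pair of arrays with NumPy. It is an illustration under assumptions, not the contents of scripts/make_data.py: the function name, the standard-normal distribution, the float32 dtype, and the doc_<i> naming scheme are all choices made here.

```python
import os
import tempfile

import numpy as np


def make_mock_data(n: int, d: int, seed: int = 0):
    """Return (n, d) float32 feature vectors and n string identifiers."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)).astype(np.float32)
    docs = np.array([f"doc_{i}" for i in range(n)])
    return X, docs


X, docs = make_mock_data(n=1000, d=128)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_data.npy")
    docs_path = os.path.join(tmp, "docs_data.npy")
    np.save(x_path, X)
    np.save(docs_path, docs)
    loaded_X = np.load(x_path)
    loaded_docs = np.load(docs_path)

print(loaded_X.shape, loaded_X.dtype, loaded_docs.shape)  # (1000, 128) float32 (1000,)
```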

2. Run Proxi Benchmark

Use scripts/bench_proxi.py to test Proxi's performance:

python scripts/bench_proxi.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2

Adjust -k (number of neighbors), --threads, and --objective as needed.
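
For ad hoc measurements outside the bundled script, a generic timing loop is often enough. The helper below is a hypothetical sketch and does not reflect how scripts/bench_proxi.py measures time; it accepts any callable that takes one query, and fake_search is only a stand-in.

```python
import time
from typing import Callable, Sequence


def time_queries(
    search: Callable[[Sequence[float]], object],
    queries: Sequence[Sequence[float]],
    repeats: int = 3,
) -> float:
    """Return the best-of-`repeats` average time per query, in seconds."""
    if not queries:
        raise ValueError("queries must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            search(query)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(queries))
    return best


def fake_search(query: Sequence[float]) -> list:
    # Stand-in for a real search call.
    return sorted(query)


queries = [[0.5, 0.5], [2.5, 2.5]] * 100
per_query = time_queries(fake_search, queries)
print(f"{per_query * 1e6:.2f} microseconds per query")
```

With a ProxiFlat instance like the px object from the Quick Start, px.find_indices could be passed in place of fake_search. Taking the best of several repeats reduces noise from other processes.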

3. Run FAISS Benchmark (for comparison)

If you have FAISS installed (pip install faiss-cpu or faiss-gpu), you can run a comparative benchmark:

python scripts/bench_faiss.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2
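
Beyond speed, it is worth checking that two libraries return the same neighbors. One simple measure is the average fraction of shared indices per query, sketched below. The function name overlap_at_k and the sample result lists are made up for illustration; they are not output from either benchmark script.

```python
from typing import Sequence


def overlap_at_k(
    results_a: Sequence[Sequence[int]], results_b: Sequence[Sequence[int]]
) -> float:
    """Mean fraction of neighbor indices shared between two result sets."""
    if len(results_a) != len(results_b):
        raise ValueError("both result sets must cover the same queries")
    if not results_a:
        raise ValueError("result sets must not be empty")
    total = 0.0
    for a, b in zip(results_a, results_b):
        # Compare as sets so that neighbor order does not matter.
        total += len(set(a) & set(b)) / max(len(a), 1)
    return total / len(results_a)


library_a = [[0, 1], [2, 3]]  # illustrative results for two queries
library_b = [[1, 0], [2, 5]]
print(overlap_at_k(library_a, library_b))  # 0.75
```

Comparing as sets means that ties returned in a different order still count as agreement.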

Interactive Inference Example

Proxi includes an interactive script examples/inference.py that allows you to perform similarity searches on your own data and compare results with FAISS.

1. Download Embeddings and Corresponding Text

To use the inference script, you first need a dataset of embeddings and the corresponding text/words they represent.

For a demonstration, you can download a zip file of pre-computed embeddings and words, then unzip it.

2. Run the Inference Script

Navigate to the Proxi directory and run the script from your terminal:

python examples/inference.py --embeddings /path/to/your/embeddings.npy --words /path/to/your/words.npy -k 5

Arguments:

  • --embeddings: Path to the .npy file containing your numerical embeddings.
  • --words: Path to the .npy file containing the corresponding text entries.
  • -k: The number of nearest neighbors to retrieve for each query.
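
A command-line interface with these three arguments could be declared with argparse as sketched below. This is an illustration, not the actual code in examples/inference.py; the default of 5 for -k and the choice to make both paths required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the three documented arguments."""
    parser = argparse.ArgumentParser(
        description="Interactive similarity search (illustrative sketch)"
    )
    parser.add_argument(
        "--embeddings", required=True, help="Path to the .npy file of embeddings"
    )
    parser.add_argument(
        "--words", required=True, help="Path to the .npy file of text entries"
    )
    parser.add_argument(
        "-k", type=int, default=5, help="Number of nearest neighbors per query"
    )
    return parser


args = build_parser().parse_args(
    ["--embeddings", "emb.npy", "--words", "words.npy", "-k", "3"]
)
print(args.embeddings, args.words, args.k)  # emb.npy words.npy 3
```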

3. Interactive Search

Once the script loads the data and builds the Proxi and FAISS indexes, it will prompt you to enter a word or phrase:

Loading data...
Loaded 384000 embeddings with dimension 384

Building Proxi index...
Building FAISS index...
Loading sentence transformer model...

==================================================
Enter a word or phrase (or 'quit' to exit): your search query

Type your query and press Enter. The script will then display a table comparing the top-k results from Proxi and FAISS.

Similarity search results for: 'your search query'
+--------+-----------------+-----------------+
|   Rank | Proxi Results   | FAISS Results   |
+========+=================+=================+
|      1 | result_proxi_1  | result_faiss_1  |
|      2 | result_proxi_2  | result_faiss_2  |
|    ... | ...             | ...             |
+--------+-----------------+-----------------+

Enter 'quit' to exit the script.

This example provides a hands-on way to see Proxi in action.

Building and Development

  • The core indexing and search logic is implemented in C++ within the ProxiFlat class (src/proxi_flat.cc, include/proxi_flat.h).
  • Helper functions for distance calculations are in include/distance.hpp.
  • Python bindings are defined in bindings/proxi_binding.cc.
  • CMakeLists.txt manages the C++ build process.
  • pyproject.toml and scikit-build-core handle the Python package build and C++ compilation.
  • Tests are located in tests/test_proxi_flat.py. Run them using unittest:
    python -m unittest tests/test_proxi_flat.py
    

License

Proxi is licensed under the Apache License, Version 2.0. See the LICENSE.txt file for details.

Contributing

Contributions are welcome! If you have suggestions, bug reports, or want to contribute code, please feel free to open an issue or submit a pull request.


Happy Searching with Proxi!
