Proxi: Accelerating nearest-neighbor search for high-dimensional data!

Proxi: Fast Nearest Neighbor Search

Proxi is a high-performance C++ library with Python bindings, designed to accelerate nearest-neighbor search for high-dimensional data. Whether you're working on semantic search, recommendation systems, anomaly detection, or any application requiring fast similarity searches, Proxi offers an efficient and easy-to-use solution, currently optimized for Linux environments.

Key Features

  • Fast Performance: Leverages C++ for core computations and OpenMP for parallel processing to deliver high-speed k-NN searches.
  • Multiple Distance Metrics: Supports common distance functions:
    • Euclidean (L2)
    • Manhattan (L1)
    • Cosine Similarity
  • Python-Friendly API: Easy-to-use Python bindings powered by pybind11 make integration into Python projects straightforward. The main indexing and search functionality is available through the ProxiFlat class.
  • Batched Operations: Efficiently process multiple queries at once with batched search methods.
  • Simple Indexing: Straightforward data indexing process.
  • Lightweight: Minimal dependencies, focused on delivering core k-NN functionality efficiently.
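
For reference, the three metrics listed above can be written in a few lines of NumPy. This is a minimal sketch of the textbook definitions, not Proxi's C++ implementation; the function names are hypothetical, and whether Proxi ranks by cosine similarity or by a derived cosine distance is not stated here.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (undefined for zero vectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])
print(l2_distance(a, b))        # 5.0
print(l1_distance(a, b))        # 7.0
print(cosine_similarity(a, b))  # 0.0 (orthogonal vectors)
```

Smaller L1 and L2 values mean closer vectors, while a larger cosine similarity means closer vectors.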

Why Proxi?

Searching for similar items in large, high-dimensional datasets is a common challenge. Traditional methods can be slow and computationally expensive. Proxi tackles this by:

  • Providing optimized C++ implementations of search algorithms.
  • Utilizing parallel processing to speed up computations on multi-core processors.
  • Offering a simple API that doesn't require deep expertise in low-level programming.

Installation

Proxi is built from source. Ensure you are in a Linux environment for optimal compatibility.

Prerequisites

  • A C++ compiler supporting C++20 (e.g., GCC, Clang)
  • CMake (version 3.15 or higher)
  • Python (version 3.8 or higher)
  • OpenMP (usually included with GCC; may require separate installation for Clang, e.g., sudo apt-get install libomp-dev on Debian/Ubuntu)
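
Before building, it can help to confirm that the basic tools are on your PATH. The sketch below is a hypothetical helper, not part of Proxi. It checks only the Python version and the presence of CMake and a C++ compiler; it does not verify the CMake version, C++20 support, or OpenMP.

```python
import shutil
import sys

def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        # Any one of these common compiler driver names is enough.
        "c++ compiler": any(
            shutil.which(name) is not None for name in ("g++", "clang++", "c++")
        ),
    }

for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```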

Building from Source

  1. Clone the repository:

    git clone https://github.com/BiradarSiddhant02/Proxi.git
    cd Proxi
    
  2. Set up a Python virtual environment (recommended):

    python3 -m venv .venv
    source .venv/bin/activate 
    
  3. Install build dependencies:

    pip install -r requirements.txt 
    

    (Ensure requirements.txt includes scikit-build-core and pybind11).

  4. Build and install Proxi:

    pip install .
    

    For development, you can use an editable install:

    pip install -e .
    

    This command invokes scikit-build-core which uses CMake to compile the C++ core and create the Python extension.
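
After the install finishes, a quick way to confirm that the extension is visible to the active interpreter is to look up the module without importing it. This sketch assumes the importable name is proxi, as used in the Quick Start; the helper name is hypothetical.

```python
import importlib.util

def is_installed(module_name="proxi"):
    """Return True if the module can be located by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None

print("proxi importable:", is_installed())
```

If this prints False inside the virtual environment you built in, the install step most likely targeted a different interpreter.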

Quick Start

Here's a simple example of how to use Proxi in Python with the ProxiFlat class:

from proxi import ProxiFlat
import numpy as np

# 1. Sample data
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0]
], dtype=np.float32)
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]

# 2. Initialize ProxiFlat
# Parameters: k (number of neighbors), num_threads, objective_function ("l2", "l1", or "cos")
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")

# 3. Index your data
px.index_data(embeddings.tolist(), doc_ids) # ProxiFlat expects lists of lists for embeddings

# 4. Prepare a query vector
query_vector = np.array([1.5, 1.5], dtype=np.float32)

# 5. Find nearest neighbor indices
indices = px.find_indices(query_vector.tolist())
print(f"Indices of nearest neighbors: {indices}")
# Example output: Indices of nearest neighbors: [1, 2] (or [2, 1]: both points are equidistant from the query, so the order may vary)

# 6. Find nearest neighbor documents
docs = px.find_docs(query_vector.tolist())
print(f"Nearest documents: {docs}")
# Example output: Nearest documents: ['doc_b', 'doc_c']

# 7. Batched queries (for multiple queries at once)
query_batch = np.array([
    [0.5, 0.5],
    [2.5, 2.5]
], dtype=np.float32)

batch_indices = px.find_indices_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch indices: {batch_indices}")
# Example output: Batch indices: [[0, 1], [2, 3]]

batch_docs = px.find_docs_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch documents: {batch_docs}")
# Example output: Batch documents: [['doc_a', 'doc_b'], ['doc_c', 'doc_d']]
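
To sanity-check results like the ones above, an exact brute-force search in NumPy is a convenient reference. The sketch below does not use Proxi; brute_force_knn is a hypothetical helper that ranks by squared L2 distance and breaks ties in favor of the lower index, which may differ from how Proxi orders equidistant neighbors.

```python
import numpy as np

def brute_force_knn(data, queries, k):
    """Exact k-NN under L2 distance; returns indices with shape (num_queries, k)."""
    data = np.asarray(data, dtype=np.float32)
    queries = np.atleast_2d(np.asarray(queries, dtype=np.float32))
    # Pairwise differences via broadcasting: (Q, 1, D) - (1, N, D) -> (Q, N, D)
    diffs = queries[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # A stable sort keeps the lower index first when distances are equal.
    return np.argsort(sq_dists, axis=1, kind="stable")[:, :k]

embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
query_batch = [[0.5, 0.5], [2.5, 2.5]]

idx = brute_force_knn(embeddings, query_batch, k=2)
print(idx.tolist())                                # [[0, 1], [2, 3]]
print([[doc_ids[i] for i in row] for row in idx])  # [['doc_a', 'doc_b'], ['doc_c', 'doc_d']]
```

This approach materializes a (Q, N, D) array, so it is only practical for small datasets.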

Benchmarking

Proxi includes scripts to benchmark its performance and to generate sample data.

1. Generate Mock Data

Use the scripts/make_data.py script to create synthetic datasets for benchmarking:

python scripts/make_data.py --N 10000 --D 128 --X_path X_data.npy --docs_path docs_data.npy

This will generate X_data.npy (feature vectors) and docs_data.npy (document identifiers).
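
If you want comparable files without running the script, the following sketch produces data of the same general shape. It is an assumption about what such a generator might do: the distribution, dtype, and identifier format used by scripts/make_data.py may differ, and make_mock_data is a hypothetical name.

```python
import os
import tempfile

import numpy as np

def make_mock_data(n, d, seed=0):
    """Return (X, docs): n random float32 vectors of dimension d and n string IDs."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)).astype(np.float32)
    docs = np.array([f"doc_{i}" for i in range(n)])
    return X, docs

X, docs = make_mock_data(1000, 128)

# Written to a temporary directory here; point these paths wherever you need.
out_dir = tempfile.mkdtemp()
x_path = os.path.join(out_dir, "X_data.npy")
docs_path = os.path.join(out_dir, "docs_data.npy")
np.save(x_path, X)
np.save(docs_path, docs)
print(X.shape, docs.shape)  # (1000, 128) (1000,)
```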

2. Run Proxi Benchmark

Use scripts/bench_proxi.py to test Proxi's performance:

python scripts/bench_proxi.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2

Adjust -k (number of neighbors), --threads, and --objective as needed.

3. Run FAISS Benchmark (for comparison)

If you have FAISS installed (pip install faiss-cpu or faiss-gpu), you can run a comparative benchmark:

python scripts/bench_faiss.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2
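
When comparing two libraries, it is useful to check agreement of results alongside timing. The sketch below computes the average per-query overlap between two sets of neighbor indices. It is not part of the benchmark scripts; recall_at_k and the sample inputs are hypothetical.

```python
def recall_at_k(candidate, reference):
    """Mean fraction of each reference row's neighbors that appear in the candidate row."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same number of queries")
    total = 0.0
    for cand_row, ref_row in zip(candidate, reference):
        # Order within a row is ignored; only membership counts.
        total += len(set(cand_row) & set(ref_row)) / len(ref_row)
    return total / len(reference)

proxi_like = [[0, 1], [2, 3]]   # stand-in for one library's batched indices
faiss_like = [[1, 0], [2, 4]]   # stand-in for the other library's indices
print(recall_at_k(proxi_like, faiss_like))  # 0.75
```

For exact searches, any value below 1.0 points to tie-breaking differences or a mismatch in metric or data.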

Interactive Inference Example

Proxi includes an interactive script examples/inference.py that allows you to perform similarity searches on your own data and compare results with FAISS.

1. Download Embeddings and Corresponding Text

To use the inference script, you first need a dataset of embeddings and the corresponding text/words they represent.

For a demonstration, you can download pre-computed embeddings and words as a zip file and unzip it.

2. Run the Inference Script

Navigate to the Proxi directory and run the script from your terminal:

python examples/inference.py --embeddings /path/to/your/embeddings.npy --words /path/to/your/words.npy -k 5

Arguments:

  • --embeddings: Path to the .npy file containing your numerical embeddings.
  • --words: Path to the .npy file containing the corresponding text entries.
  • -k: The number of nearest neighbors to retrieve for each query.
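
A command-line interface with these three arguments could be declared as follows. This is a sketch of an equivalent parser, not the code in examples/inference.py; the default of 5 for -k and the help strings are assumptions.

```python
import argparse

def build_parser():
    """Build a parser accepting --embeddings, --words, and -k."""
    parser = argparse.ArgumentParser(description="Interactive similarity search")
    parser.add_argument("--embeddings", required=True,
                        help="Path to the .npy file containing the embeddings")
    parser.add_argument("--words", required=True,
                        help="Path to the .npy file containing the text entries")
    parser.add_argument("-k", type=int, default=5,  # default value is an assumption
                        help="Number of nearest neighbors to retrieve per query")
    return parser

args = build_parser().parse_args(
    ["--embeddings", "embeddings.npy", "--words", "words.npy", "-k", "5"]
)
print(args.embeddings, args.words, args.k)  # embeddings.npy words.npy 5
```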

3. Interactive Search

Once the script loads the data and builds the Proxi and FAISS indexes, it will prompt you to enter a word or phrase:

Loading data...
Loaded 384000 embeddings with dimension 384

Building Proxi index...
Building FAISS index...
Loading sentence transformer model...

==================================================
Enter a word or phrase (or 'quit' to exit): your search query

Type your query and press Enter. The script will then display a table comparing the top-k results from Proxi and FAISS.

Similarity search results for: 'your search query'
+--------+-----------------+-----------------+
|   Rank | Proxi Results   | FAISS Results   |
+========+=================+=================+
|      1 | result_proxi_1  | result_faiss_1  |
|      2 | result_proxi_2  | result_faiss_2  |
|    ... | ...             | ...             |
+--------+-----------------+-----------------+

Enter 'quit' to exit the script.

This example provides a hands-on way to see Proxi in action.

Building and Development

  • The core indexing and search logic is implemented in C++ within the ProxiFlat class (src/proxi_flat.cc, include/proxi_flat.h).
  • Helper functions for distance calculations are in include/distance.hpp.
  • Python bindings are defined in bindings/proxi_binding.cc.
  • CMakeLists.txt manages the C++ build process.
  • pyproject.toml and scikit-build-core handle the Python package build and C++ compilation.
  • Tests are located in tests/test_proxi_flat.py. Run them using unittest:
    python -m unittest tests/test_proxi_flat.py
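
A new test can follow the same unittest style. The sketch below is a hypothetical smoke test, not taken from tests/test_proxi_flat.py. It uses only the calls shown in the Quick Start and is skipped when proxi is not installed.

```python
import importlib.util
import unittest

HAS_PROXI = importlib.util.find_spec("proxi") is not None

@unittest.skipUnless(HAS_PROXI, "proxi is not installed")
class QuickStartSmokeTest(unittest.TestCase):
    def setUp(self):
        from proxi import ProxiFlat  # imported here so the skip works without proxi
        self.px = ProxiFlat(k=2, num_threads=2, objective_function="l2")
        self.px.index_data(
            [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
            ["doc_a", "doc_b", "doc_c", "doc_d"],
        )

    def test_find_indices_returns_two_nearest(self):
        # The query is equidistant from points 1 and 2, so compare order-insensitively.
        indices = self.px.find_indices([1.5, 1.5])
        self.assertEqual(sorted(indices), [1, 2])

    def test_find_docs_returns_matching_documents(self):
        docs = self.px.find_docs([1.5, 1.5])
        self.assertEqual(sorted(docs), ["doc_b", "doc_c"])
```

Saved under tests/, a file like this runs with the same python -m unittest command.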
    

License

Proxi is licensed under the Apache License, Version 2.0. See the LICENSE.txt file for details.

Contributing

Contributions are welcome! If you have suggestions, bug reports, or want to contribute code, please feel free to open an issue or submit a pull request.


Happy Searching with Proxi!
