Proxi: Accelerating nearest-neighbor search for high-dimensional data!
Proxi: Fast Nearest Neighbor Search
Proxi is a high-performance C++ library with Python bindings, designed to accelerate nearest-neighbor search for high-dimensional data. Whether you're working on semantic search, recommendation systems, anomaly detection, or any application requiring fast similarity searches, Proxi offers an efficient and easy-to-use solution, currently optimized for Linux environments.
Key Features
- Fast Performance: Leverages C++ for core computations and OpenMP for parallel processing to deliver high-speed k-NN searches.
- Multiple Distance Metrics: Supports common distance functions:
  - Euclidean (L2)
  - Manhattan (L1)
  - Cosine Similarity
- Python-Friendly API: Easy-to-use Python bindings powered by pybind11, making integration into your Python projects seamless. The main indexing and search functionality is available through the ProxiFlat class.
- Batched Operations: Efficiently process multiple queries at once with batched search methods.
- Simple Indexing: Straightforward data indexing process.
- Lightweight: Minimal dependencies, focused on delivering core k-NN functionality efficiently.
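For reference, the three metrics listed above can be written in a few lines of NumPy. This is a minimal sketch for checking results by hand, not Proxi's C++ implementation; the function names are illustrative, and the README does not state how the "cos" objective ranks neighbors internally (for example, by similarity or by one minus similarity).

```python
import numpy as np


def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def l1_distance(a, b):
    """Manhattan (L1) distance: sum of the absolute differences."""
    return float(np.sum(np.abs(a - b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(l2_distance(a, b))        # sqrt(2), about 1.4142
print(l1_distance(a, b))        # 2.0
print(cosine_similarity(a, b))  # 0.0 (orthogonal vectors)
```

Note that L1 and L2 are distances (smaller means closer), while cosine similarity is a similarity (larger means closer).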
Why Proxi?
Searching for similar items in large, high-dimensional datasets is a common challenge. Traditional methods can be slow and computationally expensive. Proxi tackles this by:
- Providing optimized C++ implementations of search algorithms.
- Utilizing parallel processing to speed up computations on multi-core processors.
- Offering a simple API that doesn't require deep expertise in low-level programming.
Installation
Proxi is built from source. Ensure you are in a Linux environment for optimal compatibility.
Prerequisites
- A C++ compiler supporting C++20 (e.g., GCC, Clang)
- CMake (version 3.15 or higher)
- Python (version 3.8 or higher)
- OpenMP (usually included with GCC; may require separate installation for Clang, e.g., sudo apt-get install libomp-dev on Debian/Ubuntu)
Building from Source
1. Clone the repository:
   git clone https://github.com/BiradarSiddhant02/Proxi.git
   cd Proxi
2. Set up a Python virtual environment (recommended):
   python3 -m venv .venv
   source .venv/bin/activate
3. Install build dependencies:
   pip install -r requirements.txt
   (Ensure requirements.txt includes scikit-build-core and pybind11.)
4. Build and install Proxi:
   pip install .
   For development, you can use an editable install:
   pip install -e .
   This command invokes scikit-build-core, which uses CMake to compile the C++ core and create the Python extension.
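After the build finishes, a quick way to confirm that the extension is visible to the active interpreter is to look the module up without importing it. The sketch below uses only the standard library; the helper name is_installed is illustrative, and the module name proxi is taken from the Quick Start import.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once `pip install .` has succeeded in the active environment.
print("proxi available:", is_installed("proxi"))
```

If this prints False inside the virtual environment, the install step most likely ran against a different interpreter.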
Quick Start
Here's a simple example of how to use Proxi in Python with the ProxiFlat class:
from proxi import ProxiFlat
import numpy as np
# 1. Sample data
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0]
], dtype=np.float32)
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]
# 2. Initialize ProxiFlat
# Parameters: k (number of neighbors), num_threads, objective_function ("l2", "l1", or "cos")
px = ProxiFlat(k=2, num_threads=2, objective_function="l2")
# 3. Index your data
px.index_data(embeddings.tolist(), doc_ids) # ProxiFlat expects lists of lists for embeddings
# 4. Prepare a query vector
query_vector = np.array([1.5, 1.5], dtype=np.float32)
# 5. Find nearest neighbor indices
indices = px.find_indices(query_vector.tolist())
print(f"Indices of nearest neighbors: {indices}")
# Example output: Indices of nearest neighbors: [1, 2] (or [2, 1] depending on exact distances)
# 6. Find nearest neighbor documents
docs = px.find_docs(query_vector.tolist())
print(f"Nearest documents: {docs}")
# Example output: Nearest documents: ['doc_b', 'doc_c']
# 7. Batched queries (for multiple queries at once)
query_batch = np.array([
    [0.5, 0.5],
    [2.5, 2.5]
], dtype=np.float32)
batch_indices = px.find_indices_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch indices: {batch_indices}")
# Example output: Batch indices: [[0, 1], [2, 3]]
batch_docs = px.find_docs_batched(query_batch.tolist()) # Pass as list of lists
print(f"Batch documents: {batch_docs}")
# Example output: Batch documents: [['doc_a', 'doc_b'], ['doc_c', 'doc_d']]
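The example outputs above can be cross-checked without Proxi by running an exhaustive search in NumPy. The sketch below is a reference implementation under stated assumptions, not Proxi's code: brute_force_knn is an illustrative name, it covers only the L2 metric, and it breaks ties by preferring the lower index, which Proxi may or may not do.

```python
import numpy as np


def brute_force_knn(data, query, k):
    """Return the indices of the k rows of `data` closest to `query` (L2)."""
    dists = np.linalg.norm(data - query, axis=1)
    # A stable sort keeps equal distances in index order, so ties are deterministic.
    return np.argsort(dists, kind="stable")[:k].tolist()


embeddings = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [2.0, 2.0],
    [3.0, 3.0],
], dtype=np.float32)
doc_ids = ["doc_a", "doc_b", "doc_c", "doc_d"]

query = np.array([1.5, 1.5], dtype=np.float32)
idx = brute_force_knn(embeddings, query, k=2)
print(idx)                        # [1, 2]
print([doc_ids[i] for i in idx])  # ['doc_b', 'doc_c']

# Batched case: run the same search once per query row.
query_batch = np.array([[0.5, 0.5], [2.5, 2.5]], dtype=np.float32)
batch_idx = [brute_force_knn(embeddings, q, k=2) for q in query_batch]
print(batch_idx)                  # [[0, 1], [2, 3]]
```

In this toy dataset each query is exactly equidistant from its two nearest points, which is why the order within each result pair can legitimately differ between implementations.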
Benchmarking
Proxi includes scripts to benchmark its performance and to generate sample data.
1. Generate Mock Data
Use the scripts/make_data.py script to create synthetic datasets for benchmarking:
python scripts/make_data.py --N 10000 --D 128 --X_path X_data.npy --docs_path docs_data.npy
This will generate X_data.npy (feature vectors) and docs_data.npy (document identifiers).
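If you want to produce files of the same shape without the script, something like the following would work. This is a hypothetical sketch and not the contents of scripts/make_data.py: the random distribution, the seed, the doc_<i> identifier scheme, and the helper name make_mock_data are all assumptions.

```python
import os
import tempfile

import numpy as np


def make_mock_data(n, d, seed=0):
    """Return (X, docs): n random float32 vectors of dimension d and n string ids."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)).astype(np.float32)
    docs = np.array([f"doc_{i}" for i in range(n)])
    return X, docs


X, docs = make_mock_data(1000, 128)

# Write to a temporary directory so the sketch does not clobber real data files.
out_dir = tempfile.mkdtemp()
x_path = os.path.join(out_dir, "X_data.npy")
docs_path = os.path.join(out_dir, "docs_data.npy")
np.save(x_path, X)
np.save(docs_path, docs)

X_loaded = np.load(x_path)
docs_loaded = np.load(docs_path)
print(X_loaded.shape, X_loaded.dtype)  # (1000, 128) float32
print(docs_loaded[:3])                 # ['doc_0' 'doc_1' 'doc_2']
```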
2. Run Proxi Benchmark
Use scripts/bench_proxi.py to test Proxi's performance:
python scripts/bench_proxi.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2
Adjust -k (number of neighbors), --threads, and --objective as needed.
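For ad hoc measurements outside the benchmark script, a small timing harness is often enough. The sketch below is an assumption about how one might measure throughput, not what scripts/bench_proxi.py does; time_queries and numpy_search are illustrative names, and the NumPy search stands in for any callable that takes a single query vector.

```python
import time

import numpy as np


def time_queries(search_fn, queries, repeats=3):
    """Return the best wall-clock time (seconds) to run search_fn over all queries."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q)
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)


def numpy_search(q, k=5):
    """Exhaustive L2 search; returns the k nearest indices in no particular order."""
    dists = np.linalg.norm(data - q, axis=1)
    return np.argpartition(dists, k)[:k]


elapsed = time_queries(numpy_search, queries)
print(f"{len(queries) / elapsed:.0f} queries/second")
```

To time Proxi with the same harness, pass a wrapper such as lambda q: px.find_indices(q.tolist()), where px is an indexed ProxiFlat instance as in the Quick Start.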
3. Run FAISS Benchmark (for comparison)
If you have FAISS installed (pip install faiss-cpu or faiss-gpu), you can run a comparative benchmark:
python scripts/bench_faiss.py --X_path X_data.npy --docs_path docs_data.npy -k 5 --threads 4 --objective l2
Interactive Inference Example
Proxi includes an interactive script examples/inference.py that allows you to perform similarity searches on your own data and compare results with FAISS.
1. Download Embeddings and Corresponding Text
To use the inference script, you first need a dataset of embeddings and the corresponding text/words they represent.
For a demonstration, you can download pre-computed embeddings and words:
- Embeddings and words: https://www.kaggle.com/datasets/siddhantbiradar/proxi-live-inference-dataset
Download the zip file and unzip it.
2. Run the Inference Script
Navigate to the Proxi directory and run the script from your terminal:
python examples/inference.py --embeddings /path/to/your/embeddings.npy --words /path/to/your/words.npy -k 5
Arguments:
- --embeddings: Path to the .npy file containing your numerical embeddings.
- --words: Path to the .npy file containing the corresponding text entries.
- -k: The number of nearest neighbors to retrieve for each query.
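A command-line interface with these three arguments could be declared with argparse roughly as follows. This is a sketch of the argument shape only, not the actual parser in examples/inference.py; marking the paths as required and defaulting -k to 5 are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the three documented arguments."""
    parser = argparse.ArgumentParser(description="Interactive similarity search")
    parser.add_argument("--embeddings", required=True,
                        help="Path to the .npy file with numerical embeddings")
    parser.add_argument("--words", required=True,
                        help="Path to the .npy file with the corresponding text entries")
    parser.add_argument("-k", type=int, default=5,
                        help="Number of nearest neighbors to retrieve per query")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--embeddings", "embeddings.npy", "--words", "words.npy", "-k", "5"]
)
print(args.embeddings, args.words, args.k)  # embeddings.npy words.npy 5
```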
3. Interactive Search
Once the script loads the data and builds the Proxi and FAISS indexes, it will prompt you to enter a word or phrase:
Loading data...
Loaded 384000 embeddings with dimension 384
Building Proxi index...
Building FAISS index...
Loading sentence transformer model...
==================================================
Enter a word or phrase (or 'quit' to exit): your search query
Type your query and press Enter. The script will then display a table comparing the top-k results from Proxi and FAISS.
Similarity search results for: 'your search query'
+--------+-----------------+-----------------+
| Rank | Proxi Results | FAISS Results |
+========+=================+=================+
| 1 | result_proxi_1 | result_faiss_1 |
| 2 | result_proxi_2 | result_faiss_2 |
| ... | ... | ... |
+--------+-----------------+-----------------+
Enter 'quit' to exit the script.
This example provides a hands-on way to see Proxi in action.
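When comparing the two columns of the results table, it can help to reduce them to a single agreement score. The sketch below computes the fraction of one result list that also appears in the other; overlap_at_k is an illustrative helper that is not part of Proxi or FAISS, and the word lists are made-up placeholders.

```python
def overlap_at_k(results_a, results_b):
    """Fraction of items in results_a that also appear in results_b (order ignored)."""
    if not results_a:
        return 0.0
    return len(set(results_a) & set(results_b)) / len(results_a)


# Placeholder result lists standing in for the Proxi and FAISS columns.
proxi_results = ["cat", "dog", "bird"]
faiss_results = ["dog", "cat", "fish"]
print(f"{overlap_at_k(proxi_results, faiss_results):.2f}")  # 0.67
```

For exhaustive (flat) indexes over the same data and metric, an overlap below 1.0 usually points to tie-breaking differences or a metric mismatch rather than an approximation error.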
Building and Development
- The core indexing and search logic is implemented in C++ within the ProxiFlat class (src/proxi_flat.cc, include/proxi_flat.h).
- Helper functions for distance calculations are in include/distance.hpp.
- Python bindings are defined in bindings/proxi_binding.cc.
- CMakeLists.txt manages the C++ build process.
- pyproject.toml and scikit-build-core handle the Python package build and C++ compilation.
- Tests are located in tests/test_proxi_flat.py. Run them using unittest: python -m unittest tests/test_proxi_flat.py
License
Proxi is licensed under the Apache License, Version 2.0. See the LICENSE.txt file for details.
Contributing
Contributions are welcome! If you have suggestions, bug reports, or want to contribute code, please feel free to open an issue or submit a pull request.
Happy Searching with Proxi!