An efficient implementation of Vantage Point Tree for Python 3
Project description
Introduction
Pyvptree is a python library, internally built in C++, for efficient KNN search using metric distance function such as L2 distance (see VPTreeL2Index) or Hamming distances (VPTreeBinaryIndex).
How VP-Trees work
VP-Trees are binary trees that successively divide spaces in order to peform different types of tasks, such as Nearest Neighbor Search. It differs from Kd-Trees in the sense that they always partition the whole space, instead of invidiual dimensional axes, using a specific metric function and a selected "Vantage Point" that will be used as reference to allow spliting the dataset. For more details on how it works please access the following references:
- https://en.wikipedia.org/wiki/Vantage-point_tree
- https://fribbels.github.io/vptree/writeup
- Probabilistic analysis of vantage point trees
Theoretical advantage of Vantage Points Trees compared to Kd-Trees
-
Intrinsic Dimensionality: VP-trees perform well in data with high intrinsic dimensionality, where the effective dimensionality is high. In contrast, kd-trees are efficient in low-dimensional spaces but their performance degrades quickly as the number of dimensions increases. This is often referred to as the "curse of dimensionality".
-
No Assumption about Axis-Alignment: Unlike kd-trees, which make a specific assumption about axis-alignment, VP-trees do not make such assumptions. This makes VP-trees potentially more robust to different kinds of data, especially when there is no natural way to align the axes with respect to the data.
-
Metric Spaces: VP-trees can handle general metric spaces (any space where a distance function is defined that satisfies the triangle inequality), not just Euclidean spaces. This makes them more adaptable to different problem settings where the distance metric might not be the standard Euclidean distance.
-
Handling Categorical Data: VP-trees, due to their use of arbitrary distance functions, can handle categorical data more naturally than kd-trees. While there are workarounds to use kd-trees with categorical data, they often require substantial tweaking and may not perform optimally. For categorical data, another reference structure is the BK-Tree, which is very efficient when comes to low dimensional data.
-
Balanced Tree Structure: VP-trees inherently try to create a balanced tree structure, which is beneficial for efficient searching. kd-trees can become unbalanced in certain situations, particularly with non-uniform data, leading to inefficient search operations.
It's important to notice, however, that practical implementation approaches such as using SIMD instructions or using highly optimized distance functions can strongly affect final performance in high dimensions. For instance, Faiss Library performs very efficiently in very high dimensions, where searches become near linear and exaustive, due to highly optimized code.
Typically spatial search structures tend to perform worse with increasing number of dimensions of dataset.
Installation
pip install .
Performance can dramatically decrease if this library is compiled without support to Open MP and AVX. This library was not tested under windows.
Requirement
This library needs OpenMP support to be built and installed. The whole compilation procces occur automatically by performing the installation step above.
Pickle serialization
vptree indices is pickle serializable:
import numpy as np
import pynear
np.random.seed(seed=42)
num_points = 20000
dimension = 32
num_queries = 2
data = np.random.rand(num_points, dimension).astype(dtype=np.uint8)
queries = np.random.rand(num_queries, dimension).astype(dtype=np.uint8)
vptree = pynear.VPTreeBinaryIndex()
vptree.set(data)
data = pickle.dumps(vptree)
recovered = pickle.loads(data)
Benchmarks
Several datasets were built and used for time benchmarks using L2 distance functions. Although Pyvptree does support Binary Index (VPTreeBinaryIndex) using hamming distance functions, the benchmarks focus on L2 distances since Pyvptree's binary index still need more optimizations to be able to compete with faiss in any way.
All benchmarks were generated using 12th Gen Intel(R) Core(TM) i7-1270P with 16 cores.
For each benchmark we use Faiss Library and Scikit-Learn as baseline of comparison.
The below benchmarks are for different values of K (1, 2, 4, 8, 16), comparing Faiss, Sklearn and Pyvptree.
Benchmarks are split into dimensionality ranges for better analysis.
To customize or regenerate the benchmarks as well as to see benchmark results, see benchmarks session.
Development
Running Python Tests
make test
Debugging and Running C++ Code on Unix
For debugging and running C++ code independently from python module, CMake config files are provided in pynear/CMakeLists.txt. For building and running C++ tests run:
make cpp-test
Since tests are built in Debug mode (default CMakeLists build mode), one can debug tests with gdb using built test binary:
gdb ./build/tests/vptree-tests
Debugging and Running C++ Code on Windows
Install CMake (for example py -m pip install cmake
) and pybind11 (py -m pip install pybind11
).
mkdir build
cd build
cmake ..\pynear
You may have to specify some arguments like the correct generator -G "Visual Studio 15 2017 Win64"
or paths for Python -DPYTHON_EXECUTABLE="C:\Program Files\Python38\python.exe"
and pybind11 -Dpybind11_DIR="C:\Program Files\Python38\Lib\site-packages\pybind11\share\cmake\pybind11"
for CMake to work correctly.
Build generated files using Visual Studio (or whichever generator you chose) and run vptree-tests.exe
.
Formating code
make fmt
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for pynear-0.1.0-pp39-pypy39_pp73-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | be4de7a0d9e6fad5c6d9f269665df64b8c6f2d1dd8d25ea7e6521d95eccfa3e5 |
|
MD5 | ec51a3ceb1013aff46dcf5487c12e291 |
|
BLAKE2b-256 | ac8bbb94afd395a3ba95265279489bde7d94821a6273b01d5685e5fd9901bd33 |
Hashes for pynear-0.1.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d6cd16299ddd140fbd60f0bed829e410099d6010b2c300170761a2635df93433 |
|
MD5 | 9771f43056fed2ca6d28082b878eeb0b |
|
BLAKE2b-256 | 613f0cc9646e4fee2e19ced0e1ca0668f9d7a269fe5d0b8f43717aa94b94faa8 |
Hashes for pynear-0.1.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 176a05dfef7730fde3a6f5a868d9d1fc4095a560cd6532c6a0d2e1ff2299672a |
|
MD5 | cb29c5ee8d9aadf4545cdb666b2bd9ef |
|
BLAKE2b-256 | 109e7e6665127fe5248ab25be37dfed61d906d332c4ea3a044f5f0fb63b18bb6 |
Hashes for pynear-0.1.0-cp311-cp311-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ae74206f8dc33d3bf17d44cf0cf6efd70464bdc726f3de4571914195fe38962e |
|
MD5 | 97c3120bd16e02fa90d86429797951d8 |
|
BLAKE2b-256 | 8fd20aa1324f4bd941cb28203aea8016acb80a060b1455e79a8ecfe79b503580 |
Hashes for pynear-0.1.0-cp311-cp311-musllinux_1_1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e0b71dd9a75de087b69ed7a35d7052d435c902361f8463b0568a948111889965 |
|
MD5 | 90e4260ce21d834638d234daa2c38a44 |
|
BLAKE2b-256 | 9302487da1d158c8b527a201de50f167d1d8f63fc47fcf33d70aae9671c4d560 |
Hashes for pynear-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8295cf84fa170dda90677ca9838e3bb1deb016f9b6f4bae7a60d1dc56048af6e |
|
MD5 | 0e03fdce0dd58fcec754a8d1c1ba79f4 |
|
BLAKE2b-256 | b711b1073df4a041ef6eb79ffc45bff4a40cf861c19d0ec2e513e2d5de63b54e |
Hashes for pynear-0.1.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a8178d22fbad37b07bebdbe2c4116f3b5728e4a25ce4a4cd21094883410af146 |
|
MD5 | 47bf20851d17ce901f1f77b955b3d0a8 |
|
BLAKE2b-256 | feb6742e05cec47b7253e0c7780ab877dcfdb9b26ca0d8c4c4ae82386531c336 |
Hashes for pynear-0.1.0-cp310-cp310-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 206643c5ca218d9f7823f5f3096d6488bb0c007bacbec19bca8473c579ec7f49 |
|
MD5 | 5f47722f325289e96e2ab52eff71066c |
|
BLAKE2b-256 | 177d0b84362ab838d4f1b6034b82127739ddcc3570854b71a5efd1e0213c2442 |
Hashes for pynear-0.1.0-cp310-cp310-musllinux_1_1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 99398048fb50cba2f8fa9a29350506e026ef1bbe2b9a1912de8e3845da928185 |
|
MD5 | 9f68126d4fe04c1c2f2c6bb6fcb3740f |
|
BLAKE2b-256 | d9fc434dd5bb9889baea921bdcf882e959d0ceb0c848ac40c571f9cc9cbd43a5 |
Hashes for pynear-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 373fede4e2e13fed4365e9e911fbe213061b191681c2e1d7a4e4f6c1b80e9fc8 |
|
MD5 | 719857944d1ec89fafea1d70ba4e3dbe |
|
BLAKE2b-256 | b0e0887d0f982ec792318dbc8ac215482bdae660ebd110f5fd2d3dbb272c55bb |
Hashes for pynear-0.1.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2fff8b699fcc2d6703602e4478d596d592cb0d67358404b62a246031532a523f |
|
MD5 | 3da8ca559b55ad49fbb5ede4b0175e8a |
|
BLAKE2b-256 | ec0f31590209b162eca5ddec3e8b5e3e7172e9365e78a9c51737920634947645 |
Hashes for pynear-0.1.0-cp39-cp39-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d5da5660ad58eb87d4e273686786e4e5283c9946b25a986c21bd136031e7b800 |
|
MD5 | 2ba4270d7e54c60b30ea8fe19c9d1244 |
|
BLAKE2b-256 | 126ff03b20cf4a327ef20f07100b1062f6f7969481c5fdb4b8df5be19e786c2a |
Hashes for pynear-0.1.0-cp39-cp39-musllinux_1_1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6caa6489c3464c8d92d422fb489e550db5bc31c9234b2dddcec61a41ec96ae0a |
|
MD5 | 06cbdf80ca02252f07515c9d47d6f424 |
|
BLAKE2b-256 | fce80bc483b036ec77e5a56044e18eb728df767a24d077c748b69a353d6a0f20 |
Hashes for pynear-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 28b63d05fe6efc732357bcf14738d306bcb7f8954bc7085f12fe348ed7c8900a |
|
MD5 | ddc94f73dc5376416d47d95162d9efb8 |
|
BLAKE2b-256 | b3cb6cd36e01af09f4d62e3a2c04133548d4a2692affcae7e439e6e0b9c78567 |
Hashes for pynear-0.1.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7d9fe9b47b16f8fd50758924549fc3c55d086fbd94f0247f21b0c9dcb7652e8d |
|
MD5 | df47cce3ce807495a874ab6e43de83c8 |
|
BLAKE2b-256 | fa34bc31f3b5772eace6e5e3cd68c67a2dcc040eebdf3418c80e886ce7e51648 |
Hashes for pynear-0.1.0-cp38-cp38-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1c9b87d5aaa64e9203cf763606ac9f2cd012b5cc24b636add308c9e031df425e |
|
MD5 | 65abf8b9b264b6efdf03c50d66e406bf |
|
BLAKE2b-256 | af243c9968ccbd69d22bb00eaeb0b0de1e6893f2a58b628be1ae1c80375f037c |
Hashes for pynear-0.1.0-cp38-cp38-musllinux_1_1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9c37d5fb4f2ba9d7ed0d816b527c5051ca06721cc45f81b0d075712b94d22a75 |
|
MD5 | 058fc8859ae7202eb013c252eae7f5bc |
|
BLAKE2b-256 | 26a7a65652df797eb485267391d9e62d0664ad165f18a89c109761b160eec64d |
Hashes for pynear-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | bc29e3b857fb8173d4ced47d249dfee253dc3b69b7a4dabf63ac2036307010f3 |
|
MD5 | 9e5e1c1509f41520731790fb0819dadb |
|
BLAKE2b-256 | d522ea268ee688738b00149aada7f63cbd833c8092994920d7cee6dceb66690c |
Hashes for pynear-0.1.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 13c24e8ae7317b5d568fef00460e8ae8aca50057c0d21bfd306d3a8aa84ec684 |
|
MD5 | 38dd1d9b27888dea004d2dc03be6b03e |
|
BLAKE2b-256 | f8a385fc784e0e8b3ae8af06103a0708d938c9cd22522dfdde8eccdb69f0d4a8 |
Hashes for pynear-0.1.0-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | fd5ea6cb0a5c684f596061b245fe35a5780032f208cb792a275340bb7b4fa920 |
|
MD5 | 91dc5d04960195e092cb7164f0e2bc15 |
|
BLAKE2b-256 | 68e3f3b1d7783a4ca7f5c40900ec5e365c7eb0798ae0843ab10906c4652b4fde |
Hashes for pynear-0.1.0-cp37-cp37m-musllinux_1_1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e6c09f1645a6a8ace4f03073ee6061954cede967eefff604b42d01bb573ad922 |
|
MD5 | 1596c4e2cc371799e30645d93ccf16f8 |
|
BLAKE2b-256 | 6503b5f2a50b49613fb59e8a78382cd47ff233b9ff46ed5c78554125af7eeb16 |
Hashes for pynear-0.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1aaff07f2f3d9ec07a857353cc6ae502ed7ee8b179ff5948d53f6f058032fc52 |
|
MD5 | 453856afff5099d5b900201ddb09cbe6 |
|
BLAKE2b-256 | e090adbf794d5fe3f16f3a955cce725c089e43e4ab221d575591583811e1685f |
Hashes for pynear-0.1.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 992686efed09973bfb7e71d3adad86779f7b8611e84c4a05673ff324e73bae63 |
|
MD5 | 2b0c764b7f81e570795def390e7ab9f7 |
|
BLAKE2b-256 | 697c9e3f9f1114b5d870f0a99bcfac8b7d8f2af9c7cb7f7cf102721c40c0995d |