Skip to main content

An efficient implementation of Vantage Point Tree for Python 3

Project description

Introduction

Pyvptree is a python library, internally built in C++, for efficient KNN search using metric distance function such as L2 distance (see VPTreeL2Index) or Hamming distances (VPTreeBinaryIndex).

How VP-Trees work

VP-Trees are binary trees that successively divide spaces in order to peform different types of tasks, such as Nearest Neighbor Search. It differs from Kd-Trees in the sense that they always partition the whole space, instead of invidiual dimensional axes, using a specific metric function and a selected "Vantage Point" that will be used as reference to allow spliting the dataset. For more details on how it works please access the following references:

Theoretical advantage of Vantage Points Trees compared to Kd-Trees

  • Intrinsic Dimensionality: VP-trees perform well in data with high intrinsic dimensionality, where the effective dimensionality is high. In contrast, kd-trees are efficient in low-dimensional spaces but their performance degrades quickly as the number of dimensions increases. This is often referred to as the "curse of dimensionality".

  • No Assumption about Axis-Alignment: Unlike kd-trees, which make a specific assumption about axis-alignment, VP-trees do not make such assumptions. This makes VP-trees potentially more robust to different kinds of data, especially when there is no natural way to align the axes with respect to the data.

  • Metric Spaces: VP-trees can handle general metric spaces (any space where a distance function is defined that satisfies the triangle inequality), not just Euclidean spaces. This makes them more adaptable to different problem settings where the distance metric might not be the standard Euclidean distance.

  • Handling Categorical Data: VP-trees, due to their use of arbitrary distance functions, can handle categorical data more naturally than kd-trees. While there are workarounds to use kd-trees with categorical data, they often require substantial tweaking and may not perform optimally. For categorical data, another reference structure is the BK-Tree, which is very efficient when comes to low dimensional data.

  • Balanced Tree Structure: VP-trees inherently try to create a balanced tree structure, which is beneficial for efficient searching. kd-trees can become unbalanced in certain situations, particularly with non-uniform data, leading to inefficient search operations.

It's important to notice, however, that practical implementation approaches such as using SIMD instructions or using highly optimized distance functions can strongly affect final performance in high dimensions. For instance, Faiss Library performs very efficiently in very high dimensions, where searches become near linear and exaustive, due to highly optimized code.

Typically spatial search structures tend to perform worse with increasing number of dimensions of dataset.

Installation

pip install .

Performance can dramatically decrease if this library is compiled without support to Open MP and AVX. This library was not tested under windows.

Requirement

This library needs OpenMP support to be built and installed. The whole compilation procces occur automatically by performing the installation step above.

Pickle serialization

vptree indices is pickle serializable:

import numpy as np
import pynear

np.random.seed(seed=42)

num_points = 20000
dimension = 32
num_queries = 2
data = np.random.rand(num_points, dimension).astype(dtype=np.uint8)

queries = np.random.rand(num_queries, dimension).astype(dtype=np.uint8)

vptree = pynear.VPTreeBinaryIndex()
vptree.set(data)

data = pickle.dumps(vptree)
recovered = pickle.loads(data)

Benchmarks

Several datasets were built and used for time benchmarks using L2 distance functions. Although Pyvptree does support Binary Index (VPTreeBinaryIndex) using hamming distance functions, the benchmarks focus on L2 distances since Pyvptree's binary index still need more optimizations to be able to compete with faiss in any way.

All benchmarks were generated using 12th Gen Intel(R) Core(TM) i7-1270P with 16 cores.

For each benchmark we use Faiss Library and Scikit-Learn as baseline of comparison.

The below benchmarks are for different values of K (1, 2, 4, 8, 16), comparing Faiss, Sklearn and Pyvptree.

Benchmarks are split into dimensionality ranges for better analysis.

To customize or regenerate the benchmarks as well as to see benchmark results, see benchmarks session.

Development

Running Python Tests

make test

Debugging and Running C++ Code on Unix

For debugging and running C++ code independently from python module, CMake config files are provided in pynear/CMakeLists.txt. For building and running C++ tests run:

make cpp-test

Since tests are built in Debug mode (default CMakeLists build mode), one can debug tests with gdb using built test binary:

gdb ./build/tests/vptree-tests

Debugging and Running C++ Code on Windows

Install CMake (for example py -m pip install cmake) and pybind11 (py -m pip install pybind11).

mkdir build
cd build
cmake ..\pynear

You may have to specify some arguments like the correct generator -G "Visual Studio 15 2017 Win64" or paths for Python -DPYTHON_EXECUTABLE="C:\Program Files\Python38\python.exe" and pybind11 -Dpybind11_DIR="C:\Program Files\Python38\Lib\site-packages\pybind11\share\cmake\pybind11" for CMake to work correctly.

Build generated files using Visual Studio (or whichever generator you chose) and run vptree-tests.exe.

Formating code

make fmt

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

pynear-0.1.0-pp39-pypy39_pp73-win_amd64.whl (105.3 kB view hashes)

Uploaded PyPy Windows x86-64

pynear-0.1.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (225.1 kB view hashes)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

pynear-0.1.0-pp39-pypy39_pp73-macosx_10_9_x86_64.whl (445.6 kB view hashes)

Uploaded PyPy macOS 10.9+ x86-64

pynear-0.1.0-cp311-cp311-win_amd64.whl (106.4 kB view hashes)

Uploaded CPython 3.11 Windows x86-64

pynear-0.1.0-cp311-cp311-musllinux_1_1_x86_64.whl (765.7 kB view hashes)

Uploaded CPython 3.11 musllinux: musl 1.1+ x86-64

pynear-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (243.7 kB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

pynear-0.1.0-cp311-cp311-macosx_10_9_x86_64.whl (445.4 kB view hashes)

Uploaded CPython 3.11 macOS 10.9+ x86-64

pynear-0.1.0-cp310-cp310-win_amd64.whl (105.6 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

pynear-0.1.0-cp310-cp310-musllinux_1_1_x86_64.whl (765.1 kB view hashes)

Uploaded CPython 3.10 musllinux: musl 1.1+ x86-64

pynear-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (242.3 kB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

pynear-0.1.0-cp310-cp310-macosx_10_9_x86_64.whl (444.1 kB view hashes)

Uploaded CPython 3.10 macOS 10.9+ x86-64

pynear-0.1.0-cp39-cp39-win_amd64.whl (105.6 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

pynear-0.1.0-cp39-cp39-musllinux_1_1_x86_64.whl (764.9 kB view hashes)

Uploaded CPython 3.9 musllinux: musl 1.1+ x86-64

pynear-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (242.7 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

pynear-0.1.0-cp39-cp39-macosx_10_9_x86_64.whl (444.1 kB view hashes)

Uploaded CPython 3.9 macOS 10.9+ x86-64

pynear-0.1.0-cp38-cp38-win_amd64.whl (105.5 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

pynear-0.1.0-cp38-cp38-musllinux_1_1_x86_64.whl (764.6 kB view hashes)

Uploaded CPython 3.8 musllinux: musl 1.1+ x86-64

pynear-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (242.2 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

pynear-0.1.0-cp38-cp38-macosx_10_9_x86_64.whl (443.9 kB view hashes)

Uploaded CPython 3.8 macOS 10.9+ x86-64

pynear-0.1.0-cp37-cp37m-win_amd64.whl (106.0 kB view hashes)

Uploaded CPython 3.7m Windows x86-64

pynear-0.1.0-cp37-cp37m-musllinux_1_1_x86_64.whl (770.4 kB view hashes)

Uploaded CPython 3.7m musllinux: musl 1.1+ x86-64

pynear-0.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (246.2 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64

pynear-0.1.0-cp37-cp37m-macosx_10_9_x86_64.whl (443.3 kB view hashes)

Uploaded CPython 3.7m macOS 10.9+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page