Skip to main content

C++ extension implementing nearest neighbour descent

Project description

Nearest Neighbor Descent (nndescent)

Nearest Neighbor Descent (nndescent) is a C++ implementation of the nearest neighbor descent algorithm, designed for efficient and accurate approximate nearest neighbor search. With seamless integration into Python, it offers a powerful solution for constructing k-nearest neighbor graphs. This algorithm is based on the pynndescent library.

Features

  • Seamless integration into Python and effortless installation using pip.
  • The handling of nndescent is very similar to that of pynndescent.
  • Pure C++11 implementation utilizing OpenMP for parallel computation. No other external libraries are needed.
  • Currently tested only on Linux.
  • Both dense and sparse matrices are supported.
  • Implementation of multiple distance functions, i.e.
    • Bray-Curtis
    • Canberra
    • Chebyshev
    • Circular Kantorovich (no sparse verion)
    • Correlation
    • Cosine
    • Dice
    • Dot
    • Euclidean
    • Hamming
    • Haversine
    • Hellinger
    • Hellinger
    • Jaccard
    • Jensen-Shannon
    • Kulsinski
    • Manhattan
    • Matching
    • Minkowski
    • Rogers-Tanimoto
    • Russell-Rao
    • Sokal-Michener
    • Sokal-Sneath
    • Spearman's Rank Correlation (no sparse version)
    • Symmetric KL Divergence
    • TSSS
    • True Angular
    • Wasserstein 1D (no sparse version)
    • Yule

Please note that not all distances have undergone thorough testing. Therefore, it is advised to use them with caution and at your own discretion.

Installation

From PyPI

You can install nndescent directly from PyPI using pip:

pip install nndescent

If you want to run the examples in tests, additional packages are needed. You can install them manually or install nndescent with the full option:

pip install nndescent[full]

From Source

  1. Clone the repository:
git clone https://github.com/brj0/nndescent.git
cd nndescent
  1. Build and install the package:
pip install .

If you want to run the examples in tests, additional packages are needed. You can install them manually or install nndescent with the full option:

pip install .[full]
  1. To run the examples in tests you must first download the datasets:
python tests/make_test_data.py

Usage

In Python you can utilize the nndescent library in the following way:

import numpy as np
import nndescent

# Data must be a 2D numpy array of dtype 'float32'.
data = np.random.randint(50, size=(20,3)).astype(np.float32)

# Run NND algorithm
nnd = nndescent.NNDescent(data, n_neighbors=4)

# Get result
nn_indices, nn_distances = nnd.neighbor_graph

# Query data must be a 2D numpy array of dtype 'float32'.
query_data = np.random.randint(50, size=(5,3)).astype(np.float32)

# Calculate nearest neighbors for each query point
nn_query_indices, nn_query_distances = nnd.query(query_data, k=6)

To compile and run the C++ examples use the following commands within the project folder:

mkdir build
cd build
cmake ..
make
./simple

For detailed usage in C++ and for further Python/C++ examples please refer to the examples provided in the tests directory of the repository and the code documentation.

Performance

On my computer, the training phase of nndescent is approximately 15% faster than pynndescent for dense matrices, and 75% faster for sparse matrices. Furthermore, the search query phase shows a significant improvement, with >70% faster execution time. Below is the output obtained from running tests/benchmark.py, an ad hoc benchmark test. In this test, both nndescent and pynndescent were executed with the same parameters using either 'euclidean' or 'dot' as metric:

Benchmark test pynndescent (py) vs nndescent (c)

Data set py train [ms] c train [ms] ratio py vs c match py test [ms] c test [ms] ratio py accuracy c accuracy
faces 149.8 145.9 0.974 1.000 1663.7 18.4 0.011 1.000 0.999
fmnist 11959.2 10768.7 0.900 0.997 5754.8 1334.1 0.232 0.978 0.978
glove25 149754.2 101864.0 0.680 0.964 98740.6 9907.4 0.100 0.796 0.808
glove50 192965.8 137171.8 0.711 0.885 99750.8 10647.7 0.107 0.705 0.743
glove100 218202.9 180088.4 0.825 0.815 98770.2 12080.4 0.122 0.651 0.731
glove200 287206.6 243466.6 0.848 0.772 101639.4 17615.6 0.173 0.622 0.773
mnist 11319.7 10188.1 0.900 0.997 5725.9 1273.8 0.222 0.969 0.968
nytimes 63323.8 55638.1 0.879 0.814 23632.1 7108.9 0.301 0.614 0.811
sift 131711.4 105826.0 0.803 0.974 82503.7 7957.9 0.096 0.838 0.839
20newsgroups 107339.0 28339.7 0.264 0.922 67518.6 22513.1 0.333 0.858 0.929

The compilation time and the lengthy numba loading time during runtime and import for 'pynndescent' are not considered in this ad hoc benchmark test. An Ann-Benchmarks wrapper is planned for the future.

Background

The theoretical background of NND is based on the following paper:

In addition, the algorithm utilizes random projection trees for initializing the nearest neighbor graph. The nndescent algorithm constructs a tree by randomly selecting two points and splitting the data along a hyperplane passing through their midpoint. For a more theoretical background, please refer to:

Contributing

Contributions are welcome! If you have any bug reports, feature requests, or suggestions, please open an issue or submit a pull request.

License

This project is licensed under the BSD-2-Clause license.

Acknowledgements

This implementation is based on the original pynndescent library by Leland McInnes. I would like to acknowledge and appreciate his work as a source of inspiration for this project.

For more information, visit the pynndescent GitHub repository.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nndescent-1.0.4.tar.gz (42.6 kB view details)

Uploaded Source

File details

Details for the file nndescent-1.0.4.tar.gz.

File metadata

  • Download URL: nndescent-1.0.4.tar.gz
  • Upload date:
  • Size: 42.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for nndescent-1.0.4.tar.gz
Algorithm Hash digest
SHA256 19b7ae61b8e8abce8cece691208e38abb44804a3c5d9fbf1dd541c1493c18bb0
MD5 f877287c1ee9bcf784bb76213401b200
BLAKE2b-256 2624cccc96ed7911fa438f9ad6682484f76231089707a01455cfd9516f6f8c3e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page