Skip to main content

A performant graph-based kNN search library with re-ordering.

Project description

FlatNav

FlatNav is a fast and header-only graph-based index for Approximate Nearest Neighbor Search (ANNS). FlatNav is inspired by the influential Hierarchical Navigable Small World (HNSW) index, but with the hierarchical component removed. As detailed in our research paper, we found that FlatNav achieved identical performance to HNSW on high-dimensional datasets (dimensionality > 32) with approximately 38% less peak memory consumption and a simplified implementation.

We hope to maintain this open source library as a resource for broader community. Please consider opening a Github Issue for bugs and feature requests, or get in touch with us directly for discussions.

Installation

FlatNav is implemented in C++ with a complete Python extension with cereal as the only external dependency. This is a header-only library, so there is nothing to build. You can just include the necessary headers in your existing code.

FlatNav is supported on x86-64 machines on linux and MacOS (we can extend this to windows if there is sufficient interest). To get the C++ library working and run examples under the tools directory, you will need

  • C++17 compiler with OpenMP support (version >= 2.0)
  • CMake (version >= 3.14)

We provide some helpful scripts for installing the above in the bin directory.

To generate the library with CMake and compile examples, run

$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ ./bin/build.sh -e

You can get all options available with the build.sh script by passing it the -h argument.

This will display all available build options:

Usage ./build.sh [OPTIONS]

Available Options:
  -t, --tests:                    Build tests
  -e, --examples:                 Build examples
  -v, --verbose:                  Make verbose
  -b, --benchmark:                Build benchmarks
  -bt, --build_type:              Build type (Debug, Release, RelWithDebInfo, MinSizeRel)
  -nmv, --no_simd_vectorization:Disable SIMD instructions
  -h, --help:                     Print this help message

Example Usage:
  ./build.sh -t -e -v

To build the Python bindings, follow instructions here. There are also examples for how to use the library to build an index and run queries on top of it here.

Support for SIMD Extensions

We currently support SIMD extensions for certain platforms as detailed below.

Operation x86_64 arm64v8 Apple silicon
FP32 Inner product SSE, AVX, AVX512 No SIMD support No SIMD support
FP32 L2 distance SSE, AVX, AVX512 No SIMD support No SIMD support
UINT8 L2 distance AVX512 No SIMD support No SIMD support
INT8 L2 distance SSE No SIMD support No SIMD support

Getting Started in Python

Currently, we support Python wheels for versions 3.8 through 3.12 on x86_64 architectures (Intel, AMD and MacOS). Support for ARM wheels is a future improvement.

The python library can be installed from PyPI by using

$ pip install flatnav

Similarly, flatnav can be installed from source via cibuildwheel, which builds cross-platform wheels. Follow the following steps

$ git clone https://github.com/BlaiseMuhirwa/flatnav.git --recurse-submodules
$ cd flatnav
$ make install-cibuildwheel

# This will build flatnav for the current version in your environment. If you want to build wheels 
# for all supported python versions (3.8 to 3.12), remove the --current-version flag.
$ ./cibuild.sh --current-version

$ pip install wheelhouse/flatnav*.whl --force-reinstall

Once you have the python library installed and you have a dataset you want to index as a numpy array, you can construct the index as shown below. This will allocate memory and create a directed graph with vectors as nodes.

import numpy as np
import flatnav
from flatnav.data_type import DataType 

# Get your numpy-formatted dataset.
dataset_size = 1_000_000
dataset_dimension = 128
dataset_to_index = np.random.randn(dataset_size, dataset_dimension)

# Define index construction parameters.
distance_type = "l2"
max_edges_per_node = 32
ef_construction = 100
num_build_threads = 16

# Create index configuration and pre-allocate memory
index = flatnav.index.create(
    distance_type=distance_type,
    index_data_type=DataType.float32,
    dim=dataset_dimension,
    dataset_size=dataset_size,
    max_edges_per_node=max_edges_per_node,
    verbose=True,
    collect_stats=True,
)
index.set_num_threads(num_build_threads)

# Now index the dataset 
index.add(data=dataset_to_index, ef_construction=ef_construction)

Note that we specified DataType.float32 to indicate that we want to build an index with vectors represented with float type. If you want to use a different precision, such as uint8_t or int8_t (which are the only other ones currently supported), you can use DataType.uint8 or DataType.int8. The distance type can either be l2 or angular. The collect_stats flag will record the number of distance evaluations.

To query the index we just created by generating IID vectors from the standard normal distribution, we do it as follows

# Set query-time parameters 
k = 100
ef_search = 100

# Run k-NN query with a single thread.
index.set_num_threads(1)

queries = np.random.randn(1000, dataset_to_index.shape[1])
for query in queries:
  distances, indices = index.search_single(
    query=query,
    ef_search=ef_search,
    K=k,
  )

You can parallelize the search by setting the number of threads to a desired number and using a different API that also returns the exact same results as search_single.

index.set_num_threads(16)
distances, indices = index.search(queries=queries, ef_search=ef_search, K=k)

Getting Started in C++

As mentioned earlier, there is nothing to build since this is header-only. We will translate the above Python code in C++ to illustrate how to use the C++ API.

#include <cstdint>
#include <flatnav/index/Index.h>
#include <flatnav/distances/SquaredL2Distance.h>
#include <flatnav/distances/DistanceInterface.h>

template <typename dist_t>
void run_knn_search(Index<dist_t, int>>* index, float *queries, int* gtruth, 
        int ef_search, int K, int num_queries, int num_gtruth, int dim) {

  float mean_recall = 0;
  for (int i = 0; i < num_queries; i++) {
    float *q = queries + dim * i;
    int *g = gtruth + num_gtruth * i;
    std::vector<std::pair<float, int>> result =
        index->search(q, K, ef_search);

    float recall = 0;
    for (int j = 0; j < K; j++) {
      for (int l = 0; l < K; l++) {
        if (result[j].second == g[l]) {
          recall = recall + 1;
        }
      }
    }
    recall = recall / K;
    mean_recall = mean_recall + recall;
  }
}


int main(int argc, char** argv) {
  uint32_t dataset_size = 1000000;
  uint32_t dataset_dimension = 128;

  // We skip the random data generation, but you can do that with std::mt19937, std::random_device 
  // and std::normal_distribution
  // std::vector<float> dataset_to_index; 

  uint32_t max_edges_per_node = 32;
  uint32_t ef_construction = 100;

  // Create an index with l2 distance 
  auto distance = SquaredL2Distance<>::create(dataset_dimension);
  auto* index = new Index<SquaredL2Distance<DataType::float32>>, int>(
      /* dist = */ std::move(distance), /* dataset_size = */ dataset_size,
      /* max_edges_per_node = */ max_edges_per_node);

  index->setNumThreads(build_num_threads);

  std::vector<int> labels(dataset_size);
  std::iota(labels.begin(), labels.end(), 0);
  index->template addBatch<float>(/* data = */ (void *)dataset_to_index,
                                  /* labels = */ labels,
                                  /* ef_construction */ ef_construction);

  // Now query the index and compute the recall 
  // We assume you have a ground truth (int*) array and a queries (float*) array
  uint32_t ef_search = 100;
  uint32_t k = 100;
  uint32_t num_queries = 1000;
  uint32_t num_gtruth = 1000;
  
  // Query the index and compute the recall. 
  run_knn_search(index, queries, gtruth, ef_search, k, num_queries, num_gtruth, dataset_dimension);
}

Datasets from ANN-Benchmarks

ANN-Benchmarks provide HDF5 files for a standard benchmark of near-neighbor datasets, queries and ground-truth results. To index any of these datasets you can use the construct_npy.cpp and query_npy.cpp files linked above.

To generate the ANNS benchmark datasets, run the following script

$ ./bin/download_anns_datasets.sh <dataset-name> [--normalize]

For datasets that use the angular/cosine similarity, you will need to use --normalize option so that the distances are computed correctly.

Available dataset names include:

_ mnist-784-euclidean
_ sift-128-euclidean
_ glove-25-angular
_ glove-50-angular
_ glove-100-angular
_ glove-200-angular
_ deep-image-96-angular
_ gist-960-euclidean
_ nytimes-256-angular

Experimental API and Future Extensions

You can find the current work under development under the development-features directory. While some of these features may be usable, they are not guarranteed to be stable. Stable features will be expected to be part of the PyPI releases. The most notable on-going extension that's under development is product quantization.

Citation

If you find this library useful, please consider citing our associated paper:

@article{munyampirwa2024down,
  title={Down with the Hierarchy: The'H'in HNSW Stands for" Hubs"},
  author={Munyampirwa, Blaise and Lakshman, Vihan and Coleman, Benjamin},
  journal={arXiv preprint arXiv:2412.01940},
  year={2024}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flatnav-0.1.2.tar.gz (22.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

flatnav-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (455.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

flatnav-0.1.2-cp312-cp312-macosx_10_14_x86_64.whl (292.5 kB view details)

Uploaded CPython 3.12macOS 10.14+ x86-64

flatnav-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (459.6 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

flatnav-0.1.2-cp311-cp311-macosx_10_14_x86_64.whl (288.4 kB view details)

Uploaded CPython 3.11macOS 10.14+ x86-64

flatnav-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (459.9 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

flatnav-0.1.2-cp310-cp310-macosx_10_14_x86_64.whl (288.4 kB view details)

Uploaded CPython 3.10macOS 10.14+ x86-64

flatnav-0.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458.4 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

flatnav-0.1.2-cp39-cp39-macosx_10_14_x86_64.whl (288.5 kB view details)

Uploaded CPython 3.9macOS 10.14+ x86-64

flatnav-0.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (459.3 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

flatnav-0.1.2-cp38-cp38-macosx_10_14_x86_64.whl (288.2 kB view details)

Uploaded CPython 3.8macOS 10.14+ x86-64

File details

Details for the file flatnav-0.1.2.tar.gz.

File metadata

  • Download URL: flatnav-0.1.2.tar.gz
  • Upload date:
  • Size: 22.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for flatnav-0.1.2.tar.gz
Algorithm Hash digest
SHA256 3f306c881a56408425143a2bfbca68eb6e4a3092678a49a962aba9139d8611fa
MD5 468cb3e30e7234a186ed1354986d1cdf
BLAKE2b-256 ea1724f0a463c9c84aa491442fb295b5a28a81b81330975a3cff2d9e532175a9

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 16b21323ce99b63da53328c0662f59cab62e2392f35f11ad2e6f9e3d39a76813
MD5 06f1a7e0564a484042e4e0e15f6549d8
BLAKE2b-256 ad1f31782e4d45fe48b97393ce24379e007cdd9a2d9ec5be35a46dca7232ceb2

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp312-cp312-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp312-cp312-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 9aa5f0d1e3a6189a05ab95eff19e58ba891bf8a6c5516a35e19609e54b0b72d2
MD5 fc9f94785182f8f745346d9ec406712e
BLAKE2b-256 c7574152fe40d89da30479b895d368aa0342757bdeb108c7693e9397b20642fd

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c5d7d5e971aa55fb73f995ad2855d9c6cd481bed3bb1e35db26228e4fa1a92e4
MD5 22464d6e724bc2d58d48cf546b4fa956
BLAKE2b-256 cd8f705941091e2f9d63d492ea16ff8298f904576c0ac2cf1ec406d04b0b25d0

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp311-cp311-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp311-cp311-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 4a9d28e5fd6df624df15cc26f8024ee03052caf07e7f007db0f94f8d6aa280e5
MD5 c0eb13128427c471bd6020a2b0487c44
BLAKE2b-256 8565e93246f8491952f26912ca85bd49ff594ecc0f5de1f171683fce3f114c4e

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 501aa33499eaecc23e9d48c5b8aef98baf206627ef80e9a64c06710404652f7d
MD5 0f23b1d44f6012493f6b6533c1a7a15c
BLAKE2b-256 2d4706802671bf6a86d50b8febf7a0f7b7d0e1f884dc613bd484eb4464ed6082

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp310-cp310-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp310-cp310-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 80fae3692b61b82980511988b78b8ad1a7875fc16008f4e0a2895c3fc6973c39
MD5 71dcd5c57579c87ab5ec50440d59e5d4
BLAKE2b-256 7e50fd149362f3bbacaa93301fa98c89630b41ee04957fecb4161c97f99a764a

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 38e89bfa0d9ee3240fddb16b949d47682b0dde7c44b23f8570b03e9c58d615e7
MD5 0c5e30125fb3e70400f37bb15aea8a08
BLAKE2b-256 d1fe7a60f17c4e2163db1f01c39d22308b10be62eb13618b4086daa9b13c331c

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp39-cp39-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp39-cp39-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 ce3a19ec8ac56d87f7d22b5a4d4cdcc0dbbbcf7915a3c42623ee1166e497f3a0
MD5 ed76a1aee711565e3e8dbeb0739c51c0
BLAKE2b-256 0b5fe161be105f452a9752da10e7146d7204a0f6d9a7fd2cb8c0bfe9d6636aa1

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9cf50abe449a906f67706406166f968459d6cfc45327d5182aca939deb2ad642
MD5 9bb681d78ec3fac42202c5d39674d91f
BLAKE2b-256 01aa84780ff6856d8db0ce5823af83bf6fa13fafa971a44a4ed2f1f3f7044031

See more details on using hashes here.

File details

Details for the file flatnav-0.1.2-cp38-cp38-macosx_10_14_x86_64.whl.

File metadata

File hashes

Hashes for flatnav-0.1.2-cp38-cp38-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 fcc028febf21191a63720249883a76d1e503bfc2217c24ee0902c3004afdd065
MD5 d231afd580dd9257fd8f17b099de7030
BLAKE2b-256 584ef724e1ac1b9bd4151d5a6941dad66376756384644fcd42512ea53cbfdbcf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page