Blazing Fast and Light Approximate Nearest Neighbor Search Database

PQLite

PQLite is a blazing fast Approximate Nearest Neighbor Search (ANNS) library.

WARNING

PQLite is still in the very early stages of development. APIs can and will change (now is the time to make suggestions!). Important features are missing, and documentation is sparse.
PQLite contains code that must be compiled before it can be used. The build is configured in setup.py, so users only need to run pip install . from the repository root.

About

Features: A quick overview of PQLite's features.
Roadmap: The PQLite team's development plan.
Introducing PQLite: A blog post covering some of PQLite's features.

Quick Start

Setup

$ git clone https://github.com/jina-ai/pqlite.git \
  && cd pqlite \
  && pip install .
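
After the install finishes, one quick way to confirm that the package is visible to the current interpreter is to look it up on the import path. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of PQLite.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


installed = is_installed('pqlite')
print('pqlite importable:', installed)
```

If this prints False, the build step most likely failed or the package was installed into a different environment.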

How to use?

Create a new pqlite

import random
import numpy as np
from pqlite import PQLite

N = 10000  # number of data points to index
Nt = 2000  # number of training vectors
Nq = 10  # number of query vectors
D = 128  # dimensionality / number of features

Xt = np.random.random((Nt, D)).astype(np.float32)  # 2,000 128-dim vectors for training

# the column schema: (name:str, dtype:type, create_index: bool)
pqlite = PQLite(d_vector=D, n_cells=64, n_subvectors=8, columns=[('x', float, True)])
pqlite.fit(Xt)
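
The n_subvectors parameter and the project name suggest product quantization (see the PAMI 2011 paper under research foundations): each vector is cut into equal-length sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a codebook learned during training. The sketch below illustrates that idea in plain Python so it runs without dependencies. It is not PQLite's implementation; `train_codebooks`, `encode`, and the small sizes used here are assumptions chosen for illustration.

```python
import random


def split(vec, n_subvectors):
    """Cut a vector into n_subvectors equal-length chunks."""
    step = len(vec) // n_subvectors
    return [vec[i * step:(i + 1) * step] for i in range(n_subvectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(chunk, centroids):
    """Index of the centroid closest to chunk."""
    return min(range(len(centroids)), key=lambda c: sq_dist(chunk, centroids[c]))


def kmeans(points, k, n_iter, rng):
    """A few Lloyd iterations, starting from k randomly chosen points."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_codebooks(train, n_subvectors, n_clusters, n_iter=5, seed=0):
    """Learn one codebook per sub-vector position."""
    rng = random.Random(seed)
    chunks = [split(v, n_subvectors) for v in train]
    return [kmeans([c[m] for c in chunks], n_clusters, n_iter, rng)
            for m in range(n_subvectors)]


def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [nearest(chunk, cb) for chunk, cb in zip(split(vec, len(codebooks)), codebooks)]


rng = random.Random(42)
D, n_subvectors, n_clusters = 16, 4, 8  # small sizes so the sketch runs quickly
train = [[rng.random() for _ in range(D)] for _ in range(200)]
codebooks = train_codebooks(train, n_subvectors, n_clusters)
code = encode(train[0], codebooks)
print(code)  # one small integer per sub-vector
```

With this scheme a 16-dimensional float vector is stored as 4 small integers, which is where the memory savings of a quantized index come from.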

Add new data

X = np.random.random((N, D)).astype(np.float32)  # 10,000 128-dim vectors to be indexed

tags = [{'x': random.random()} for _ in range(N)]
pqlite.add(X, ids=list(range(len(X))), doc_tags=tags)
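
For larger collections it may be convenient to feed the index in chunks while keeping vectors, ids, and tags aligned. The helper below is a hypothetical sketch, not part of PQLite; it only relies on `len()` and slicing, so it should work with NumPy arrays as well as the plain lists used here. The call to `pqlite.add` is left as a comment because it needs a fitted index.

```python
def iter_batches(vectors, ids, tags, batch_size):
    """Yield aligned (vectors, ids, tags) slices of at most batch_size rows."""
    if not (len(vectors) == len(ids) == len(tags)):
        raise ValueError('vectors, ids and tags must have the same length')
    for start in range(0, len(vectors), batch_size):
        end = start + batch_size
        yield vectors[start:end], ids[start:end], tags[start:end]


vectors = [[float(i), float(i + 1)] for i in range(10)]  # stand-in for X
ids = list(range(10))
tags = [{'x': i / 10} for i in range(10)]

batches = list(iter_batches(vectors, ids, tags, batch_size=4))
for xb, idb, tagb in batches:
    # with a fitted index: pqlite.add(xb, ids=idb, doc_tags=tagb)
    print(len(xb), idb)
```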

Search with Filtering

query = np.random.random((Nq, D)).astype(np.float32)  # 10 128-dim query vectors

# without filtering
dists, ids = pqlite.search(query, k=5)

print('the result without filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')

# with filtering
# condition schema: (column_name: str, relation: str, value: any)
conditions = [('x', '<', 0.3)]
dists, ids = pqlite.search(query, conditions=conditions, k=5)

print('the result with filtering:')
for i, (dist, idx) in enumerate(zip(dists, ids)):
    print(f'query [{i}]: {dist} {idx}')
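
The condition schema (column_name, relation, value) can be read as a predicate over each document's tags: a document qualifies when every condition holds. The sketch below shows one way such triples could be evaluated in plain Python. It is an illustration of the schema, not PQLite's filtering code; `matches` is a hypothetical helper, and the set of relations is an assumption, since only '<' appears in the example above.

```python
import operator

# Relations assumed for illustration; only '<' is shown in the example above.
RELATIONS = {
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
    '=': operator.eq,
    '!=': operator.ne,
}


def matches(tag, conditions):
    """True if the tag dict satisfies every (column, relation, value) triple."""
    return all(RELATIONS[rel](tag[col], value) for col, rel, value in conditions)


tags = [{'x': 0.1}, {'x': 0.5}, {'x': 0.29}, {'x': 0.3}]
conditions = [('x', '<', 0.3)]
allowed_ids = [i for i, tag in enumerate(tags) if matches(tag, conditions)]
print(allowed_ids)  # [0, 2]
```

Filtering the tags this way before comparing against search results can serve as a sanity check that a filtered search only returns ids from the allowed set.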

Update data

Xn = np.random.random((10, D)).astype(np.float32)  # 10 128-dim vectors to replace existing entries

tags = [{'x': random.random()} for _ in range(10)]
pqlite.update(Xn, ids=list(range(len(Xn))), doc_tags=tags)

Delete data

pqlite.delete(ids=[1, 2])  # use the same integer ids that were passed to add()

Benchmark

All experiments were performed with an Intel(R) Xeon(R) CPU @ 2.00GHz and an NVIDIA Tesla T4 GPU.

Yandex Research Benchmarks for Billion-Scale Similarity Search
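
ANN benchmarks are commonly reported as a trade-off between throughput and recall against exact results. The sketch below shows how those two numbers could be measured; it is an assumption about methodology, not the harness used for the experiments above. `recall_at_k` and `time_search` are hypothetical helpers, and the sorting lambda merely stands in for a real call such as `lambda q: pqlite.search(q, k=5)`.

```python
import time


def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact neighbours that also appear in the approximate results."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    total = sum(len(e) for e in exact_ids)
    return hits / total


def time_search(search_fn, queries, repeats=3):
    """Return the best wall-clock time over several runs and the implied queries per second."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(queries)
        best = min(best, time.perf_counter() - start)
    qps = len(queries) / best if best > 0 else float('inf')
    return best, qps


exact = [[1, 2, 3], [4, 5, 6]]   # ground-truth neighbour ids per query
approx = [[1, 2, 9], [4, 5, 6]]  # ids returned by an approximate index
recall = recall_at_k(approx, exact)
print(f'recall: {recall:.3f}')

# Stand-in workload; replace the lambda with a real search call.
elapsed, qps = time_search(lambda qs: [sorted(q) for q in qs], [[3, 1, 2]] * 1000)
print(f'{elapsed:.6f} s, {qps:.0f} queries/s')
```

Taking the best of several runs reduces the influence of one-off noise such as cache warm-up.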

TODO

Scalene: A high-performance, high-precision CPU, GPU, and memory profiler for Python.
Bolt: 10x faster matrix and vector operations.
MADDNESS: Multiplying Matrices Without Multiplying (code).
embeddinghub: A vector database for machine learning embeddings.
mobius: Möbius Transformation for Fast Inner Product Search on Graph.

References

hyperfine: Good UX example.
PGM-index: State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes.
Xor Filters: Faster and Smaller Than Bloom Filters.
CVPR20 Tutorial: Billion-scale Approximate Nearest Neighbor Search.
XOR-Quantization: Fast top-K Cosine Similarity Search through XOR-Friendly Binary Quantization on GPUs.
NeurIPS21 Challenge: Billion-Scale Approximate Nearest Neighbor Search Challenge, NeurIPS'21 competition track.

Research foundations of PQLite

PAMI 2011: Product quantization for nearest neighbor search
CVPR 2016: Efficient Indexing of Billion-Scale Datasets of Deep Descriptors
NIPS 2017: Multiscale Quantization for Fast Similarity Search
NIPS 2018: Non-metric Similarity Graphs for Maximum Inner Product Search
ACMMM 2018: Reconfigurable Inverted Index (code)
ECCV 2018: Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors
CVPR 2019: Unsupervised Neural Quantization for Compressed-Domain Similarity Search
ICML 2019: Learning to Route in Similarity Graphs
ICML 2020: Graph-based Nearest Neighbor Search: From Practice to Theory

Project details

This version: 0.0.2 (released Nov 25, 2021)
