Blazing-Fast and Lightweight Approximate Nearest Neighbor Search Database

Project description

PQLite

PQLite is a blazing-fast Approximate Nearest Neighbor Search (ANNS) library.

WARNING

  • PQLite is still in the very early stages of development. APIs can and will change (now is the time to make suggestions!). Important features are missing, and documentation is sparse.
  • PQLite contains code that must be compiled before use. The build is configured in setup.py, so users only need to run pip install . from the root directory.

About

  • Features: A quick overview of PQLite's features.
  • Roadmap: The PQLite team's development plan.
  • Introducing PQLite: A blog post covering some of PQLite's features.

Quick Start

Setup

$ git clone https://github.com/jina-ai/pqlite.git \
  && cd pqlite \
  && pip install .

How to use?

  1. Create a new PQLite instance
import random
import numpy as np
from jina import Document, DocumentArray
from pqlite import PQLite

N = 10000  # number of data points
Nq = 10    # number of queries
D = 128    # dimensionality / number of features

# the column schema: (name: str, dtype: type, create_index: bool)
pqlite = PQLite(dim=D, columns=[('x', float, True)], data_path='./data')
  2. Add new data
X = np.random.random((N, D)).astype(np.float32)  # 10,000 128-dim vectors to be indexed
docs = DocumentArray(
    [
        Document(id=f'{i}', embedding=X[i], tags={'x': random.random()})
        for i in range(N)
    ]
)
pqlite.index(docs)
  3. Search with filtering
Xq = np.random.random((Nq, D)).astype(np.float32)  # 10 query vectors, 128-dim each
query = DocumentArray([Document(embedding=Xq[i]) for i in range(Nq)])

# without filtering
pqlite.search(query, limit=10)

print('the result without filtering:')
for i, q in enumerate(query):
    print(f'query [{i}]:')
    for m in q.matches:
        print(f'\t{m.id} ({m.scores["euclidean"].value})')

# with filtering
# condition schema: (column_name: str, relation: str, value: any)
conditions = [('x', '<', 0.3)]
pqlite.search(query, conditions=conditions, limit=10)
print('the result with filtering:')
for i, q in enumerate(query):
    print(f'query [{i}]:')
    for m in q.matches:
        print(f'\t{m.id} {m.scores["euclidean"].value} (x={m.tags["x"]})')
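PQLite applies the filter and the nearest-neighbor search internally. As a rough mental model only (not PQLite's actual implementation), an exact filtered search over the same kind of data can be sketched with NumPy, where the column condition prunes candidates before distances are ranked:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 128

X = rng.random((N, D), dtype=np.float32)     # indexed vectors
x_col = rng.random(N)                        # values of the 'x' column
q = rng.random(D, dtype=np.float32)          # one query vector

def filtered_search(q, limit=10, x_max=0.3):
    """Brute-force equivalent of conditions=[('x', '<', x_max)]."""
    ids = np.flatnonzero(x_col < x_max)            # apply the filter first
    dists = np.linalg.norm(X[ids] - q, axis=1)     # Euclidean distance to survivors
    order = np.argsort(dists)[:limit]              # keep the closest `limit`
    return ids[order], dists[order]

ids, dists = filtered_search(q)
```

Every returned id satisfies the condition and results come back sorted by distance; a real index avoids the exhaustive scan, but the contract is the same.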
  4. Update data
Xn = np.random.random((10, D)).astype(np.float32)  # 10 new 128-dim vectors for the update
docs = DocumentArray(
    [
        Document(id=f'{i}', embedding=Xn[i], tags={'x': random.random()})
        for i in range(10)
    ]
)
pqlite.update(docs)
  5. Delete data
pqlite.delete(['1', '2'])

Benchmark

All experiments were performed with an Intel(R) Xeon(R) CPU @ 2.00GHz and an Nvidia Tesla T4 GPU.

TODO

  • Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python.
  • Bolt: 10x faster matrix and vector operations.
  • MADDNESS: Multiplying Matrices Without Multiplying (code).
  • embeddinghub: a vector database for machine learning embeddings.
  • mobius: Möbius Transformation for Fast Inner Product Search on Graph.

References

  • hyperfine: a good UX example.
  • PGM-index: a state-of-the-art learned data structure enabling fast lookup, predecessor, range searches, and updates in arrays of billions of items using orders of magnitude less space than traditional indexes.
  • Xor Filters: Faster and Smaller Than Bloom Filters.
  • CVPR20 Tutorial: Billion-scale Approximate Nearest Neighbor Search.
  • XOR-Quantization: Fast top-K Cosine Similarity Search through XOR-Friendly Binary Quantization on GPUs.
  • NeurIPS'21 Challenge: Billion-Scale Approximate Nearest Neighbor Search Challenge (NeurIPS'21 competition track).

Research foundations of PQLite

  • PAMI 2011 Product quantization for nearest neighbor search
  • CVPR 2016 Efficient Indexing of Billion-Scale Datasets of Deep Descriptors
  • NIPS 2017 Multiscale Quantization for Fast Similarity Search
  • NIPS 2018 Non-metric Similarity Graphs for Maximum Inner Product Search
  • ACMMM 2018 Reconfigurable Inverted Index code
  • ECCV 2018 Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors
  • CVPR 2019 Unsupervised Neural Quantization for Compressed-Domain Similarity Search
  • ICML 2019 Learning to Route in Similarity Graphs
  • ICML 2020 Graph-based Nearest Neighbor Search: From Practice to Theory
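Product quantization (PAMI 2011), the technique PQLite is named after, compresses a vector by splitting it into subvectors and quantizing each subspace against its own small codebook, so each vector is stored as a few one-byte codes. A minimal illustrative sketch with toy random codebooks (real PQ learns them with k-means per subspace; this is not PQLite's internal code):

```python
import numpy as np

rng = np.random.default_rng(42)
D, M, K = 128, 8, 256   # vector dim, number of subspaces, centroids per subspace
Ds = D // M             # each subvector is 16-dim

# Toy codebooks; in practice these are trained with k-means on real data.
codebooks = rng.random((M, K, Ds), dtype=np.float32)

def pq_encode(x):
    """Map a D-dim vector to M one-byte codes (nearest centroid per subspace)."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * Ds:(m + 1) * Ds]
        dists = np.linalg.norm(codebooks[m] - sub, axis=1)
        codes[m] = np.argmin(dists)
    return codes

def pq_decode(codes):
    """Reconstruct an approximate vector by concatenating the chosen centroids."""
    return np.concatenate([codebooks[m][codes[m]] for m in range(M)])

x = rng.random(D, dtype=np.float32)
codes = pq_encode(x)     # 128 float32 values (512 bytes) -> 8 bytes
x_hat = pq_decode(codes)
```

At search time, PQ-based indexes compare a query against these codes via small precomputed distance lookup tables instead of decoding, which is what makes them fast and memory-light at billion scale.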

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pqlite-0.0.3.tar.gz (173.0 kB view details)

Uploaded Source

File details

Details for the file pqlite-0.0.3.tar.gz.

File metadata

  • Download URL: pqlite-0.0.3.tar.gz
  • Upload date:
  • Size: 173.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.8.3

File hashes

Hashes for pqlite-0.0.3.tar.gz
Algorithm Hash digest
SHA256 53a46d18a3d02318bfd86a0b952c405ca26dc50811f4efe2385c909257f53081
MD5 987e0b902f59fa459db4f4cc1bd295ab
BLAKE2b-256 6945f0661939b3032cc01187dc3f454cc4d6b0e48dbb8dd0be7495ef7f509f37

See more details on using hashes here.
