Fast and Light Approximate Nearest Neighbor Search Database integrated with the Jina Ecosystem
Project description
PQLite
PQLite
is an Approximate Nearest Neighbor Search (ANNS) library integrated with the Jina ecosystem.
This indexer is recommended to be used when an application requires search with filters applied on Document
tags.
The filtering query language
is based on MongoDB's query and projection operators. We currently support a subset of those selectors.
The tags filters can be combined with $and
and $or
:
$eq
- Equal to (number, string)$ne
- Not equal to (number, string)$gt
- Greater than (number)$gte
- Greater than or equal to (number)$lt
- Less than (number)$lte
- Less than or equal to (number)$in
- Included in an array$nin
- Not included in an array
For example, we want to search for a product with a price no more than 50$
.
index.search(query, filter={"price": {"$lte": 50}})
More example filter expresses
- A Nike shoes with white color
{
"brand": {"$eq": "Nike"},
"category": {"$eq": "Shoes"},
"color": {"$eq": "White"}
}
Or
{
"$and":
{
"brand": {"$eq": "Nike"},
"category": {"$eq": "Shoes"},
"color": {"$eq": "White"}
}
}
- A Nike shoes or price less than
100$
{
"$or":
{
"brand": {"$eq": "Nike"},
"price": {"$lt": 100}
}
}
Installation
To install PQLite you can simply run:
pip install https://github.com/jina-ai/pqlite/archive/refs/heads/main.zip
Getting Started
For an in-depth overview of the features of PQLite you can follow along with one of the examples below:
Name | Link |
---|---|
E-commerce product image search with PQLite |
Quick Start
- Create a new
pqlite
import random
import numpy as np
from jina import Document, DocumentArray
from pqlite import PQLite
N = 10000 # number of data points
Nq = 10 # number of query data
D = 128 # dimentionality / number of features
# the column schema: (name:str, dtype:type, create_index: bool)
pqlite = PQLite(dim=D, columns=[('price', float)], data_path='./workspace_data')
Note that this will create a folder ./workspace_data
where indexed data will be stored.
If there is already a folder with this name and the code presented here is not working remove that folder.
- Add new data
X = np.random.random((N, D)).astype(np.float32) # 10,000 128-dim vectors to be indexed
docs = DocumentArray(
[
Document(id=f'{i}', embedding=X[i], tags={'price': random.random()})
for i in range(N)
]
)
pqlite.index(docs)
- Search with filtering
Xq = np.random.random((Nq, D)).astype(np.float32) # a 128-dim query vector
query = DocumentArray([Document(embedding=Xq[i]) for i in range(Nq)])
# without filtering
pqlite.search(query, limit=10)
print(f'the result without filtering:')
for i, q in enumerate(query):
print(f'query [{i}]:')
for m in q.matches:
print(f'\t{m.id} ({m.scores["euclidean"].value})')
# with filtering
pqlite.search(query, filter={"price": {"$lte": 50}}, limit=10)
print(f'the result with filtering:')
for i, q in enumerate(query):
print(f'query [{i}]:')
for m in q.matches:
print(f'\t{m.id} {m.scores["euclidean"].value} (price={m.tags["x"]})')
- Update data
Xn = np.random.random((10, D)).astype(np.float32) # 10,000 128-dim vectors to be indexed
docs = DocumentArray(
[
Document(id=f'{i}', embedding=Xn[i], tags={'price': random.random()})
for i in range(10)
]
)
pqlite.update(docs)
- Delete data
pqlite.delete(['1', '2'])
Benchmark
One can run executor/benchmark.py
to get a quick performance overview.
Stored data | Indexing time | Query size=1 | Query size=8 | Query size=64 |
---|---|---|---|---|
10000 | 2.970 | 0.002 | 0.013 | 0.100 |
100000 | 76.474 | 0.011 | 0.078 | 0.649 |
500000 | 467.936 | 0.046 | 0.356 | 2.823 |
1000000 | 1025.506 | 0.091 | 0.695 | 5.778 |
Results with filtering can be generated from examples/benchmark_with_filtering.py
. This script should produce a table similar to:
Stored data | % same filter | Indexing time | Query size=1 | Query size=8 | Query size=64 |
---|---|---|---|---|---|
10000 | 5 | 2.869 | 0.004 | 0.030 | 0.270 |
10000 | 15 | 2.869 | 0.004 | 0.035 | 0.294 |
10000 | 20 | 3.506 | 0.005 | 0.038 | 0.287 |
10000 | 30 | 3.506 | 0.005 | 0.044 | 0.356 |
10000 | 50 | 3.506 | 0.008 | 0.064 | 0.484 |
10000 | 80 | 2.869 | 0.013 | 0.098 | 0.910 |
100000 | 5 | 75.960 | 0.018 | 0.134 | 1.092 |
100000 | 15 | 75.960 | 0.026 | 0.211 | 1.736 |
100000 | 20 | 78.475 | 0.034 | 0.265 | 2.097 |
100000 | 30 | 78.475 | 0.044 | 0.357 | 2.887 |
100000 | 50 | 78.475 | 0.068 | 0.565 | 4.383 |
100000 | 80 | 75.960 | 0.111 | 0.878 | 6.815 |
500000 | 5 | 497.744 | 0.069 | 0.561 | 4.439 |
500000 | 15 | 497.744 | 0.134 | 1.064 | 8.469 |
500000 | 20 | 440.108 | 0.152 | 1.199 | 9.472 |
500000 | 30 | 440.108 | 0.212 | 1.650 | 13.267 |
500000 | 50 | 440.108 | 0.328 | 2.637 | 21.961 |
500000 | 80 | 497.744 | 0.580 | 4.602 | 36.986 |
1000000 | 5 | 1052.388 | 0.131 | 1.031 | 8.212 |
1000000 | 15 | 1052.388 | 0.263 | 2.191 | 16.643 |
1000000 | 20 | 980.598 | 0.351 | 2.659 | 21.193 |
1000000 | 30 | 980.598 | 0.461 | 3.713 | 29.794 |
1000000 | 50 | 980.598 | 0.732 | 5.975 | 47.356 |
1000000 | 80 | 1052.388 | 1.151 | 9.255 | 73.552 |
Note that:
- query times presented are represented in seconds.
% same filter
indicates the amount of data that verifies a filter in the database.- For example, if
% same filter = 10
andStored data = 1_000_000
then it means100_000
example verify the filter.
- For example, if
Implemented Algorithms
Currently PQLite supports:
- HNSW Algorithm (default choice)
- PQ-linear-scan (requires training)
Research foundations of PQLite
- Xor Filters Faster and Smaller Than Bloom Filters
- CVPR20 Tutorial Billion-scale Approximate Nearest Neighbor Search
- XOR-Quantization Fast top-K Cosine Similarity Search through XOR-Friendly Binary Quantization on GPUs
- NeurIPS21 Challenge Billion-Scale Approximate Nearest Neighbor Search Challenge NeurIPS'21 competition track
- PAMI 2011 Product quantization for nearest neighbor search
- CVPR 2016 Efficient Indexing of Billion-Scale Datasets of Deep Descriptors
- NIPs 2017 Multiscale Quantization for Fast Similarity Search
- NIPs 2018 Non-metric Similarity Graphs for Maximum Inner Product Search
- ACMMM 2018 Reconfigurable Inverted Index code
- ECCV 2018 Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors
- CVPR 2019 Unsupervised Neural Quantization for Compressed-Domain Similarity Search
- ICML 2019 Learning to Route in Similarity Graphs
- ICML 2020 Graph-based Nearest Neighbor Search: From Practice to Theory
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file pqlite-0.2.5.tar.gz
.
File metadata
- Download URL: pqlite-0.2.5.tar.gz
- Upload date:
- Size: 182.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 960d3d866d750909154680d1e4bad49dcef4b5cea3ca17a37bcafd661af830e8 |
|
MD5 | 2b00c1134169e1f31aa292f6a6ad2d8e |
|
BLAKE2b-256 | 2ab095379922568644e36f67860ef570e0762d6036ddf330d15d5b71626c232e |