A lightweight library for nearest neighbor search in Rust.
Overview
Most vector databases consume a lot of memory, especially when handling metadata. NilVec is designed to be more memory-efficient by embedding metadata directly within the vectors themselves.
In a traditional vector database, storing metadata inside the vectors would distort nearest neighbor search, because the metadata components would contribute to every distance calculation. NilVec avoids this issue by indexing only the core embedding components, so metadata never affects search accuracy.
How It Works
To achieve this separation, NilVec maintains a global map of metadata indexes. This map identifies where metadata is stored within the vectors, allowing NilVec to mask metadata during indexing and searching.
Conceptually, a vector that contains metadata is represented as:
$$ \begin{pmatrix} .0 \\ .1 \\ \vdots \\ .511 \\ \text{meta}_a \\ \text{meta}_b \\ \text{meta}_c \end{pmatrix} \odot \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} .0 \\ .1 \\ \vdots \\ .511 \\ .0 \\ .0 \\ .0 \end{pmatrix} $$
Here, the second vector acts as a mask: multiplying element-wise zeroes out the metadata components, so they contribute nothing to the distance calculations and search depends solely on the embedding values.
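The effect of the mask on a distance calculation can be sketched in a few lines of Python. This is a minimal illustration, not NilVec's implementation: it assumes Euclidean distance (the metric NilVec uses is not stated here), and `masked_distance` is a hypothetical helper name.

```python
import math

def masked_distance(a, b, mask):
    """Euclidean distance over the components where mask is 1.

    Components with mask 0 (the metadata slots) are multiplied away,
    so they cannot influence the result.
    """
    return math.sqrt(sum(m * (x - y) ** 2 for x, y, m in zip(a, b, mask)))

# Toy layout (assumption): 3 embedding components followed by 2 metadata slots.
mask = [1, 1, 1, 0, 0]
a = [0.0, 3.0, 0.0, 42.0, 7.0]
b = [4.0, 0.0, 0.0, -5.0, 99.0]

dist = masked_distance(a, b, mask)               # uses only the first 3 components
unmasked = masked_distance(a, b, [1] * len(a))   # what happens without a mask
```

With the mask applied, the very different metadata values in `a` and `b` do not change the distance. Without it, they dominate the result.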
Indexing and Metadata Retrieval
Metadata is retrieved using a global map of indexes that indicates which components of the vector correspond to metadata. For example:
index.map = {
    "embedding": 0,
    "meta_a": 512,
    "meta_b": 513,
    "meta_c": 514,
}
i = index.map["meta_a"]  # 512
meta_a = v[i]
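The same map can drive both the mask and metadata retrieval. The sketch below is illustrative only: `build_mask` and `read_metadata` are hypothetical helpers, not part of NilVec's API, and it assumes a tiny layout with numeric metadata values so the example stays readable.

```python
# Hypothetical layout: 4 embedding components followed by 3 metadata slots.
index_map = {"embedding": 0, "meta_a": 4, "meta_b": 5, "meta_c": 6}

def build_mask(index_map, length):
    """Return a 0/1 mask that is 0 at every metadata position."""
    mask = [1] * length
    for name, pos in index_map.items():
        if name != "embedding":
            mask[pos] = 0
    return mask

def read_metadata(v, index_map):
    """Look up each metadata value in v by its position in the map."""
    return {name: v[pos] for name, pos in index_map.items() if name != "embedding"}

v = [0.1, 0.2, 0.3, 0.4, 10.0, 20.0, 30.0]
mask = build_mask(index_map, len(v))
meta = read_metadata(v, index_map)
```

Because the map is global, one mask serves every vector in the index, and a metadata lookup is a single array access.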
Implementation Philosophy
Google's ScaNN is one of the fastest and most efficient libraries for approximate nearest neighbor search. Its rules of thumb are:
- For a small dataset (fewer than $20 \text{k}$ points), use brute force.
- For a dataset with fewer than $100 \text{k}$ points, score with AH (asymmetric hashing), then rescore.
- For datasets larger than $100 \text{k}$ points, partition, score with AH, then rescore.
- When scoring with AH, `dimensions_per_block` should be set to $2$.
- When partitioning, `num_leaves` should be roughly the square root of the number of data points.
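These rules of thumb can be expressed as a small decision function. The sketch below is a hypothetical helper, not ScaNN's or NilVec's API: the name `choose_strategy` and the returned dictionary keys are made up for illustration. The rules do not say what to do at exactly $100 \text{k}$ points, so the sketch assumes that case goes to partitioning.

```python
import math

def choose_strategy(num_points):
    """Map a dataset size to the rule-of-thumb search configuration."""
    if num_points < 20_000:
        # Small dataset: exhaustive search is fast enough.
        return {"method": "brute_force"}

    # Both larger tiers score with AH and then rescore.
    config = {"method": "ah_then_rescore", "dimensions_per_block": 2}

    if num_points >= 100_000:  # assumption: exactly 100k is treated as "large"
        config["method"] = "partition_ah_then_rescore"
        # Roughly the square root of the number of data points.
        config["num_leaves"] = math.isqrt(num_points)

    return config

small = choose_strategy(5_000)
medium = choose_strategy(50_000)
large = choose_strategy(1_000_000)
```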
Pinecone has one of the industry's most user-friendly interfaces. Creating an index and upserting a vector is as easy as:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create a serverless index
# "dimension" needs to match the dimensions of the vectors you upsert
pc.create_index(
    name="products",
    dimension=1536,
    spec=ServerlessSpec(cloud='aws', region='us-east-1')
)

# Target the index
index = pc.Index("products")

# Mock vector and metadata objects (you would bring your own)
vector = [0.010, 2.34,...]  # len(vector) = 1536
metadata = {"id": 3056, "description": "Networked neural adapter"}

# Upsert your vector(s)
index.upsert(
    vectors=[
        {"id": "some_id", "values": vector, "metadata": metadata}
    ]
)
Testing
To test NilVec, run:
zig build test
Download files
File details
Details for the file nilvec-0.1.3.tar.gz.
File metadata
- Download URL: nilvec-0.1.3.tar.gz
- Upload date:
- Size: 493.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3af191dc9239a873a2bb847cf8d1c6ff5459f4f0af0cc1f73bc1cd366b482e44 |
| MD5 | fc0b19404cd3731c27830fc9c125865c |
| BLAKE2b-256 | 7bd74e8b14b533e0b5bcb11af1a4e66bd92bfb18f1a2ef22f6254b49df2e6fde |
File details
Details for the file nilvec-0.1.3-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: nilvec-0.1.3-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 287.0 kB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7dd4c2b7337818d5313bb4eca0427a71d55c0ec300047ae5953f94fa64bcbb9 |
| MD5 | 1fbd979ba1d159f2a7305e70746a1714 |
| BLAKE2b-256 | a64b8762bcebbfbf01d4081303e087256123999663f76342abd9136b3ab78e27 |
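A downloaded file can be checked against the SHA256 digests listed for each file. The following is a minimal sketch using only Python's standard `hashlib` module. `sha256_of` and `matches` are hypothetical helper names, and the file path is assumed to be wherever you saved the download.

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until f.read returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Compare a file's digest with an expected value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("nilvec-0.1.3.tar.gz", "3af191dc9239a873a2bb847cf8d1c6ff5459f4f0af0cc1f73bc1cd366b482e44")` should return `True` for an intact copy of the source distribution.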