A library for approximate nearest neighbor search in Rust.

Overview

Most vector databases consume a lot of memory, particularly when they store metadata alongside each vector. NilVec is designed to be more memory-efficient by embedding metadata directly within the vectors themselves.

In a traditional vector database, metadata should not be stored inside the vectors: the metadata components would contribute to distance calculations and could significantly reduce the accuracy of nearest neighbor searches. NilVec avoids this issue by indexing only the core embedding components, so metadata never enters the distance calculation and does not affect search accuracy.

How It Works

To achieve this separation, NilVec maintains a global map of metadata indexes. This map identifies where metadata is stored within the vectors, allowing NilVec to mask metadata during indexing and searching.

Conceptually, a vector that contains metadata is represented as:

$$ \begin{pmatrix} .0 \\ .1 \\ \vdots \\ .511 \\ \text{meta}_a \\ \text{meta}_b \\ \text{meta}_c \end{pmatrix} \odot \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} .0 \\ .1 \\ \vdots \\ .511 \\ .0 \\ .0 \\ .0 \end{pmatrix} $$

Here, the second vector acts as a mask, zeroing out metadata components so that they are not considered in the distance calculations. As a result, NilVec ignores metadata components during search operations, focusing solely on the embedding values.
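The masking idea can be sketched in a few lines of plain Python. This is only an illustration, not NilVec's implementation: the 4+2 layout, the Euclidean metric, and the helper names `apply_mask` and `masked_distance` are all assumptions made for the example.

```python
import math

# Hypothetical layout: 4 embedding components followed by 2 metadata slots.
EMBEDDING_DIM = 4
mask = [1.0] * EMBEDDING_DIM + [0.0] * 2

def apply_mask(v, mask):
    """Zero out every component whose mask entry is 0 (elementwise product)."""
    return [x * m for x, m in zip(v, mask)]

def masked_distance(a, b, mask):
    """Euclidean distance computed over the unmasked components only."""
    return math.sqrt(sum(m * (x - y) ** 2 for x, y, m in zip(a, b, mask)))

# Two vectors with identical embeddings but very different metadata.
a = [0.0, 0.1, 0.2, 0.3, 7.0, 42.0]
b = [0.0, 0.1, 0.2, 0.3, 99.0, -5.0]

print(apply_mask(a, mask))           # metadata slots become 0.0
print(masked_distance(a, b, mask))   # 0.0, because the embeddings match
```

Without the mask, the same two vectors would be far apart, which is exactly the accuracy problem that masking is meant to prevent.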

Indexing and Metadata Retrieval

Metadata is retrieved using a global map of indexes that indicates which components of the vector correspond to metadata. For example:

index.map = {
    "embedding": 0,
    "meta_a": 512,
    "meta_b": 513,
    "meta_c": 514,
}

i = index.map["meta_a"]  # 512
meta_a = v[i]
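A slightly fuller sketch of the same lookup is shown below. It assumes, purely for illustration, that all metadata slots sit after the embedding, and it uses a hypothetical `split_vector` helper and a plain dict named `index_map` rather than any actual NilVec API.

```python
# Hypothetical map mirroring the example above: name -> position in the vector.
index_map = {"embedding": 0, "meta_a": 512, "meta_b": 513, "meta_c": 514}

def split_vector(v, index_map):
    """Return (embedding, metadata dict) for a stored vector.

    Assumes the metadata slots come after the embedding components.
    """
    meta_positions = {k: i for k, i in index_map.items() if k != "embedding"}
    start = index_map["embedding"]
    end = min(meta_positions.values())  # first metadata slot ends the embedding
    embedding = v[start:end]
    metadata = {k: v[i] for k, i in meta_positions.items()}
    return embedding, metadata

# A stored vector: 512 embedding components plus three metadata values.
v = [0.0] * 512 + [3056.0, 1.0, 0.5]
embedding, metadata = split_vector(v, index_map)

print(len(embedding))  # 512
print(metadata)        # {'meta_a': 3056.0, 'meta_b': 1.0, 'meta_c': 0.5}
```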

Implementation Philosophy

Google's ScaNN is one of the fastest and most efficient libraries for approximate nearest neighbor search. Its rules of thumb are:

  • For a small dataset (fewer than $20 \text{k}$ points), use brute force.
  • For a dataset with fewer than $100 \text{k}$ points, score with AH, then rescore.
  • For datasets larger than $100 \text{k}$ points, partition, score with AH, then rescore.
  • When scoring with AH, dimensions_per_block should be set to $2$.
  • When partitioning, num_leaves should be roughly the square root of the number of data points.
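The rules of thumb above can be written down as a small decision function. This is a sketch only: `suggest_config` and the strategy labels are hypothetical names, not part of ScaNN or NilVec, and treating exactly 100k points as "large" is an assumption, since the rules leave that boundary open.

```python
import math

def suggest_config(num_points):
    """Map a dataset size to a search strategy using the rules of thumb above."""
    if num_points < 20_000:
        # Small dataset: brute force is good enough.
        return {"strategy": "brute_force"}

    # Medium and large datasets both score with AH and then rescore.
    config = {"strategy": "ah_then_rescore", "dimensions_per_block": 2}

    if num_points >= 100_000:
        # Large dataset: partition first, with roughly sqrt(n) leaves.
        config["strategy"] = "partition_ah_then_rescore"
        config["num_leaves"] = round(math.sqrt(num_points))

    return config

print(suggest_config(5_000))
print(suggest_config(50_000))
print(suggest_config(1_000_000))
```

For one million points this suggests 1000 leaves, which matches the square-root guideline.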

Pinecone has the industry's most user-friendly interface. It's as easy as:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# Create a serverless index
# "dimension" needs to match the dimensions of the vectors you upsert
pc.create_index(
    name="products",
    dimension=1536,
    spec=ServerlessSpec(cloud='aws', region='us-east-1')
)

# Target the index
index = pc.Index("products")

# Mock vector and metadata objects (you would bring your own)
vector = [0.010, 2.34,...] # len(vector) = 1536
metadata = {"id": 3056, "description": "Networked neural adapter"}

# Upsert your vector(s)
index.upsert(
  vectors=[
    {"id": "some_id", "values": vector, "metadata": metadata}
  ]
)

Testing

To test NilVec, run:

zig build test
