A tiny and fast local vector database in Python.

picovdb


An extremely fast, ultra-lightweight local vector database in Python.

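"extremely fast": sub-millisecond queries with FAISS, and about 1 millisecond per query in batch mode without it (see Benchmark).

"ultra-lightweight": a single file that depends only on NumPy, plus one optional dependency, faiss-cpu. (See the FAISS note at the end.)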

Install

pip install picovdb

Usage

Create a db:

(Using SentenceTransformer embeddings as an example.)

from picovdb import PicoVectorDB  # On Mac, import before any libs that use pytorch
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 256
model = SentenceTransformer('all-MiniLM-L6-v2')
dim = model.get_sentence_embedding_dimension()

# Read the source text.
with open('A_Christmas_Carol.txt', encoding='UTF8') as f:
    content = f.read()

# Split into fixed-size character chunks. Stepping by CHUNK_SIZE avoids an
# empty trailing chunk when the length is an exact multiple of CHUNK_SIZE.
chunks = [content[i: i + CHUNK_SIZE] for i in range(0, len(content), CHUNK_SIZE)]
embeddings = model.encode(chunks)

# Each record holds the vector, an id, and the chunk text.
data = [
    {
        "_vector_": embeddings[i],
        "_id_": i,
        "content": chunks[i],
    }
    for i in range(len(chunks))
]
db = PicoVectorDB(embedding_dim=dim, storage_file='_acc')
db.upsert(data)
db.save()

Query

from picovdb import PicoVectorDB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
dim = model.get_sentence_embedding_dimension()

db = PicoVectorDB(embedding_dim=dim, storage_file='_acc')
txt = "Are there no prisons? Are there no workhouses?"
emb = model.encode(txt)
q = db.query(emb, top_k=3)
print('query results:', q)
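
For intuition about what a top_k query returns, the sketch below ranks stored vectors with a brute-force scan, assuming cosine similarity as the metric. It is a dependency-free illustration and not picovdb's implementation: the helper name top_k_cosine, the (index, score) result shape, and the toy vectors are all assumptions made for this example.

```python
import heapq
import math


def top_k_cosine(query, vectors, top_k=3):
    """Return (index, score) pairs for the top_k vectors most similar to query.

    Hypothetical helper for illustration only; not part of picovdb.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query)
    scored = []
    for idx, vec in enumerate(vectors):
        denom = q_norm * norm(vec)
        # Treat zero-length vectors as having no similarity to anything.
        score = sum(a * b for a, b in zip(query, vec)) / denom if denom else 0.0
        scored.append((idx, score))
    # nlargest avoids sorting the full list when top_k is small.
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Toy data: three 2-dimensional "embeddings".
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([1.0, 0.1], stored, top_k=2)
print(results)
```

A scan like this touches every stored vector on each query, which is why an index such as FAISS can pay off as the collection grows.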

Benchmark

Embedding Dim: 1024.

Environment: M3 MacBook Air

  1. Pure Python (NumPy only, no FAISS):

    • Inserting 100,000 vectors took about 0.5s.
    • Running 100 queries against 100,000 vectors took roughly 0.8s (0.008s, or 8 milliseconds, per query).
    • Running 1000 queries against 100,000 vectors in batch mode took 1.0s (0.001s, or 1 millisecond, per query).
  2. With FAISS (CPU):

    • Inserting 100,000 vectors took 110s.
    • Running 100 queries against 100,000 vectors took 0.04s (0.0004s, or 0.4 milliseconds, per query).
    • Running 1000 queries against 100,000 vectors in batch mode took 0.1s (0.0001s, or 0.1 milliseconds, per query).

Environment: Windows PC with an Intel Core i7-12700K CPU and an older-generation M.2 NVMe SSD

  1. Pure Python (NumPy only, no FAISS):

    • Inserting 100,000 vectors took about 0.7s.
    • Running 100 queries against 100,000 vectors took roughly 1.5s (0.015s, or 15 milliseconds, per query).
    • Running 1000 queries against 100,000 vectors in batch mode took 1.0s (0.001s, or 1 millisecond, per query).
  2. With FAISS (CPU):

    • Inserting 100,000 vectors took 50s.
    • Running 100 queries against 100,000 vectors took 0.04s (0.0004s, or 0.4 milliseconds, per query).
    • Running 1000 queries against 100,000 vectors in batch mode took 0.16s (0.00016s, or 0.16 milliseconds, per query).
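
The per-query figures above are the total elapsed time divided by the number of queries. A minimal timing sketch along those lines follows. The names time_queries and fake_query are hypothetical and exist only for this example; to measure a real database you would pass a callable that wraps db.query instead of the stand-in workload.

```python
import time


def time_queries(run_query, queries):
    """Run each query once and return (total_seconds, seconds_per_query)."""
    start = time.perf_counter()
    for q in queries:
        run_query(q)
    total = time.perf_counter() - start
    return total, total / len(queries)


def fake_query(q):
    # Stand-in workload so the sketch runs without a database.
    return sum(x * x for x in q)


queries = [[float(i), 1.0, 2.0] for i in range(100)]
total, per_query = time_queries(fake_query, queries)
print(f"{len(queries)} queries took {total:.4f}s ({per_query * 1000:.4f} ms per query)")
```

time.perf_counter is used because it is a monotonic, high-resolution clock intended for measuring short durations.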

FAISS Note

On macOS, if you use FAISS, do one of the following:

  • Import picovdb before any libraries that use PyTorch (e.g. sentence_transformers, transformers) or any other packages that use OpenMP.
  • Set faiss_threads to 1 when initializing PicoVectorDB, e.g. PicoVectorDB(..., faiss_threads=1).
  • Set the environment variable PICOVDB_FAISS_THREADS=1 before running your script.

FAISS >= 1.10 segfaults on Darwin (macOS) when an HNSW index is used with OpenMP multithreading. This is a known issue with FAISS on macOS.

On Windows and Linux, FAISS works fine alongside other libraries that use OpenMP; no special care is needed.
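
One way to apply the environment-variable workaround from inside a script is sketched below. It assumes picovdb reads PICOVDB_FAISS_THREADS when it is imported or initialized, which this README does not state explicitly, so the variable is set before the import. The helper name limit_faiss_threads_on_macos is hypothetical.

```python
import os
import platform


def limit_faiss_threads_on_macos(system=None):
    """Set PICOVDB_FAISS_THREADS=1 on macOS; return True if it applied.

    Hypothetical helper for illustration; not part of picovdb.
    """
    system = platform.system() if system is None else system
    if system == "Darwin":
        # setdefault keeps any value the user already exported in the shell.
        os.environ.setdefault("PICOVDB_FAISS_THREADS", "1")
        return True
    return False


limit_faiss_threads_on_macos()

# Import picovdb only after the variable is in place (assumed to matter):
# from picovdb import PicoVectorDB
```

Setting the variable in the shell before launching the script, as the list above describes, remains the option the README documents directly.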
