Efficient vector DB on large datasets from disk, using minimal memory.

DiskVectorIndex - Ultra-Low Memory Vector Search on Large Datasets

Indexing large datasets (100M+ embeddings) requires a lot of memory in most vector databases: for 100M documents/embeddings, most vector databases require about 500 GB of memory, which drives up your server costs accordingly.

This repository offers methods for searching very large datasets (100M+ embeddings) with just 300 MB of memory, making semantic search on such datasets practical for Memory-Poor developers.

We provide various pre-built indices that can be used for semantic search and for powering your RAG applications.

Pre-Built Indices

Below you will find the available pre-built indices. The embeddings are downloaded on the first call; the download size is listed under Index Size. Most of the embeddings are memory-mapped from disk: for example, the Cohere/trec-rag-2024-index corpus needs 15 GB of disk but just 380 MB of memory to load the index.

Name | Description | #Docs | Index Size | Memory Needed
Cohere/trec-rag-2024-index | Segmented corpus for TREC RAG 2024 | 113,520,750 | 15 GB | 380 MB
fineweb-edu-10B-index (soon) | 10B token sample from fineweb-edu embedded and indexed on document level. | 9,267,429 | 1.4 GB | 230 MB
fineweb-edu-100B-index (soon) | 100B token sample from fineweb-edu embedded and indexed on document level. | 69,672,066 | 9.2 GB | 380 MB
fineweb-edu-350B-index (soon) | 350B token sample from fineweb-edu embedded and indexed on document level. | 160,198,578 | 21 GB | 380 MB
fineweb-edu-index (soon) | Full 1.3T token dataset fineweb-edu embedded and indexed on document level. | 324,322,256 | 42 GB | 285 MB

Each index comes with its respective corpus, which is split into smaller chunks. These chunks are downloaded on demand and reused for subsequent queries.
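
The download-once-then-reuse behaviour can be pictured with a small cache sketch. This is not the package's code: get_chunk, fake_fetch, the cache directory and the file naming are all hypothetical names chosen for illustration, and the "download" is simulated locally.

```python
import os
import tempfile


def get_chunk(chunk_id, cache_dir, fetch):
    """Return the bytes of a corpus chunk, fetching it only if it is not cached.

    `fetch` is a hypothetical callable standing in for the real download.
    """
    path = os.path.join(cache_dir, f"chunk_{chunk_id}.bin")
    if not os.path.exists(path):
        data = fetch(chunk_id)
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
        # Rename only after a complete write, so a partial file is never reused.
        os.replace(tmp_path, path)
    with open(path, "rb") as f:
        return f.read()


fetch_calls = []


def fake_fetch(chunk_id):
    # Simulated download; records each call so reuse can be observed.
    fetch_calls.append(chunk_id)
    return f"documents of chunk {chunk_id}".encode("utf-8")


cache_dir = tempfile.mkdtemp()
first = get_chunk(7, cache_dir, fake_fetch)   # triggers the simulated download
second = get_chunk(7, cache_dir, fake_fetch)  # served from the local cache
print(len(fetch_calls))
```

The second call never reaches fake_fetch, which is the property that keeps repeated queries against the same chunk cheap.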

Getting Started

Get your free Cohere API key from cohere.com. You must set this API key as an environment variable:

export COHERE_API_KEY=your_api_key
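
If you want to fail early with a clear message when the key is missing, a small check along these lines may help. It is only a sketch: require_api_key is a hypothetical helper, not part of DiskVectorIndex, and it only looks at the environment variable named above.

```python
import os


def require_api_key(env=None, name="COHERE_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the index")
    return key


# Illustration with an explicit mapping instead of the real environment:
print(require_api_key({"COHERE_API_KEY": "your_api_key"}))
```

In a script you would call require_api_key() with no arguments so that it reads os.environ.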

Install the package:

pip install DiskVectorIndex

You can then search via:

from DiskVectorIndex import DiskVectorIndex

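# The embeddings are downloaded on the first call.
index = DiskVectorIndex("Cohere/trec-rag-2024-index")

# Interactive loop; press Ctrl+C to exit.
while True:
    query = input("\n\nEnter a question: ")
    docs = index.search(query, top_k=3)  # the 3 best-matching documents
    for doc in docs:
        print(doc)
        print("=========")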

You can also load a fully downloaded index from disk via:

from DiskVectorIndex import DiskVectorIndex

index = DiskVectorIndex("path/to/index")

How does it work?

The Cohere embeddings have been optimized to work well in compressed vector space, as detailed in our Cohere int8 & binary Embeddings blog post. The embeddings have been trained to work not only in float32, which requires a lot of memory, but also with int8, binary, and Product Quantization (PQ) compression.

The above indices use Product Quantization (PQ) to go from the original 1024*4=4096 bytes per embedding to just 128 bytes per embedding, reducing your memory requirement by 32x.
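
The arithmetic behind these numbers can be checked with a few lines. The sketch below only multiplies out the figures quoted in this README (1024 dimensions, 4 bytes per float32 value, 128 bytes per PQ code, and the document count of Cohere/trec-rag-2024-index); it counts raw vector storage only and ignores any overhead the index adds.

```python
def storage_bytes(num_embeddings, bytes_per_embedding):
    """Raw storage for the vectors alone, without index overhead."""
    return num_embeddings * bytes_per_embedding


DIM = 1024              # embedding dimensions
FLOAT32_BYTES = 4       # bytes per float32 value
PQ_BYTES = 128          # bytes per PQ-compressed embedding
NUM_DOCS = 113_520_750  # documents in Cohere/trec-rag-2024-index

float32_per_vec = DIM * FLOAT32_BYTES       # 4096 bytes
compression = float32_per_vec // PQ_BYTES   # 32x smaller

full = storage_bytes(NUM_DOCS, float32_per_vec)
pq = storage_bytes(NUM_DOCS, PQ_BYTES)
print(f"float32: {full / 1e9:.0f} GB, PQ: {pq / 1e9:.1f} GB, ratio: {compression}x")
# prints "float32: 465 GB, PQ: 14.5 GB, ratio: 32x"
```

These raw sizes are roughly in line with the "about 500 GB" and "15 GB" figures quoted above.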

Further, we use faiss with a memory-mapped IVF index: in this case, only a small fraction of the embeddings (between 32,768 and 131,072) must be loaded into memory.
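
To illustrate why a memory-mapped inverted file (IVF) keeps memory use low, here is a toy sketch using only the Python standard library. It is not how faiss or this package implements it, and it stores uncompressed float32 vectors rather than PQ codes. All names (build_ivf, search, DIM, nprobe) are hypothetical, and the centroids are given by hand rather than trained. The idea shown: the centroids and a small offset table stay in memory, while the vectors sit in a file and only the lists of the probed centroids are read through mmap.

```python
import mmap
import os
import struct
import tempfile

DIM = 2  # toy dimensionality, chosen for readability
RECORD = struct.Struct(f"<i{DIM}f")  # vector id followed by DIM float32 values


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(path, vectors, centroids):
    """Write vectors to disk grouped by nearest centroid.

    Returns one (byte offset, count) pair per centroid. Only this table and
    the centroids need to be kept in memory afterwards.
    """
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append((vec_id, vec))
    table = []
    with open(path, "wb") as f:
        for members in lists:
            table.append((f.tell(), len(members)))
            for vec_id, vec in members:
                f.write(RECORD.pack(vec_id, *vec))
    return table


def search(path, table, centroids, query, top_k=1, nprobe=1):
    """Scan only the lists of the nprobe centroids closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    hits = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for c in probe:
            offset, count = table[c]
            for i in range(count):
                vec_id, *vec = RECORD.unpack_from(mm, offset + i * RECORD.size)
                hits.append((sq_dist(query, vec), vec_id))
    return [vec_id for _, vec_id in sorted(hits)[:top_k]]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
path = os.path.join(tempfile.mkdtemp(), "toy_ivf.bin")
table = build_ivf(path, vectors, centroids)
result = search(path, table, centroids, query=(5.0, 5.0), top_k=2)
print(result)
```

With nprobe=1 the search reads only the two records stored under the second centroid; the other list is never touched, which is what keeps the memory footprint small in this toy version.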

Need Semantic Search at Scale?

At Cohere, we have helped customers run semantic search on tens of billions of embeddings at a fraction of the cost. Feel free to reach out to Nils Reimers if you need a solution that scales.
