
S3 vector database for big data

Project description

VectorLake

VectorLake is a robust vector database designed for low maintenance and cost, efficient storage, and approximate nearest neighbor (ANN) querying of vector data of any size, distributed across S3 files.


🏷 Features

  • Inspired by the article Which Vector Database Should I Use? A Comparison Cheatsheet


  • Native Big Data Support: Specifically designed to handle large datasets, making it ideal for big data projects.

  • Vector Data Handling: Stores and queries high-dimensional vectors, commonly used for embedding storage in machine learning projects.

  • Efficient Search: Fast approximate nearest neighbor search, especially useful for finding similar vectors in high-dimensional spaces.

  • Data Persistence: Supports data persistence on disk, network volume and S3, enabling long-term storage and retrieval of indexed data.

  • Customizable Partitioning: Trade-off design to minimize database maintenance, cost, and provide custom data partitioning strategies.

  • Native support for LLM agents.

  • Can serve as a feature store for experimental data.

📦 Installation

To get started with VectorLake, simply install the package using pip:

pip install vector_lake

⛓️ Quick Start

import numpy as np
from vector_lake import VectorLake

db = VectorLake(location="s3://vector-lake", dimension=5, approx_shards=243)
N = 100  # for example
D = 5  # Dimensionality of each vector
embeddings = np.random.rand(N, D)

for em in embeddings:
    db.add(em, metadata={}, document="some document")
db.persist()

db = VectorLake(location="s3://vector-lake", dimension=5, approx_shards=243)
# re-init test
db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])
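
For intuition about what `db.query` returns, here is a brute-force exact nearest-neighbor search in plain NumPy that shows the result an ANN query approximates (a standalone sketch, not VectorLake code; `exact_nearest` is an illustrative helper):

```python
import numpy as np

def exact_nearest(embeddings: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k vectors closest to `query` by cosine similarity."""
    # Normalize rows and the query so dot products equal cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = emb_norm @ q_norm
    # argsort is ascending: take the last k indices and reverse for descending similarity.
    return np.argsort(sims)[-k:][::-1]

rng = np.random.default_rng(0)
embeddings = rng.random((100, 5))
query = np.array([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])
top = exact_nearest(embeddings, query, k=3)
print(top)
```

An ANN index trades a small amount of recall against this exact scan for much lower query cost on large datasets.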

Custom feature partition

Use a custom partition to group features by a category of your choosing:

import numpy as np
from vector_lake.core.index import Partition

if __name__ == "__main__":
    db = Partition(location="s3://vector-lake", partition_key="feature", dimension=5)
    N = 100  # for example
    D = 5  # Dimensionality of each vector
    embeddings = np.random.rand(N, D)

    for em in embeddings:
        db.add(em, metadata={}, document="some document")
    db.persist()

    db = Partition(location="s3://vector-lake", partition_key="feature", dimension=5)
    # re-init test
    db.buckets  # inspect the partition buckets populated above
    db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])

Local persistent volume

import numpy as np
from vector_lake import VectorLake

db = VectorLake(location="/mnt/db", dimension=5, approx_shards=243)
N = 100  # for example
D = 5  # Dimensionality of each vector
embeddings = np.random.rand(N, D)

for em in embeddings:
    db.add(em, metadata={}, document="some document")
db.persist()

db = VectorLake(location="/mnt/db", dimension=5, approx_shards=243)
# re-init test
db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])

Langchain Retrieval

from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from vector_lake.langchain import VectorLakeStore

loader = TextLoader("Readme.md")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = VectorLakeStore.from_documents(documents=docs, embedding=embedding)

query = "What is Vector Lake?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

Why VectorLake?

VectorLake gives you the functionality of a simple, resilient vector database with very easy setup and low operational overhead: a lightweight, reliable distributed vector store.

VectorLake leverages Hierarchical Navigable Small World (HNSW) for data partitioning across all vector data shards. This ensures that each modification to the system aligns with vector distance. You can learn more about the design here.
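
The exact partitioning code isn't shown here, but the core idea of distance-aligned sharding can be sketched as nearest-centroid routing (a deliberate simplification of HNSW-based partitioning; `assign_shard` and the random centroids are hypothetical, not VectorLake's API):

```python
import numpy as np

def assign_shard(vector: np.ndarray, centroids: np.ndarray) -> int:
    """Route a vector to the shard whose centroid is nearest (Euclidean distance)."""
    dists = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(42)
centroids = rng.random((8, 5))  # one representative point per shard (toy setup)
vec = rng.random(5)
shard = assign_shard(vec, centroids)
# Nearby vectors land on the same shard, so a query only needs to scan
# a handful of S3 files instead of the whole dataset.
```

Because routing depends only on vector distance, both writes and queries for similar vectors converge on the same small set of shard files.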

Limitations

TBD

🛠️ Roadmap

👋 Contributing

Contributions to VectorLake are welcome! If you'd like to contribute, please follow these steps:

  • Fork the repository on GitHub
  • Create a new branch for your changes
  • Commit your changes to the new branch
  • Push your changes to the forked repository
  • Open a pull request to the main VectorLake repository

Before contributing, please read the contributing guidelines.

Development (uv)

VectorLake uses uv for dependency management.

# install with dev tools; add --extra s3 if you need S3 support locally
uv sync --extra dev

# run tests via uv
uv run pytest

# update the lockfile after dependency changes
uv lock

License

VectorLake is released under the MIT License.
