langchain-vastdb

LangChain VectorStore integration for VAST Database.

langchain-vastdb provides a VastDBVectorStore class that implements the LangChain VectorStore interface, enabling similarity search, document storage, and retrieval-augmented generation (RAG) workflows backed by VAST Database's native vector indexing.

Compatibility: Python 3.10 - 3.13 | langchain-core >= 1.0, < 2 | vastdb >= 2.0.3

Status: Alpha (v0.0.1). API may change between minor releases.

License: Apache-2.0

Requirements

  • Python 3.10 to 3.13
  • A running VAST Database cluster with vector index support
  • vastdb SDK >= 2.0.3
  • langchain-core >= 1.0, < 2
  • An Embeddings model (e.g., OpenAI, HuggingFace, or any LangChain-compatible embeddings)

Installation

pip install langchain-vastdb

Or with uv:

uv add langchain-vastdb

Quickstart

Option 1: Pass a pre-built session

import vastdb
from langchain_vastdb import VastDBVectorStore

session = vastdb.connect(
    endpoint="http://vast-cluster:8070",
    access="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)

store = VastDBVectorStore(
    embedding=my_embeddings,
    session=session,
    bucket="my-bucket",
    schema="my-schema",
    table_name="my-table",
)

# Add documents and search
ids = store.add_texts(["Paris is the capital of France."])
results = store.similarity_search("capital city", k=1)
print(results[0].page_content)
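The Quickstart snippets assume that my_embeddings is already defined. For wiring tests without an external embeddings provider, a deterministic stand-in can be sketched as follows. HashingEmbeddings is a hypothetical helper, not part of langchain-vastdb, and its vectors carry no semantic meaning.

```python
import hashlib
import math

try:
    # Use the real base class when langchain-core is installed.
    from langchain_core.embeddings import Embeddings
except ImportError:
    # Fall back to duck typing so the sketch also runs standalone.
    Embeddings = object


class HashingEmbeddings(Embeddings):
    """Toy embeddings: hash each token into one of `dim` buckets, then L2-normalize."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Empty input has zero norm; return the zero vector unchanged.
        return [v / norm for v in vec] if norm else vec

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)


my_embeddings = HashingEmbeddings(dim=16)
```

An object like this can be passed as the embedding argument while testing connectivity; swap in a real model before relying on search quality.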

Option 2: Use the convenience factory

from langchain_vastdb import VastDBVectorStore

store = VastDBVectorStore.from_connection_params(
    embedding=my_embeddings,
    endpoint="http://vast-cluster:8070",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    bucket="my-bucket",
    schema="my-schema",
    table_name="my-table",
)

Credentials are passed directly to vastdb.connect() and are not stored on the instance.
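To keep keys out of source code, the connection parameters can be read from the environment before calling the factory. This is a minimal sketch: the helper name connection_params_from_env and the VASTDB_* variable names are assumptions made for illustration, not conventions defined by langchain-vastdb or the vastdb SDK.

```python
import os


def connection_params_from_env(prefix: str = "VASTDB_") -> dict[str, str]:
    """Collect endpoint and keys from environment variables, failing early if any is missing."""
    # Maps from_connection_params keyword names to (hypothetical) variable suffixes.
    names = {
        "endpoint": "ENDPOINT",
        "access_key": "ACCESS_KEY",
        "secret_key": "SECRET_KEY",
    }
    params: dict[str, str] = {}
    missing: list[str] = []
    for param, suffix in names.items():
        value = os.environ.get(prefix + suffix)
        if value:
            params[param] = value
        else:
            missing.append(prefix + suffix)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return params
```

The returned dict uses the same keyword names as from_connection_params, so it can be unpacked with ** alongside embedding, bucket, schema, and table_name.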

Option 3: Create a store and add texts in one call

import vastdb
from langchain_vastdb import VastDBVectorStore

session = vastdb.connect(
    endpoint="http://vast-cluster:8070",
    access="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)

store = VastDBVectorStore.from_texts(
    texts=["Paris is the capital of France.", "Berlin is the capital of Germany."],
    embedding=my_embeddings,
    session=session,
    bucket="my-bucket",
    schema="my-schema",
    table_name="my-table",
)

CRUD Operations

# Add documents with metadata
ids = store.add_texts(
    ["Some text", "More text"],
    metadatas=[{"source": "wiki"}, {"source": "blog"}],
)

# Similarity search by text query
docs = store.similarity_search("capital city", k=2)

# Similarity search with distance scores
scored = store.similarity_search_with_score("capital city", k=2)
for doc, score in scored:
    print(f"{doc.page_content} (distance: {score})")

# Search with a pre-computed vector
query_vector = my_embeddings.embed_query("capital city")
docs = store.similarity_search_by_vector(query_vector, k=2)

# Retrieve documents by ID
docs = store.get_by_ids(ids)

# Delete by ID
store.delete(ids=ids)

Using as a retriever

VastDBVectorStore integrates directly with LangChain's retriever interface:

retriever = store.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("What is the capital of France?")

The retriever also composes with other runnables in a LangChain Expression Language (LCEL) RAG chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever = store.as_retriever(search_kwargs={"k": 3})
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm  # any LangChain-compatible LLM
    | StrOutputParser()
)
answer = chain.invoke("What is the capital of France?")
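To make the data flow explicit, the chain above can be read as three steps: retrieve documents, format them into the prompt, and generate an answer. The sketch below spells that out in plain Python with stand-in callables. answer_question, retrieve, generate, and Doc are hypothetical names used only for illustration; this is not how LCEL executes internally.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is used here)."""

    page_content: str


PROMPT_TEMPLATE = "Answer based on context:\n{context}\n\nQuestion: {question}"


def answer_question(
    question: str,
    retrieve: Callable[[str], list[Doc]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, fill the prompt, and pass it to the generator."""
    docs = retrieve(question)
    context = "\n".join(d.page_content for d in docs)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate(prompt)
```

Passing fake retrieve and generate functions makes it possible to unit-test the prompt assembly without a cluster or an LLM.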

Cache management

VastDBVectorStore caches table metadata after the first access to avoid repeated bucket/schema/table round trips. If you alter the table structure externally, invalidate the cache:

store.invalidate_table_cache()
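The caching behavior described above follows a common load-once, invalidate-on-demand pattern. The sketch below illustrates that pattern in isolation. LazyCache is a hypothetical class written for this example and says nothing about how VastDBVectorStore implements its cache internally.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyCache(Generic[T]):
    """Load a value on first access and reuse it until invalidate() is called."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Only call the (potentially expensive) loader when nothing is cached.
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value  # type: ignore[return-value]

    def invalidate(self) -> None:
        # Drop the cached value so the next get() reloads it.
        self._value = None
        self._loaded = False
```

The trade-off is the usual one: fewer round trips in exchange for stale data if the underlying object changes without an explicit invalidation.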

Configuration Reference

Constructor: VastDBVectorStore(...)

Parameter         Type            Default     Description
embedding         Embeddings      required    The embeddings model used to generate vectors.
session           vastdb.Session  required    A pre-built session connected to the VAST cluster.
bucket            str             required    The VAST bucket name containing the target table.
schema            str             required    The schema name within the bucket.
table_name        str             required    The table name for vector operations.
id_column         str             "id"        Column name for document IDs.
text_column       str             "text"      Column name for document text.
vector_column     str             "vector"    Column name for embedding vectors.
metadata_column   str             "metadata"  Column name for document metadata (stored as JSON).
adbc_driver_path  str | None      None        Path to libadbc_driver_vastdb.so. Enables native ADBC vector search via array_distance() SQL.
adbc_endpoint     str | None      None        ADBC/QueryEngine endpoint (hostname or IP). Separate from the HTTP REST endpoint.
access_key        str | None      None        Access key for ADBC connection.
secret_key        str | None      None        Secret key for ADBC connection.

Custom column names

Column names default to id, text, vector, and metadata. Override them at construction time:

store = VastDBVectorStore(
    embedding=my_embeddings,
    session=session,
    bucket="my-bucket",
    schema="my-schema",
    table_name="my-table",
    id_column="doc_id",
    text_column="content",
    vector_column="emb",
    metadata_column="meta",
)

Factory classmethod: from_connection_params(...)

Creates a VastDBVectorStore by building a vastdb.Session internally from connection parameters.

Parameter         Type            Default     Description
embedding         Embeddings      required    The embeddings model.
endpoint          str             required    The VAST cluster HTTP endpoint URL.
access_key        str             required    Access key for authentication.
secret_key        str             required    Secret key for authentication.
bucket            str             required    The VAST bucket name.
schema            str             required    The schema name within the bucket.
table_name        str             required    The table name for vector operations.
adbc_driver_path  str | None      None        Path to ADBC driver shared library.
adbc_endpoint     str | None      None        ADBC/QueryEngine endpoint.
**kwargs                                      Additional keyword arguments forwarded to the constructor (e.g., custom column names).

ADBC vector search

When adbc_driver_path and adbc_endpoint are both provided, the store uses native ADBC SQL with array_distance() for server-side vector search. This path does not require a vector index on the table. If ADBC is unavailable or fails, the store falls back to an in-memory scan that ranks rows by squared L2 (L2Sq) distance.

store = VastDBVectorStore(
    embedding=my_embeddings,
    session=session,
    bucket="my-bucket",
    schema="my-schema",
    table_name="my-table",
    adbc_driver_path="/usr/lib/libadbc_driver_vastdb.so",
    adbc_endpoint="query-engine.example.com",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
)
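For intuition about the in-memory fallback mentioned above, a brute-force squared L2 scan can be sketched as follows. The function names l2sq and brute_force_top_k and the row layout (dicts with a "vector" key, matching the default column name) are assumptions for illustration; the library's own fallback may be implemented differently.

```python
import heapq


def l2sq(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance; skips the square root, which preserves ranking."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_top_k(
    query: list[float], rows: list[dict], k: int
) -> list[tuple[dict, float]]:
    """Score every row against the query and keep the k smallest distances."""
    scored = ((row, l2sq(query, row["vector"])) for row in rows)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])
```

A scan like this touches every row, so its cost grows linearly with table size, which is why a server-side search path is preferable for large tables.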

Subclassing Guide

VastDBVectorStore uses the Template Method pattern. Public methods such as add_texts and similarity_search handle embedding, filter conversion, and result formatting, then delegate storage operations to seven protected hook methods. Override these hooks to customize behavior without reimplementing the full LangChain interface.
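As a generic illustration of the Template Method pattern described above, the toy classes below keep the workflow in a public method and push storage into an overridable hook. BaseStore and AuditedStore are hypothetical stand-ins written for this sketch; they do not reproduce VastDBVectorStore's real signatures or behavior.

```python
import uuid


class BaseStore:
    """Toy base class: the public method owns the workflow, the hook owns storage."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def add_texts(self, texts: list[str]) -> list[str]:
        # Template method: prepare IDs and vectors, then delegate to the hook.
        ids = [str(uuid.uuid4()) for _ in texts]
        embeddings = [[float(len(t))] for t in texts]  # placeholder "embedding"
        return self._insert_vectors(texts, embeddings, ids)

    def _insert_vectors(
        self, texts: list[str], embeddings: list[list[float]], ids: list[str]
    ) -> list[str]:
        # Default hook: store rows in a dict keyed by ID.
        for id_, text, emb in zip(ids, texts, embeddings):
            self.rows[id_] = {"text": text, "vector": emb}
        return ids


class AuditedStore(BaseStore):
    """Subclass that overrides only the hook to record which IDs were inserted."""

    def __init__(self) -> None:
        super().__init__()
        self.audit_log: list[str] = []

    def _insert_vectors(self, texts, embeddings, ids):
        inserted = super()._insert_vectors(texts, embeddings, ids)
        self.audit_log.extend(inserted)
        return inserted
```

Callers keep using add_texts unchanged; only the storage step differs between the two classes.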

Hook methods

Hook                     Purpose                                    Returns
_insert_vectors          Customize record insertion                 list[str] (IDs)
_build_metadata_columns  Customize column layout for metadata       dict[str, list]
_select_columns          Customize columns retrieved during search  list[str]
_vector_search           Customize similarity search                list[tuple[dict, float]]
_delete_by_ids           Customize document deletion                bool
_get_by_ids              Customize document retrieval               list[dict]
_row_to_document         Customize row-to-Document conversion       Document

Hook signatures

def _insert_vectors(
    self,
    texts: list[str],
    embeddings: list[list[float]],
    metadatas: list[dict],
    ids: list[str],
    *,
    tx: Transaction | None = None,
) -> list[str]: ...

def _vector_search(
    self,
    query_vector: list[float],
    k: int,
    predicate: ibis.Expr | None = None,
    *,
    filter_dict: dict | None = None,
    tx: Transaction | None = None,
) -> list[tuple[dict, float]]: ...

def _delete_by_ids(
    self,
    ids: list[str],
    *,
    tx: Transaction | None = None,
) -> bool: ...

def _get_by_ids(
    self,
    ids: list[str],
    *,
    tx: Transaction | None = None,
) -> list[dict]: ...

def _row_to_document(
    self,
    row: dict,
    score: float | None = None,
) -> Document: ...

Transaction reuse

Each hook opens and closes its own transaction by default. The optional tx parameter lets subclasses pass in an existing transaction for multi-step atomic operations:

with self._session.transaction() as tx:
    self._insert_vectors(texts, embeddings, metadatas, ids, tx=tx)
    # additional operations in the same transaction
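The "use the caller's transaction or open a new one" behavior can be sketched with contextlib. FakeSession, use_or_open_tx, and insert are hypothetical names for this example, and the fake session only counts how many transactions are opened; the sketch does not model the real vastdb transaction API.

```python
from contextlib import contextmanager, nullcontext


class FakeSession:
    """Stand-in session that counts how many transactions were opened."""

    def __init__(self) -> None:
        self.opened = 0

    @contextmanager
    def transaction(self):
        self.opened += 1
        yield f"tx-{self.opened}"


def use_or_open_tx(session, tx=None):
    """Return a context manager yielding `tx` if given, else a fresh transaction."""
    # nullcontext leaves an externally owned transaction open on exit.
    return nullcontext(tx) if tx is not None else session.transaction()


def insert(session, item, log, *, tx=None):
    """Record (transaction, item); opens its own transaction unless one is passed."""
    with use_or_open_tx(session, tx) as active:
        log.append((active, item))
```

Wrapping the caller's transaction in nullcontext means the hook never closes a transaction it did not open, which is what allows several steps to share one.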

Example: typed metadata columns

The base class stores metadata as a single JSON string column. If you need typed columns for performance-critical filtering, set _typed_metadata_columns:

from langchain_vastdb import TypedColumn, VastDBVectorStore


class TypedMetadataStore(VastDBVectorStore):
    """Store with typed 'category' and 'priority' metadata columns."""

    _typed_metadata_columns = {
        "category": TypedColumn(),
        "priority": TypedColumn(),
    }

This automatically extracts category and priority into separate typed columns on insert, preserves any extra metadata in the JSON column, and merges everything back together on read. The public LangChain interface (add_texts, similarity_search, etc.) stays unchanged.

Use TypedColumn fields for custom defaults, PyArrow type coercion, or controlling which columns are backfilled on read (see the Migration Guide for details).
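The extract-on-insert and merge-on-read behavior described above can be pictured with two small helpers. This is a conceptual sketch only: split_metadata and merge_metadata are hypothetical functions, and the handling of missing keys (stored as None, skipped on read) is an assumption rather than documented library behavior.

```python
import json


def split_metadata(metadata: dict, typed_keys: set[str]) -> tuple[dict, str]:
    """Pull typed keys into their own dict; serialize everything else as JSON."""
    typed = {key: metadata.get(key) for key in typed_keys}
    extras = {k: v for k, v in metadata.items() if k not in typed_keys}
    return typed, json.dumps(extras, sort_keys=True)


def merge_metadata(typed: dict, extras_json: str) -> dict:
    """Rebuild a single metadata dict from typed columns plus the JSON column."""
    merged = json.loads(extras_json) if extras_json else {}
    # Assumption: a None typed value means the key was absent on insert.
    merged.update({k: v for k, v in typed.items() if v is not None})
    return merged
```

Round-tripping a dict through both helpers returns the original metadata, which mirrors why the public interface can stay unchanged.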

Examples

See the examples/ directory for runnable scripts:

  • basic_usage.py -- add texts, search, retrieve
  • rag_pipeline.py -- as_retriever() + LCEL RAG chain
  • subclassing.py -- declarative typed metadata columns
  • filtered_search.py -- metadata filtering patterns

Migration Guide

Migrating an existing VectorStore subclass to VastDBVectorStore? See the Migration Guide for step-by-step instructions, a hook mapping table, and a before/after code comparison.

Development

Clone the repository and install dependencies with uv:

uv sync

Run the linter:

uv run ruff check .

Run unit tests:

uv run pytest tests/unit_tests/

Run integration tests (requires a VAST cluster):

uv run pytest tests/integration_tests/

License

Apache-2.0 -- see LICENSE for details.
