Haystack document store and retriever for DuckDB VSS

DuckDB Document Store for Haystack

Note: This project is a proof of concept; use it at your own risk. The code may contain bugs and security issues (such as SQL injection). Proceed with caution.

A DuckDB-backed document store for Haystack with HNSW vector search via DuckDB's VSS extension. It supports:

  • Dense embedding storage with HNSW indexing (cosine similarity, Euclidean distance, or inner product distance)
  • Filtering with Haystack-style filter dictionaries
  • In-memory operation or persistence via a DuckDB database file on disk
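
The three metrics named above can be written out in plain Python to show what each one measures. This is a minimal sketch for illustration only: the function names are hypothetical, and it says nothing about how DuckDB's VSS extension computes or indexes these values internally. The sample vectors are the ones used in the CRUD example in the Usage section.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; larger values mean more similar vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return inner_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller values mean more similar vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


doc_1 = [0.1, 0.0, 0.9]
doc_2 = [0.2, 0.1, 0.8]

print("inner product:     ", inner_product(doc_1, doc_2))
print("cosine similarity: ", cosine_similarity(doc_1, doc_2))
print("euclidean distance:", euclidean_distance(doc_1, doc_2))
```

Cosine similarity ignores vector length, so it is a common choice when embeddings are not normalized. For unit-length vectors, cosine similarity and inner product give the same ranking.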

Installation (GitHub)

Use uv to install directly from the repository:

uv pip install "duckdb-haystack @ git+https://github.com/AdrianoKF/duckdb-haystack.git"

Usage

1) DocumentStore CRUD example

from haystack import Document

from haystack_integrations.document_stores.duckdb import DuckDBDocumentStore

store = DuckDBDocumentStore(
    database=":memory:",
    embedding_dim=3,
    similarity_metric="cosine",
)

store.write_documents(
    [
        Document(id="doc-1", content="DuckDB is fast.", embedding=[0.1, 0.0, 0.9], meta={"source": "notes"}),
        Document(id="doc-2", content="Haystack pipelines are modular.", embedding=[0.2, 0.1, 0.8]),
    ]
)

print("Total document count:", store.count_documents())

filters = {"field": "meta.source", "operator": "==", "value": "notes"}
filtered = store.filter_documents(filters=filters)
print("Filtered documents:", [doc.id for doc in filtered])

store.delete_documents(document_ids=["doc-2"])
print("After deletion:", store.filter_documents())
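
To make the filter dictionary above more concrete, the following sketch evaluates Haystack-style filters against plain Python dictionaries. It is an illustration under stated assumptions, not the store's implementation: `matches` and `resolve_field` are hypothetical helpers, documents are modeled as plain dicts, and only comparison filters plus the `AND` and `OR` logical operators are covered.

```python
from typing import Any, Mapping

# Comparison operators supported by this sketch.
COMPARATORS = {
    "==": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
    ">": lambda actual, expected: actual > expected,
    ">=": lambda actual, expected: actual >= expected,
    "<": lambda actual, expected: actual < expected,
    "<=": lambda actual, expected: actual <= expected,
    "in": lambda actual, expected: actual in expected,
    "not in": lambda actual, expected: actual not in expected,
}
ORDERING = {">", ">=", "<", "<="}


def resolve_field(record: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'meta.source'; return None if a part is missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current


def matches(record: Mapping[str, Any], filters: Mapping[str, Any]) -> bool:
    """Return True if the record satisfies the filter dictionary."""
    operator = filters["operator"]
    if operator in ("AND", "OR"):
        results = (matches(record, cond) for cond in filters["conditions"])
        return all(results) if operator == "AND" else any(results)
    if operator not in COMPARATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    actual = resolve_field(record, filters["field"])
    # A missing value cannot be ordered, so ordering comparisons fail.
    if actual is None and operator in ORDERING:
        return False
    return COMPARATORS[operator](actual, filters["value"])


docs = [
    {"id": "doc-1", "content": "DuckDB is fast.", "meta": {"source": "notes"}},
    {"id": "doc-2", "content": "Haystack pipelines are modular.", "meta": {}},
]

simple = {"field": "meta.source", "operator": "==", "value": "notes"}
compound = {
    "operator": "OR",
    "conditions": [
        simple,
        {"field": "id", "operator": "in", "value": ["doc-2"]},
    ],
}

simple_ids = [d["id"] for d in docs if matches(d, simple)]
compound_ids = [d["id"] for d in docs if matches(d, compound)]
print(simple_ids)
print(compound_ids)
```

Nested `conditions` lists let one filter combine several comparisons, which is the part of the format that the single-comparison example does not show.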

2) Retrieval with DuckDBRetriever in a pipeline

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder

from haystack_integrations.document_stores.duckdb import DuckDBDocumentStore
from haystack_integrations.retrievers.duckdb import DuckDBRetriever

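# all-MiniLM-L6-v2 produces 384-dimensional embeddings; embedding_dim is set to match.
store = DuckDBDocumentStore(database=":memory:", embedding_dim=384)

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Load the model before calling run() on a component used outside a pipeline.
doc_embedder.warm_up()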
documents = [
    Document(content="DuckDB stores vectors in Float arrays backed by an HNSW index."),
    Document(content="DuckDB is an analytical in-process SQL database management system."),
    Document(content="Haystack offers composable pipelines."),
]
documents = doc_embedder.run(documents=documents)["documents"]
store.write_documents(documents)

query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = DuckDBRetriever(document_store=store)

pipeline = Pipeline()
pipeline.add_component("query_embedder", query_embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("query_embedder.embedding", "retriever.query_embedding")

result = pipeline.run(data={"query_embedder": {"text": "How does DuckDB store vectors?"}})

print(result["retriever"]["documents"][0].content)
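
Conceptually, the retrieval step ranks stored embeddings by their similarity to the query embedding and returns the best matches. The sketch below does this exactly, by scanning every vector. It is not how `DuckDBRetriever` works: an HNSW index avoids the full scan and returns approximate results. The `top_k` helper, the third document, and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], embeddings: dict[str, list[float]], k: int = 10
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = (
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings keyed by document id.
embeddings = {
    "doc-1": [0.1, 0.0, 0.9],
    "doc-2": [0.2, 0.1, 0.8],
    "doc-3": [0.9, 0.1, 0.0],
}
query = [0.0, 0.0, 1.0]

ranked = top_k(query, embeddings, k=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

A brute-force scan like this costs time proportional to the number of stored vectors per query, which is the cost an approximate index is meant to reduce.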

License

duckdb-haystack is distributed under the terms of the Apache-2.0 license.
