Couchbase haystack integration

Maintainers

cb-devadvocates

couchbase-haystack

A Haystack Document Store for Couchbase.

Table of Contents

Overview
Installation
Usage
License

Overview

An integration of the Couchbase NoSQL database with Haystack v2.0 by deepset. Couchbase's Vector Search index is used for indexing document embeddings and for dense retrieval.

The library allows using Couchbase as a DocumentStore and implements the required Protocol methods. You can start working with the implementation by importing it from the couchbase_haystack package:

from couchbase_haystack import CouchbaseSearchDocumentStore

In addition to CouchbaseSearchDocumentStore, the library includes the following Haystack component, which can be used in a pipeline:

CouchbaseSearchEmbeddingRetriever - a typical retriever component that queries the vector search index to find related Documents. The component uses CouchbaseSearchDocumentStore to query embeddings.

The couchbase-haystack library uses the Couchbase Python driver (SDK).

CouchbaseSearchDocumentStore will store Documents as JSON documents in Couchbase. Embeddings are stored as part of the document, with indexing and querying of vector embeddings managed by Couchbase's dedicated Vector Search Index. The document store supports both scope-level and global-level vector search indexes:

Scope-level indexes (default): the vector search index is created at the scope level and only searches documents within that scope.
Global-level indexes: the vector search index is created at the bucket level and can search across all scopes and collections in the bucket.

                                   +-----------------------------+
                                   |       Couchbase Database    |
                                   +-----------------------------+
                                   |                             |
                                   |      +----------------+     |
                                   |      |  Data service  |     |
                write_documents    |      +----------------+     |
          +------------------------+----->|   properties   |     |
          |                        |      |                |     |
+---------+--------------------+   |      |   embedding    |     |
|                              |   |      +--------+-------+     |
| CouchbaseSearchDocumentStore |   |               |             |
|                              |   |               |index        |
+---------+--------------------+   |               |             |
          |                        |      +--------+--------+    |
          |                        |      |  Search service |    |
          |                        |      +-----------------+    |
          +----------------------->|      |       FTS       |    |
               query_embeddings    |      |   Vector Index  |    |
                                   |      | (for embedding) |    |
                                   |      +-----------------+    |
                                   |                             |
                                   +-----------------------------+

In the above diagram:

Data service: stores, updates, and retrieves documents by key. This is where the documents live as key-value entries.
properties: Document attributes stored as part of the Document.
embedding: also a property of the Document (shown separately in the diagram for clarity); it is a vector of type LIST[FLOAT].
Search service: hosts the indexes built for Full Text Search (FTS) and vector search. It allows efficient querying and retrieval based on both text content and vector embeddings.

CouchbaseSearchDocumentStore requires the vector index to be created manually, either through the SDK or the Couchbase Web Console UI. Before writing documents, make sure they have been embedded by one of the available embedders. For example, SentenceTransformersDocumentEmbedder can be used in an indexing pipeline to calculate document embeddings before writing them to Couchbase.
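As a rough illustration of that manual step, the sketch below builds a Search index definition that maps the embedding field as a vector. The key names follow the Couchbase Search index JSON format as it is typically exported from the Web Console, but the exact structure, the dot_product similarity, and the helper name build_vector_index_definition are assumptions; compare the result against an index definition exported from your own cluster before relying on it.

```python
import json


def build_vector_index_definition(
    index_name: str,
    bucket: str,
    scope: str,
    collection: str,
    dims: int = 384,
    similarity: str = "dot_product",
) -> dict:
    """Return a Search index definition with one vector field named 'embedding'.

    Hypothetical helper for illustration; it is not part of couchbase-haystack.
    """
    vector_field = {
        "name": "embedding",
        "type": "vector",
        "dims": dims,  # must match the embedding model's output size
        "similarity": similarity,
        "index": True,
    }
    return {
        "type": "fulltext-index",
        "name": index_name,
        "sourceType": "gocbcore",
        "sourceName": bucket,
        "params": {
            "doc_config": {"mode": "scope.collection.type_field", "type_field": "type"},
            "mapping": {
                # Disable the default mapping so only the chosen collection is indexed.
                "default_mapping": {"dynamic": True, "enabled": False},
                "types": {
                    f"{scope}.{collection}": {
                        "dynamic": True,
                        "enabled": True,
                        "properties": {
                            "embedding": {
                                "enabled": True,
                                "dynamic": False,
                                "fields": [vector_field],
                            }
                        },
                    }
                },
            },
        },
    }


definition = build_vector_index_definition(
    "vector_search_index",
    "haystack_bucket_name",
    "haystack_scope_name",
    "haystack_collection_name",
)
print(json.dumps(definition, indent=2))
```

Whichever way the index is created, the dims value has to equal the embedding size produced by the model used in the indexing pipeline.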

Installation

couchbase-haystack can be installed like any other Python library, using pip:

pip install --upgrade pip # optional
pip install sentence-transformers # required in order to run pipeline examples given below
pip install couchbase-haystack

Usage

Running Couchbase

You will need a running instance of Couchbase to use the components from this package. There are several options available:

Docker
Couchbase Cloud - a fully managed cloud service
Couchbase Server - installable on various operating systems

The simplest way to start the database locally is with a Docker container:

docker run \
    --restart always \
    --publish=8091-8096:8091-8096 --publish=11210:11210 \
    --env COUCHBASE_ADMINISTRATOR_USERNAME=admin \
    --env COUCHBASE_ADMINISTRATOR_PASSWORD=passw0rd \
    couchbase:enterprise-7.6.2

In this example, the container is started using Couchbase Server version 7.6.2. The COUCHBASE_ADMINISTRATOR_USERNAME and COUCHBASE_ADMINISTRATOR_PASSWORD environment variables set the default credentials for authentication.

Note:
Assuming you have a Docker container running, navigate to http://localhost:8091 to open the Couchbase Web Console and explore your data.

Document Store

Once you have the package installed and the database running, you can start using CouchbaseSearchDocumentStore like any other document store that supports embeddings.

from couchbase_haystack import CouchbaseSearchDocumentStore, CouchbasePasswordAuthenticator
from haystack.utils.auth import Secret

document_store = CouchbaseSearchDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
      username=Secret.from_env_var("CB_USERNAME"),
      password=Secret.from_env_var("CB_PASSWORD")
    ),
    bucket = "haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index = "vector_search_index",
    is_global_level_index=False  # False (the default) selects a scope-level vector search index
)
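The Secret.from_env_var calls above take the connection details from environment variables. A minimal sketch of a pre-flight check follows; the helper name missing_env_vars and the sample values (couchbase://localhost plus the credentials from the Docker example) are assumptions for a local setup.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = ("CB_CONNECTION_STRING", "CB_USERNAME", "CB_PASSWORD")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Example values for the local Docker container (assumed, adjust to your setup).
# setdefault leaves any value that is already exported untouched.
os.environ.setdefault("CB_CONNECTION_STRING", "couchbase://localhost")
os.environ.setdefault("CB_USERNAME", "admin")
os.environ.setdefault("CB_PASSWORD", "passw0rd")

missing = missing_env_vars(REQUIRED_VARS, os.environ)
if missing:
    raise RuntimeError(f"Set these environment variables first: {', '.join(missing)}")
print("Couchbase connection settings found")
```

Running a check like this before constructing the document store turns a missing variable into an explicit error message rather than a failure somewhere inside the connection code.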

Assuming you have a list of documents and a running Couchbase database, you can write (index) them in Couchbase, e.g.:

from haystack import Document

documents = [Document(content="Alice has been living in New York City for the past 5 years.")]

document_store.write_documents(documents)

If you intend to obtain embeddings before writing documents use the following code:

from haystack import Document

# import one of the available document embedders
from haystack.components.embedders import SentenceTransformersDocumentEmbedder 

documents = [Document(content="Alice has been living in New York City for the past 5 years.")]

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_embedder.warm_up() # will download the model during first run
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"))

Make sure the embedding model produces vectors of the same size as the one configured on the Couchbase vector index. For example, an index dimension of 384 matches the "sentence-transformers/all-MiniLM-L6-v2" model.
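A size mismatch can be caught before anything is written. The sketch below is a hypothetical helper (not part of couchbase-haystack) that works on plain lists of floats; with Haystack documents you would pass [doc.embedding for doc in documents].

```python
from typing import List, Optional, Sequence


def find_dimension_mismatches(
    embeddings: Sequence[Optional[Sequence[float]]], expected_dim: int
) -> List[int]:
    """Return positions whose embedding is missing or has the wrong length."""
    return [
        position
        for position, embedding in enumerate(embeddings)
        if embedding is None or len(embedding) != expected_dim
    ]


# One valid vector, one that is too short, and one document without an embedding.
embeddings = [[0.1] * 384, [0.2] * 383, None]
bad = find_dimension_mismatches(embeddings, expected_dim=384)
print(bad)  # [1, 2]
```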

Note: Most of the time you will be using Haystack Pipelines to build both indexing and querying RAG scenarios.

It is important to understand how haystack Documents are stored in Couchbase after you call write_documents.

from random import random

from haystack import Document

sample_embedding = [random() for _ in range(384)]  # using fake/random embedding for brevity here to simplify example
document = Document(
    content="Alice has been living in New York City for the past 5 years.", embedding=sample_embedding, meta={"num_of_years": 5}
)
document.to_dict()

The above code converts a Document to a dictionary and will render the following output:

>>> output:
{
    "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d",
    "content": "Alice has been living in New York City for the past 5 years.",
    "dataframe": None,
    "blob": None,
    "score": None,
    "embedding": [0.025010755222666936, 0.27502931836911926, 0.22321073814882275, ...], # vector of size 384
    "num_of_years": 5,
}

The data from the dictionary is used to create a document in Couchbase when you write it with document_store.write_documents([document]). You can then inspect it in the Couchbase Web Console or query it with SQL++, Couchbase's query language. Below is the resulting JSON document in Couchbase:

{
  "id": "11c255ad10bff4286781f596a5afd9ab093ed056d41bca4120c849058e52f24d",
  "embedding": [0.6394268274307251, 0.02501075528562069,0.27502933144569397, ...], // vector of size 384
  "content": "Alice has been living in New York City for the past 5 years.",
  "meta": {
    "num_of_years": 5
  }
}
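Comparing the two outputs, the flattened dictionary keeps metadata at the top level, while the stored JSON nests it under meta and leaves out the empty fields. The sketch below reproduces that reshaping for illustration only; the function name and the key lists are assumptions derived from the two examples, not the library's actual serialization code, and the id is shortened for readability.

```python
from typing import Any, Dict

# Keys that stay at the top level of the stored JSON (taken from the example above).
TOP_LEVEL_KEYS = ("id", "content", "embedding")
# Standard Document fields that are empty in the example and absent from the stored JSON.
OPTIONAL_KEYS = ("dataframe", "blob", "score")


def to_stored_shape(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Nest non-standard keys of a flattened Document dict under 'meta'."""
    stored = {key: flat[key] for key in TOP_LEVEL_KEYS if flat.get(key) is not None}
    stored["meta"] = {
        key: value
        for key, value in flat.items()
        if key not in TOP_LEVEL_KEYS and key not in OPTIONAL_KEYS
    }
    return stored


flat = {
    "id": "11c255ad",  # shortened id, illustrative only
    "content": "Alice has been living in New York City for the past 5 years.",
    "dataframe": None,
    "blob": None,
    "score": None,
    "embedding": [0.1, 0.2, 0.3],
    "num_of_years": 5,
}
print(to_stored_shape(flat))
```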

The full list of parameters accepted by CouchbaseSearchDocumentStore can be found in API documentation.

Indexing documents

With Haystack you can use the DocumentWriter component to write Documents into a Document Store. In the example below we construct a pipeline that writes documents to Couchbase using CouchbaseSearchDocumentStore:

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.utils.auth import Secret

from couchbase_haystack import CouchbaseSearchDocumentStore, CouchbasePasswordAuthenticator

documents = [Document(content="This is document 1"), Document(content="This is document 2")]

document_store = CouchbaseSearchDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
      username=Secret.from_env_var("CB_USERNAME"),
      password=Secret.from_env_var("CB_PASSWORD")
    ),
    bucket = "haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index = "vector_search_index"
)
embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=embedder, name="embedder")
indexing_pipeline.add_component(instance=document_writer, name="writer")

indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"embedder": {"documents": documents}})

>>> output:
{'writer': {'documents_written': 2}}

Retrieving documents

The CouchbaseSearchEmbeddingRetriever component can be used to retrieve documents from Couchbase by querying the vector index with an embedded query. Below is a pipeline that finds documents using a query embedding:

from typing import List
from haystack.utils.auth import Secret
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder

from couchbase_haystack.document_store import CouchbaseSearchDocumentStore, CouchbasePasswordAuthenticator
from couchbase_haystack.component.retriever import CouchbaseSearchEmbeddingRetriever

document_store = CouchbaseSearchDocumentStore(
    cluster_connection_string=Secret.from_env_var("CB_CONNECTION_STRING"),
    authenticator=CouchbasePasswordAuthenticator(
      username=Secret.from_env_var("CB_USERNAME"),
      password=Secret.from_env_var("CB_PASSWORD")
    ),
    bucket = "haystack_bucket_name",
    scope="haystack_scope_name",
    collection="haystack_collection_name",
    vector_search_index = "vector_search_index"
)

documents = [
    Document(content="Alice has been living in New York City for the past 5 years.", meta={"num_of_years": 5, "city": "New York"}),
    Document(content="John moved to Los Angeles 2 years ago and loves the sunny weather.", meta={"num_of_years": 2, "city": "Los Angeles"}),
]

# Same model is used for both query and Document embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"

document_embedder = SentenceTransformersDocumentEmbedder(model=model_name)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"))

print("Number of documents written: ", document_store.count_documents())

pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model_name))
pipeline.add_component("retriever", CouchbaseSearchEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "top_k": 5
        },
    }
)

documents: List[Document] = result["retriever"]["documents"]

>>> output:
[
    Document(id=3e35fa03aff6e3c45e6560f58adc4fde3c436c111a8809c30133b5cb492e8694, content: 'Alice has been living in New York City for the past 5 years.', meta: {'num_of_years': 5, 'city': 'New York'}, score: 0.36796408891677856, embedding: vector of size 384),
    Document(id=ca4d7d7d7ff6c13b950a88580ab134b2dc15b48a47b8f571a46b354b5344e5fa, content: 'John moved to Los Angeles 2 years ago and loves the sunny weather.', meta: {'num_of_years': 2, 'city': 'Los Angeles'}, score: 0.3126790523529053, embedding: vector of size 384)
]
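Conceptually, the retriever asks the vector index for the top_k stored embeddings closest to the query embedding. The brute-force sketch below illustrates that ranking with cosine similarity on toy 3-dimensional vectors. It is not how Couchbase computes results (the Search service uses its own index structures and the similarity metric configured on the index), and all names in it are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query: Sequence[float], stored: Dict[str, Sequence[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the best top_k."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by a made-up document id.
stored_embeddings = {
    "alice": [0.9, 0.1, 0.0],
    "john": [0.7, 0.7, 0.0],
    "other": [0.0, 0.0, 1.0],
}
ranked = top_k_by_similarity([1.0, 0.0, 0.0], stored_embeddings, top_k=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

As in the pipeline output above, the results come back ordered from the highest score to the lowest, and top_k caps how many are returned.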

More examples

You can find more examples in the implementation repository:

indexing_pipeline.py - Indexing text files (documents) from a remote HTTP location.
rag_pipeline.py - Generative question answering RAG pipeline that uses CouchbaseSearchEmbeddingRetriever to fetch documents from the Couchbase document store and answers questions using HuggingFaceAPIGenerator.

License

couchbase-haystack is distributed under the terms of the MIT license.

Release history

2.1.1 (Mar 12, 2026)
2.1.0 (Oct 27, 2025)
2.0.0 (Feb 28, 2025), this version
1.0.0 (Feb 21, 2025)
0.1.0 (Dec 19, 2024)
0.0.6 (Aug 22, 2024)

Download files

Source Distribution

couchbase_haystack-2.0.0.tar.gz (179.0 kB), uploaded Feb 28, 2025, Source

Built Distribution

couchbase_haystack-2.0.0-py3-none-any.whl (23.0 kB), uploaded Feb 28, 2025, Python 3

File details

Details for the file couchbase_haystack-2.0.0.tar.gz.

File metadata

Download URL: couchbase_haystack-2.0.0.tar.gz
Upload date: Feb 28, 2025
Size: 179.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for couchbase_haystack-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`9fec86fd1404453736a46f5528d8cda87e972c80f4dc51db1413373c1b2feca4`
MD5	`d92f46b45c28769542611d8c42873787`
BLAKE2b-256	`31e3dbd94cc9fadd61fd08f88ec6e5a76f37a39d00cfd64edfe2ed703b9301b0`

Provenance

The following attestation bundles were made for couchbase_haystack-2.0.0.tar.gz:

Publisher: realease.yml on Couchbase-Ecosystem/couchbase-haystack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: couchbase_haystack-2.0.0.tar.gz
- Subject digest: 9fec86fd1404453736a46f5528d8cda87e972c80f4dc51db1413373c1b2feca4
- Sigstore transparency entry: 175340590
- Sigstore integration time: Feb 28, 2025
Source repository:
- Permalink: Couchbase-Ecosystem/couchbase-haystack@b43fceecf66344f092a89ac2ea702a8fc7ac9959
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/Couchbase-Ecosystem
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: realease.yml@b43fceecf66344f092a89ac2ea702a8fc7ac9959
- Trigger Event: push

File details

Details for the file couchbase_haystack-2.0.0-py3-none-any.whl.

File metadata

Download URL: couchbase_haystack-2.0.0-py3-none-any.whl
Upload date: Feb 28, 2025
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for couchbase_haystack-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fd52bd47122a0dee103d3b1c30915702055f0d7554ee0bdd87f723e6ca5a9bd3`
MD5	`6b8e4899efd7b8a8c3d56f1bceb8a208`
BLAKE2b-256	`752594870f9de32754959c200a5e2e7c0e11d00d78d908f4769c6dd90436f759`

Provenance

The following attestation bundles were made for couchbase_haystack-2.0.0-py3-none-any.whl:

Publisher: realease.yml on Couchbase-Ecosystem/couchbase-haystack

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: couchbase_haystack-2.0.0-py3-none-any.whl
- Subject digest: fd52bd47122a0dee103d3b1c30915702055f0d7554ee0bdd87f723e6ca5a9bd3
- Sigstore transparency entry: 175340591
- Sigstore integration time: Feb 28, 2025
Source repository:
- Permalink: Couchbase-Ecosystem/couchbase-haystack@b43fceecf66344f092a89ac2ea702a8fc7ac9959
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/Couchbase-Ecosystem
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: realease.yml@b43fceecf66344f092a89ac2ea702a8fc7ac9959
- Trigger Event: push
