
Integration of Neo4j graph database with Haystack


neo4j-haystack

A Haystack Document Store for Neo4j.




Overview

An integration of the Neo4j graph database with Haystack v2.0 by deepset. Neo4j's vector search index is used to index document embeddings and to perform dense retrieval.

The library allows using Neo4j as a DocumentStore and implements the required Protocol methods. You can start working with the implementation by importing it from the neo4j_haystack package:

from neo4j_haystack import Neo4jDocumentStore

In addition to Neo4jDocumentStore, the library includes the following Haystack components, which can be used in a pipeline:

  • Neo4jEmbeddingRetriever is a typical retriever component which queries the vector index to find related Documents. The component uses Neo4jDocumentStore to query embeddings.
  • Neo4jDynamicDocumentRetriever is also a retriever component, in the sense that it can be used to query Documents in Neo4j. However, it is decoupled from Neo4jDocumentStore and can run an arbitrary Cypher query to extract documents. In practice it can query Neo4j the same way Neo4jDocumentStore does, including vector search.

The neo4j-haystack library uses the Neo4j Python Driver and Cypher queries to interact with the Neo4j database, hiding that complexity under the hood.

Neo4jDocumentStore stores Documents as graph nodes in Neo4j. Embeddings are stored as a property of the node, while indexing and approximate nearest neighbor (ANN) querying of the embeddings are handled by a dedicated Vector Index.

                                   +-----------------------------+
                                   |       Neo4j Database        |
                                   +-----------------------------+
                                   |                             |
                                   |      +----------------+     |
                                   |      |    Document    |     |
                write_documents    |      +----------------+     |
          +------------------------+----->|   properties   |     |
          |                        |      |                |     |
+---------+----------+             |      |   embedding    |     |
|                    |             |      +--------+-------+     |
| Neo4jDocumentStore |             |               |             |
|                    |             |               |index/query  |
+---------+----------+             |               |             |
          |                        |      +--------+--------+    |
          |                        |      |  Vector Index   |    |
          +----------------------->|      |                 |    |
               query_embeddings    |      | (for embedding) |    |
                                   |      +-----------------+    |
                                   |                             |
                                   +-----------------------------+

In the above diagram:

  • Document is a Neo4j node (with the "Document" label).
  • properties are Document attributes stored as part of the node.
  • embedding is also a property of the Document node (shown separately in the diagram for clarity); it is a vector of type LIST[FLOAT].
  • Vector Index is where Neo4j indexes embeddings as soon as they are updated in Document nodes.
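
To make the role of the Vector Index more concrete, the sketch below builds the kind of Cypher statement that Neo4j 5.15 accepts for creating a vector index over the embedding property of Document nodes. The helper name build_vector_index_statement and the cosine similarity default are assumptions made for illustration. The library manages the index on its own, and the statement it actually issues may differ.

```python
def build_vector_index_statement(
    index: str,
    node_label: str,
    embedding_field: str,
    embedding_dim: int,
    similarity: str = "cosine",
) -> str:
    """Return a Cypher statement creating a vector index (Neo4j 5.15+ syntax).

    Hypothetical helper for illustration; not part of neo4j-haystack.
    """
    if similarity not in ("cosine", "euclidean"):
        raise ValueError(f"Unsupported similarity function: {similarity}")
    if embedding_dim <= 0:
        raise ValueError("embedding_dim must be positive")
    # Backticks quote identifiers, e.g. index names that contain dashes.
    return (
        f"CREATE VECTOR INDEX `{index}` IF NOT EXISTS "
        f"FOR (n:`{node_label}`) ON (n.`{embedding_field}`) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {embedding_dim}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


if __name__ == "__main__":
    print(build_vector_index_statement("document-embeddings", "Document", "embedding", 384))
```

The values passed in the example mirror the index, node_label, embedding_field and embedding_dim settings used elsewhere in this document.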

Installation

neo4j-haystack can be installed like any other Python library, using pip:

pip install --upgrade pip # optional
pip install neo4j-haystack

Usage

Once installed, you can start using Neo4jDocumentStore like any other document store that supports embeddings.

from neo4j_haystack import Neo4jDocumentStore

document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    embedding_field="embedding",
    index="document-embeddings", # The name of the Vector Index in Neo4j
    node_label="Document", # Providing a label to Neo4j nodes which store Documents
)

Assuming a list of documents is available, you can write (index) them to Neo4j, e.g.:

documents: List[Document] = ...
document_store.write_documents(documents)

The full list of parameters accepted by Neo4jDocumentStore can be found in the API documentation.

Please note that you need a running instance of the Neo4j database (the in-memory version of Neo4j is not supported). Several options are available:

  • Docker (other installation options are described in the Neo4j Operations Manual)
  • AuraDB, a fully managed cloud instance of Neo4j
  • Neo4j Desktop client application

The simplest way to start the database locally is with a Docker container:

docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/passw0rd \
    neo4j:5.15.0
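
The container may take a few seconds before it accepts connections. As a rough sketch, the snippet below waits until the Bolt port published above (7687) accepts TCP connections, using only the Python standard library. The helper name wait_for_bolt_port is hypothetical. An open port only shows that something is listening; it does not guarantee that the database is ready to authenticate and serve queries.

```python
import socket
import time


def wait_for_bolt_port(host: str = "localhost", port: int = 7687, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of neo4j-haystack.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the port is open; close it right away.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(0.5)  # brief pause before retrying


if __name__ == "__main__":
    print("Bolt port reachable:", wait_for_bolt_port(timeout=5.0))
```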

Retrieving documents

The Neo4jEmbeddingRetriever component can be used to retrieve documents from Neo4j by querying the vector index with an embedded query. Below is a pipeline which finds documents using a query embedding as well as metadata filtering:

from typing import List

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from neo4j_haystack import Neo4jEmbeddingRetriever, Neo4jDocumentStore

model_name = "sentence-transformers/all-MiniLM-L6-v2"

document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    index="document-embeddings",
)

pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model_name_or_path=model_name))
pipeline.add_component("retriever", Neo4jEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = pipeline.run(
    data={
        "text_embedder": {"text": "Query to be embedded"},
        "retriever": {
            "top_k": 5,
            "filters": {"field": "release_date", "operator": "==", "value": "2018-12-09"},
        },
    }
)

documents: List[Document] = result["retriever"]["documents"]
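
The filters entry above is a dictionary with field, operator and value keys. To illustrate what such a condition means, here is a minimal pure-Python sketch that evaluates a filter against a document's metadata. It is only an illustration: the function name matches is hypothetical, only a few operators plus AND/OR groups are covered, and the library itself applies filters in Neo4j rather than in Python.

```python
from typing import Any, Dict

# Comparison operators. A missing field (None) never satisfies an ordering comparison.
_COMPARATORS = {
    "==": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
    ">": lambda actual, expected: actual is not None and actual > expected,
    ">=": lambda actual, expected: actual is not None and actual >= expected,
    "<": lambda actual, expected: actual is not None and actual < expected,
    "<=": lambda actual, expected: actual is not None and actual <= expected,
    "in": lambda actual, expected: actual in expected,
    "not in": lambda actual, expected: actual not in expected,
}


def matches(filters: Dict[str, Any], meta: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies the filter (illustrative only)."""
    operator = filters["operator"]
    if "conditions" in filters:
        # Logical group: combine the results of the nested conditions.
        results = (matches(condition, meta) for condition in filters["conditions"])
        if operator == "AND":
            return all(results)
        if operator == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {operator}")
    if operator not in _COMPARATORS:
        raise ValueError(f"Unsupported comparison operator: {operator}")
    return _COMPARATORS[operator](meta.get(filters["field"]), filters["value"])


if __name__ == "__main__":
    meta = {"release_date": "2018-12-09"}
    print(matches({"field": "release_date", "operator": "==", "value": "2018-12-09"}, meta))
```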

Retrieving documents using Cypher

Neo4jDynamicDocumentRetriever is a flexible retriever component which can run a Cypher query to obtain documents. The Neo4jEmbeddingRetriever example above could be rewritten without using Neo4jDocumentStore:

from typing import List

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder

from neo4j_haystack import Neo4jClientConfig, Neo4jDynamicDocumentRetriever

client_config = Neo4jClientConfig(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
)

cypher_query = """
            CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
            YIELD node as doc, score
            MATCH (doc) WHERE doc.release_date = $release_date
            RETURN doc{.*, score}, score
            ORDER BY score DESC LIMIT $top_k
        """

embedder = SentenceTransformersTextEmbedder(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2")
retriever = Neo4jDynamicDocumentRetriever(
    client_config=client_config, runtime_parameters=["query_embedding"], doc_node_name="doc"
)

pipeline = Pipeline()
pipeline.add_component("text_embedder", embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = pipeline.run(
    data={
        "text_embedder": {"text": "Query to be embedded"},
        "retriever": {
            "query": cypher_query,
            "parameters": {"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"},
        },
    }
)

documents: List[Document] = result["retriever"]["documents"]

Please note how query parameters are used in the cypher_query:

  • runtime_parameters is a list of parameter names which become input slots when connecting components in a pipeline. In this case the query_embedding input is connected to the text_embedder.embedding output.
  • pipeline.run passes additional parameters to the retriever component, which can be referenced in the cypher_query, e.g. top_k.
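
The two kinds of parameters end up in a single parameter map that the Cypher query can reference ($index, $top_k, $release_date, $query_embedding). The sketch below shows one way such a map could be assembled. The function name build_cypher_parameters is hypothetical, and letting runtime values take precedence on a name clash is an assumption, not documented library behavior.

```python
from typing import Any, Dict, List, Optional


def build_cypher_parameters(
    runtime_parameters: List[str],
    pipeline_inputs: Dict[str, Any],
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge static query parameters with values received through input slots.

    Hypothetical helper for illustration; not part of neo4j-haystack.
    """
    missing = [name for name in runtime_parameters if name not in pipeline_inputs]
    if missing:
        raise ValueError(f"Missing runtime parameters: {missing}")
    merged = dict(parameters or {})  # copy, so the caller's dict is not mutated
    for name in runtime_parameters:
        # Assumption: a runtime value overrides a static one with the same name.
        merged[name] = pipeline_inputs[name]
    return merged


if __name__ == "__main__":
    params = build_cypher_parameters(
        runtime_parameters=["query_embedding"],
        pipeline_inputs={"query_embedding": [0.1, 0.2, 0.3]},
        parameters={"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"},
    )
    print(sorted(params))
```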

License

neo4j-haystack is distributed under the terms of the MIT license.
