DataStax RAGStack Knowledge Store

RAGStack Knowledge Store

Hybrid Knowledge Store combining vector similarity and edges between chunks.

Usage

  1. Pre-process your documents to populate metadata information.
  2. Create a Hybrid KnowledgeStore and add your LangChain Documents.
  3. Retrieve documents from the KnowledgeStore.

Populate Metadata

The Knowledge Store makes use of the following metadata fields on each Document:

  • content_id: The unique ID of the Document. If not assigned, one will be generated. Set this if you may re-ingest the same document, so that it is overwritten rather than duplicated.
  • parent_content_id: If this Document is a chunk of a larger document, you may reference the parent content here.
  • keywords: A list of strings representing keywords present in this Document.
  • hrefs: A list of strings containing the URLs which this Document links to.
  • urls: A list of strings containing the URLs associated with this Document. If one webpage is divided into multiple chunks, each chunk's Document would have the same URL. One webpage may have multiple URLs if it is available in multiple ways.
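
As an illustration of how these fields fit together, the sketch below builds the metadata for two chunks of one web page. The helper name chunk_metadata, the "#chunk-N" ID scheme, and the use of the page URL as the parent content ID are assumptions made for this example, not part of the library. Plain dicts stand in for the metadata attribute of a LangChain Document.

```python
def chunk_metadata(page_url, chunk_index, keywords, hrefs):
    """Build the metadata dict for one chunk of a web page (hypothetical helper)."""
    return {
        # Stable ID so re-ingesting the same chunk overwrites it (assumed scheme).
        "content_id": f"{page_url}#chunk-{chunk_index}",
        # Every chunk points back at the page it was cut from (assumed parent ID).
        "parent_content_id": page_url,
        "keywords": sorted(set(keywords)),
        "hrefs": list(hrefs),
        # All chunks of one page share the page's URL.
        "urls": [page_url],
    }


page = "https://example.com/docs/install"
chunks = [
    chunk_metadata(page, 0, ["install", "pip"], ["https://example.com/docs/config"]),
    chunk_metadata(page, 1, ["upgrade", "pip"], []),
]
```

Both chunks carry the same urls entry, while each has its own content_id.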

Keywords

To link documents with common keywords, assign the keywords metadata of each Document.

There are various ways to assign keywords to each Document, such as TF-IDF across the documents. One easy option is to use KeyBERT.
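
For the TF-IDF route, a minimal standard-library sketch might look like the following. The function name tfidf_keywords, the simple regex tokenizer, and the sample texts are assumptions for illustration. It does no stop-word removal or stemming, so a production setup would more likely rely on an established library.

```python
import math
import re
from collections import Counter


def tfidf_keywords(texts, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each text."""
    tokenized = [re.findall(r"[a-z0-9]+", text.lower()) for text in texts]
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


texts = [
    "cassandra stores vectors and cassandra scales",
    "the retriever stores nothing and returns documents",
]
keywords = tfidf_keywords(texts, top_n=1)
```

Each inner list can be assigned to the keywords metadata of the corresponding Document. Terms that occur in every text score zero, which is how TF-IDF pushes common words down the ranking.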

Once installed with pip install keybert, you can add keywords to a list of documents as follows:

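from keybert import KeyBERT

kw_model = KeyBERT()
# Extract keywords for every document in one batch call.
keywords = kw_model.extract_keywords([doc.page_content for doc in documents],
                                     stop_words='english')

# Each entry of keywords is a list of (keyword, score) tuples for one document.
for (doc, kws) in zip(documents, keywords):
    doc.metadata["keywords"] = [kw for (kw, _distance) in kws]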

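Rather than taking all of the top keywords, you could keep only those whose score (the _distance value above) is close enough to the document. KeyBERT computes this value as a cosine similarity, so a higher value means a closer match.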
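
A minimal sketch of such a filter follows. The helper name keywords_above, the 0.4 cutoff, and the sample pairs are illustrative assumptions. Because KeyBERT computes the second tuple element as a cosine similarity, the sketch treats a higher score as a closer match. Invert the comparison if your extractor reports a true distance.

```python
def keywords_above(scored_keywords, min_score=0.4):
    """Keep keywords whose score is at least min_score (hypothetical helper)."""
    return [kw for kw, score in scored_keywords if score >= min_score]


# Illustrative (keyword, score) pairs in the shape returned for one document.
scored = [("cassandra", 0.62), ("vector", 0.48), ("table", 0.31)]
selected = keywords_above(scored)
```

A suitable cutoff depends on the embedding model and the corpus, so it is worth inspecting the scores on a sample of documents before fixing a value.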

Hyperlinks

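To capture hyperlinks, populate the hrefs and urls metadata fields of each Document. The example below assumes that page_content holds the raw HTML of the page and that metadata["source"] holds its URL.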

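import re

# Capture the target of every href="..." attribute in the raw HTML.
link_re = re.compile(r'href="([^"]+)')
for doc in documents:
    # Use the source URL as a stable ID so re-ingestion overwrites the document.
    doc.metadata["content_id"] = doc.metadata["source"]
    doc.metadata["hrefs"] = link_re.findall(doc.page_content)
    doc.metadata["urls"] = [doc.metadata["source"]]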
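
The regular expression above returns relative links unchanged and also matches href attributes on tags other than anchors. If links are matched against the urls field by exact string comparison, which this sketch assumes rather than confirms, it may help to resolve each link against the page URL first. The names HrefCollector and extract_hrefs are hypothetical. Only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html, base_url):
    """Return absolute, fragment-free link targets from html, deduplicated in order."""
    collector = HrefCollector()
    collector.feed(html)
    seen = {}
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        seen.setdefault(absolute, None)  # dicts preserve insertion order
    return list(seen)


html = '<p><a href="/docs/b#top">B</a> <a href="https://example.com/docs/b">B again</a></p>'
links = extract_hrefs(html, "https://example.com/docs/a")
```

The result of extract_hrefs(doc.page_content, doc.metadata["source"]) could then be stored in the hrefs metadata field in place of the regex output.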

Store

import cassio
from langchain_openai import OpenAIEmbeddings
from ragstack_knowledge_store import KnowledgeStore

# Connect to the database using settings taken from environment variables.
cassio.init(auto=True)

knowledge_store = KnowledgeStore(embeddings=OpenAIEmbeddings())

# Store the documents
knowledge_store.add_documents(documents)

Retrieve

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Retrieve relevant snippets and generate an answer from them.
llm = ChatOpenAI(model="gpt-4o")

# Depth 0: don't traverse edges; equivalent to vector-only search.
# Depth 1: vector search plus one level of edges.
retriever = knowledge_store.as_retriever(k=4, depth=1)

template = """You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    formatted = "\n\n".join(f"From {doc.metadata['content_id']}: {doc.page_content}" for doc in docs)
    return formatted


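rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run the chain; the input string is passed through unchanged as {question}.
answer = rag_chain.invoke("Your question here")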
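
To make the depth parameter concrete, the toy sketch below expands a set of vector-search hits along edges for a fixed number of hops. It is a conceptual illustration only, not the library's implementation. The function traverse, the edges dict, and the node names are hypothetical.

```python
def traverse(seeds, edges, depth):
    """Return the seeds plus every node within depth hops, in discovery order."""
    visited = list(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in visited:
                    visited.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


# Hypothetical graph: "a" links to "b", which links to "c".
edges = {"a": ["b"], "b": ["c"]}
vector_hits = ["a"]  # stand-in for the result of a vector search

depth_0 = traverse(vector_hits, edges, 0)
depth_1 = traverse(vector_hits, edges, 1)
```

With depth 0 only the vector hits come back. With depth 1 the directly linked chunk is added, and the chunk two hops away is still excluded.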

Development

poetry install --with=dev

# Run Tests
poetry run pytest
