DataStax RAGStack Knowledge Store
RAGStack Knowledge Store
Hybrid Knowledge Store combining vector similarity and edges between chunks.
Usage
- Pre-process your documents to populate metadata information.
- Create a Hybrid KnowledgeStore and add your LangChain Documents.
- Retrieve documents from the KnowledgeStore.
Populate Metadata
The Knowledge Store makes use of the following metadata fields on each Document:
- content_id: If assigned, this specifies the unique ID of the Document. If not assigned, one will be generated. This should be set if you may re-ingest the same document so that it is overwritten rather than being duplicated.
- parent_content_id: If this Document is a chunk of a larger document, you may reference the parent content here.
- keywords: A list of strings representing keywords present in this Document.
- hrefs: A list of strings containing the URLs which this Document links to.
- urls: A list of strings containing the URLs associated with this Document. If one webpage is divided into multiple chunks, each chunk's Document would have the same URL. One webpage may have multiple URLs if it is available in multiple ways.
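To make the fields above concrete, here is a minimal sketch of the metadata for two chunks of one web page. It uses plain dicts, which could be passed as the metadata argument when constructing a LangChain Document. The helper name chunk_metadata, the example URLs, and the "#chunk-N" ID scheme are assumptions made for illustration, not conventions required by the Knowledge Store.

```python
def chunk_metadata(page_url, chunk_index, keywords, hrefs, alias_urls=()):
    """Build the metadata dict for one chunk of a web page (hypothetical helper)."""
    return {
        # A stable ID, so that re-ingesting the same chunk overwrites it.
        # The "#chunk-N" scheme is an assumption for this sketch.
        "content_id": f"{page_url}#chunk-{chunk_index}",
        # The larger document this chunk was cut from.
        "parent_content_id": page_url,
        # Keywords present in this chunk.
        "keywords": list(keywords),
        # URLs this chunk links to.
        "hrefs": list(hrefs),
        # URLs associated with this chunk; all chunks of a page share them.
        "urls": [page_url, *alias_urls],
    }


page = "https://example.com/docs/install"
chunks = [
    chunk_metadata(page, 0, ["install", "pip"], ["https://example.com/docs/usage"]),
    chunk_metadata(page, 1, ["configuration"], []),
]
```

Both chunks carry the same urls value and the same parent_content_id, while each has its own content_id.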
Keywords
To link documents with common keywords, assign the keywords metadata of each Document.
There are various ways to assign keywords to each Document, such as TF-IDF across the documents.
One easy option is to use KeyBERT.
Once installed with pip install keybert, you can add keywords to a list of documents as follows:
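from keybert import KeyBERT

kw_model = KeyBERT()
# Extract keywords for every document in one call.
keywords = kw_model.extract_keywords(
    [doc.page_content for doc in documents],
    stop_words="english",
)
for doc, kws in zip(documents, keywords):
    # Each entry is a (keyword, score) tuple; keep only the keyword.
    doc.metadata["keywords"] = [kw for (kw, _score) in kws]
Rather than taking all the top keywords, you could also keep only those whose score exceeds a chosen threshold. The score is a similarity to the document, so higher values indicate a closer match.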
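For the TF-IDF route mentioned earlier in this section, a dependency-free sketch might look like the following. The function name tfidf_keywords, the simple lowercase tokenizer, and the sample texts are assumptions for illustration; a real pipeline would more likely use an established library and a proper stop-word list.

```python
import math
import re
from collections import Counter


def tfidf_keywords(texts, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each text."""
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in texts]
    # Document frequency: the number of texts in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


texts = [
    "cassandra stores vectors and cassandra scales",
    "langchain builds chains and retrievers",
    "vectors power similarity search",
]
keywords_per_text = tfidf_keywords(texts, top_n=3)
```

Each inner list could then be assigned to the keywords metadata of the corresponding Document. Terms that occur in several texts (such as "and" here) score low and tend to drop out.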
Hyperlinks
To capture hyperlinks, populate the hrefs and urls metadata fields of each Document.
import re

link_re = re.compile(r'href="([^"]+)')
for doc in documents:
    doc.metadata["content_id"] = doc.metadata["source"]
    doc.metadata["hrefs"] = list(link_re.findall(doc.page_content))
    doc.metadata["urls"] = [doc.metadata["source"]]
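To illustrate why both fields are populated, the sketch below pairs each href with any document whose urls contain it. This is a conceptual illustration under the assumption that a link edge runs from the linking document to the linked one; it is not the Knowledge Store's actual implementation, and the sample pages are invented.

```python
import re

link_re = re.compile(r'href="([^"]+)"')

# Two sample pages keyed by their source URL (invented for this sketch).
pages = {
    "https://example.com/a": '<p>See <a href="https://example.com/b">B</a>.</p>',
    "https://example.com/b": (
        '<p>Back to <a href="https://example.com/a">A</a> or '
        '<a href="https://example.com/missing">elsewhere</a>.</p>'
    ),
}

# Same metadata layout as in the snippet above, using plain dicts.
metadata = {
    source: {"content_id": source, "hrefs": link_re.findall(html), "urls": [source]}
    for source, html in pages.items()
}

# Index every URL to the content IDs that declare it in "urls".
url_index = {}
for meta in metadata.values():
    for url in meta["urls"]:
        url_index.setdefault(url, []).append(meta["content_id"])

# An edge exists where one document's href matches another document's URL.
edges = sorted(
    (meta["content_id"], target)
    for meta in metadata.values()
    for href in meta["hrefs"]
    for target in url_index.get(href, [])
)
```

The link to https://example.com/missing produces no edge because no stored document declares that URL.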
Store
import cassio
from langchain_openai import OpenAIEmbeddings
from ragstack_knowledge_store import KnowledgeStore
cassio.init(auto=True)
knowledge_store = KnowledgeStore(embeddings=OpenAIEmbeddings())
# Store the documents
knowledge_store.add_documents(documents)
Retrieve
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

# Retrieve and generate using the relevant snippets of the stored documents.
# Depth 0 - don't traverse edges; equivalent to vector-only search.
# Depth 1 - vector search plus 1 level of edges.
retriever = knowledge_store.as_retriever(k=4, depth=1)
template = """You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    formatted = "\n\n".join(
        f"From {doc.metadata['content_id']}: {doc.page_content}" for doc in docs
    )
    return formatted

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
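The depth parameter can be pictured as a bounded breadth-first expansion from the vector-search hits. The toy sketch below shows that idea on a hand-written edge map. The function expand_by_depth and the sample graph are hypothetical, and the real retriever may rank, deduplicate, or cap results differently.

```python
def expand_by_depth(seeds, edges, depth):
    """Follow edges outward from the seed IDs for at most `depth` hops."""
    selected = list(seeds)
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected


# "a" stands in for a vector-search hit; it links to "b", which links to "c".
edges = {"a": ["b"], "b": ["c"], "c": []}

depth0 = expand_by_depth(["a"], edges, depth=0)  # seeds only
depth1 = expand_by_depth(["a"], edges, depth=1)  # seeds plus direct neighbors
depth2 = expand_by_depth(["a"], edges, depth=2)  # two hops out
```

With depth=0 only the seed is returned, which matches the vector-only behavior described in the comments above.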
Development
poetry install --with=dev
# Run Tests
poetry run pytest