A set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.

Ragdoll

🧭 Project Overview

This project provides a set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.

It uses local vector storage by default but can easily be extended to Pinecone using LangChain.

The default LLM and embedding model are OpenAI's, but there are also options to run a fully local LLM.

🚧 Prerequisites

🎛 Project Setup

To set up the project on your local machine, follow these steps:

pip install python-ragdoll

or, to get the latest build from source:

  1. Clone the repository to your local machine.
  2. Install the required dependencies using pip install -r requirements.txt.

Alternatively, install the latest build directly from GitHub with pip:

pip install git+https://github.com/nsasto/RAGdoll.git
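
To confirm which build ended up in your environment, a quick check like the one below may help. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution name python-ragdoll is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "python-ragdoll") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "python-ragdoll is not installed")
```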

📦 Project Structure

The project is structured as follows:

├── ragdoll_example.ipynb           # demo notebook.
├── ragdoll/                        # ragdoll files
├── README.md                       # This file.
├── requirements.txt                # List of dependencies.
└── img/                            # banner image above

🗄️ Data

The vector data used in this project is stored locally and is used to generate responses in the LLM chat through a retrieval-augmentation process. Be aware that if you use OpenAI as your embeddings engine, your data will be sent to OpenAI.

Getting Started

This guide assumes you have the appropriate API keys for Google search and OpenAI in your environment variables or a .env file. To load them from a .env file:

from dotenv import load_dotenv
load_dotenv(override=True)
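
If a key is missing, the failure may only surface later, inside a search or LLM call. A small check right after loading can catch that early. This is a sketch only: the variable names below are assumptions (they follow common OpenAI and Google search conventions) and the helper missing_env_vars is hypothetical, so check the Ragdoll configuration for the names it actually reads.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# Assumed variable names, not confirmed by the Ragdoll documentation.
REQUIRED = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "GOOGLE_CSE_ID"]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```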

The super rapid version: a handful of lines to run research and generate a response:

from ragdoll.index import RagdollIndex
from ragdoll.retriever import RagdollRetriever

index = RagdollIndex()
ragdoll = RagdollRetriever()

# ok, let's go
question = "tell me more about langchain"
# search, scrape and split in one call
split_docs = index.run_index_pipeline(question)
# build a base retriever over the split documents, then wrap it with contextual compression
retriever = ragdoll.get_retriever(documents=split_docs)
cc_retriever = ragdoll.get_compression_retriever(retriever)
response = ragdoll.answer_me_this(question, cc_retriever)
print(response)

This generates the following structured response (only a snippet is included here):

LangChain is an artificial intelligence framework designed for programmers to develop applications using large language models. It offers several key features that make it versatile and useful for developers.

One of the main features of LangChain is its context-awareness capability. It allows applications to establish connections between a language model and various context sources. This means that developers can create applications that are aware of the context in which they are being used, making them more intelligent and responsive....

1. Create an Index from web content

from ragdoll.index import RagdollIndex
index = RagdollIndex()

question = "tell me more about langchain"
# get appropriate search queries for the question
search_queries = index.get_suggested_search_terms(question)
# get Google search results
results = index.get_search_results(search_queries)
# scrape the returned sites and return documents.
# results contains a little more metadata; the list of urls is available as index.url_list,
# which the next call uses by default
documents = index.get_scraped_content()
# split the documents into chunks
split_docs = index.get_split_documents(documents)

Or, in one line as follows:

split_docs = index.run_index_pipeline(question)
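
To give a feel for what the splitting step at the end of the pipeline does, here is a minimal, library-free sketch of fixed-size chunking with overlap. It is not Ragdoll's implementation: the function split_text and its default sizes are hypothetical and chosen only for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ends,
    so context that straddles a boundary appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(split_text("abcdefghij", chunk_size=4, overlap=2))
```

Smaller chunks tend to make retrieval more precise, while the overlap reduces the chance that a relevant sentence is cut in half at a boundary.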

2. Retrieval

And that's pretty much all it takes to load up our documents. Retrieving them with a LangChain retriever is just as simple.

from ragdoll.retriever import RagdollRetriever

ragdoll = RagdollRetriever()
retriever = ragdoll.get_retriever(documents=split_docs) 
docs = retriever.get_relevant_documents('how does langchain work')

from ragdoll.helpers import pretty_print_docs
print("-" * 100)
print(f"The retriever found {len(docs)} relevant documents")
print("-" * 100, "\n\n")
print(pretty_print_docs(docs, for_llm=False))

To use multi-query retrieval, use get_mq_retriever. Note that multi-query retrieval incurs additional calls to your LLM. The Ragdoll MultiQuery class is a custom LangChain retriever that works around a bug in the native LangChain multi-query retriever as of version 0.1.6.

retriever = ragdoll.get_mq_retriever(documents=split_docs) 
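
Conceptually, multi-query retrieval asks the LLM for several rephrasings of the question, runs each through the base retriever, and returns the unique union of the results. The sketch below shows only the merging step with a toy lookup table standing in for a real retriever. The names multi_query_retrieve, TOY_INDEX and toy_retrieve are hypothetical, and this is not how the Ragdoll MultiQuery class is implemented.

```python
from typing import Callable, Dict, Iterable, List


def multi_query_retrieve(
    queries: Iterable[str], retrieve: Callable[[str], List[str]]
) -> List[str]:
    """Run every query through `retrieve` and return the unique union of results,
    preserving the order in which documents were first seen."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy stand-in for a retriever: maps a query string to document ids.
TOY_INDEX: Dict[str, List[str]] = {
    "what is langchain": ["doc-a", "doc-b"],
    "langchain features": ["doc-b", "doc-c"],
}


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


if __name__ == "__main__":
    print(multi_query_retrieve(["what is langchain", "langchain features"], toy_retrieve))
```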

To use the Contextual Compression Retriever, you'll need a base retriever (either the standard or the multi-query one). You then select the pipeline options, which are all set to True by default but can be changed in the config params. By default, the Contextual Compressor applies this refinement process: embeddings_filter > splitter > redundant_filter > relevance_filter

cc_retriever = ragdoll.get_compression_retriever(retriever)
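
To illustrate the idea behind two of those stages, the sketch below filters toy documents first by similarity to the query and then by redundancy with documents already kept. It is a conceptual example over hand-written vectors, not the Ragdoll or LangChain implementation; the function names and thresholds are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Doc = Tuple[str, Vector]  # (text, embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_filter(query_vec: Vector, docs: List[Doc], threshold: float = 0.5) -> List[Doc]:
    """Keep documents whose embedding is close enough to the query embedding."""
    return [(text, vec) for text, vec in docs if cosine(query_vec, vec) >= threshold]


def redundancy_filter(docs: List[Doc], threshold: float = 0.95) -> List[Doc]:
    """Drop documents that are near-duplicates of one already kept."""
    kept: List[Doc] = []
    for text, vec in docs:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return kept


if __name__ == "__main__":
    query = (1.0, 0.0)
    docs = [("a", (1.0, 0.0)), ("b", (1.0, 0.01)), ("c", (0.0, 1.0)), ("d", (1.0, 1.0))]
    relevant = similarity_filter(query, docs)
    print([text for text, _ in redundancy_filter(relevant)])
```

In this toy run, "c" is dropped as irrelevant to the query and "b" is dropped as a near-duplicate of "a".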

3. Q&A

Basic Q&A is pretty straightforward. Simply pass your question and a retriever to the answer_me_this method:

response = ragdoll.answer_me_this(question, cc_retriever)
print(response)

📚 References

The following resources were used in the development of this project:

🤝 Contributions

This project is a work in progress and there's plenty of room for improvement - contributions are always welcome! If you have any ideas or suggestions, feel free to open an issue or submit a pull request.

🛡️ Disclaimer

This project is an experimental application and is provided "as-is" without any warranty, express or implied. Code is shared for educational purposes under the MIT license.
