A set of helper classes that abstract some of the more common tasks of a typical RAG process including document loading/web scraping.
Project description
🧭 Project Overview
This project provides a set of helper classes that abstract some of the more common tasks of a typical retrieval-augmented generation (RAG) process, including document loading and web scraping.
It is based on local vector storage but can easily be extended to Pinecone using LangChain.
The default LLM and embedding model are OpenAI's, but there are also options to run a fully local LLM.
🚧 Prerequisites
- OpenAI API Key - For more information on how to create an OpenAI API key, visit the OpenAI Platform Website
- Google API Keys - To set them up, create the GOOGLE_API_KEY in the Google Cloud credential console (https://console.cloud.google.com/apis/credentials) and a GOOGLE_CSE_ID using the Programmable Search Engine (https://programmablesearchengine.google.com/controlpanel/create).
🎛 Project Setup
To set up the project on your local machine, install the released package from PyPI:
pip install python-ragdoll
To get the latest build instead, either install directly from GitHub with pip:
pip install git+https://github.com/nsasto/RAGdoll.git
or work from a local copy:
- Clone the repository to your local machine.
- Install the required dependencies using pip install -r requirements.txt.
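To confirm that the install worked, a small check along these lines can help. It is only a sketch: it uses the standard library's importlib.metadata, the helper name installed_version is hypothetical, and it assumes the distribution is registered under the name python-ragdoll used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the package is registered as "python-ragdoll", as in the pip command.
found = installed_version("python-ragdoll")
print(found if found else "python-ragdoll is not installed in this environment")
```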
📦 Project Structure
The project is structured as follows:
├── ragdoll_example.ipynb # demo notebook.
├── ragdoll/ # ragdoll files
├── README.md # This file.
├── requirements.txt # List of dependencies.
└── img/ # banner image above
🗄️ Data
The vector data used in this project is stored locally and is used to generate responses in the LLM chat through retrieval augmentation. Be aware that if you use OpenAI as your embeddings engine, your data will be sent to OpenAI.
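For readers new to vector storage, the idea can be sketched in a few lines of plain Python. This is not how RAGdoll or FAISS is implemented; the function names, the toy three-dimensional vectors, and the brute-force search are all illustrative assumptions. It only shows the principle: texts are stored alongside their embedding vectors, and a query is answered by returning the texts whose vectors are most similar to the query vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; a real embedding model returns far larger vectors.
store = [
    ("LangChain overview", [0.9, 0.1, 0.0]),
    ("FAISS index types", [0.1, 0.9, 0.0]),
    ("Cooking pasta", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], store, k=2))
```

A dedicated library such as FAISS replaces the brute-force sort with an index that scales to many more vectors.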
Getting Started
This assumes you have the appropriate API keys for Google search and OpenAI in your environment variables or a .env file. To load them from a .env file:
from dotenv import load_dotenv
load_dotenv(override=True)
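Before going further, it can be useful to check that the keys are actually visible to Python. The sketch below is not part of RAGdoll, and missing_keys is a hypothetical helper. GOOGLE_API_KEY and GOOGLE_CSE_ID are the names given in the prerequisites; OPENAI_API_KEY is the name the OpenAI client reads by default, and that RAGdoll relies on it is an assumption here.

```python
import os

# OPENAI_API_KEY is assumed (the OpenAI client's default variable name);
# the two Google names come from the prerequisites section.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "GOOGLE_CSE_ID")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```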
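The quick version: a handful of lines to go from a question to a researched, generated response:
from ragdoll.index import RagdollIndex
from ragdoll.retriever import RagdollRetriever
index = RagdollIndex()
ragdoll = RagdollRetriever()
# search, scrape, and split documents for the question
question = "tell me more about langchain"
split_docs = index.run_index_pipeline(question)
# build a base retriever, then wrap it in the compression retriever
retriever = ragdoll.get_retriever(documents=split_docs)
cc_retriever = ragdoll.get_compression_retriever(retriever)
# generate the answer from the retrieved context
response = ragdoll.answer_me_this(question, cc_retriever)
print(response)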
This generates a structured response; only a snippet is shown here:
LangChain is an artificial intelligence framework designed for programmers to develop applications using large language models. It offers several key features that make it versatile and useful for developers.
One of the main features of LangChain is its context-awareness capability. It allows applications to establish connections between a language model and various context sources. This means that developers can create applications that are aware of the context in which they are being used, making them more intelligent and responsive....
1. Create an Index from web content
from ragdoll.index import RagdollIndex
index = RagdollIndex()
question = "tell me more about langchain"
# get appropriate search queries for the question
search_queries = index.get_suggested_search_terms(question)
# get Google search results
results = index.get_search_results(search_queries)
# scrape the returned sites and return documents;
# results carries some extra metadata, while the list of URLs is available
# as index.url_list, which the next call uses by default
documents = index.get_scraped_content()
# split the documents
split_docs = index.get_split_documents(documents)
Or, in one line as follows:
split_docs = index.run_index_pipeline(question)
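The splitting step is worth a short illustration. The sketch below shows the general idea of cutting text into overlapping chunks so that each piece fits an embedding model while keeping some shared context at the boundaries. It is an assumption about what a splitter typically does, not RAGdoll's actual implementation; split_text and its parameters are hypothetical, and real splitters usually prefer to break on separators such as paragraphs or sentences.

```python
def split_text(text, chunk_size=20, overlap=5):
    """Cut text into chunks of chunk_size characters that overlap by overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap at work.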
2. Retrieval
That is all it takes to load our documents. Retrieving them with a LangChain retriever is just as simple.
from ragdoll.helpers import pretty_print_docs
from ragdoll.retriever import RagdollRetriever
ragdoll = RagdollRetriever()
# build a retriever over the split documents and query it
retriever = ragdoll.get_retriever(documents=split_docs)
docs = retriever.get_relevant_documents('how does langchain work')
print("-" * 100)
print(f"The retriever found {len(docs)} relevant documents")
print("-" * 100, "\n\n")
print(pretty_print_docs(docs, for_llm=False))
To use multi-query retrieval, use get_mq_retriever. Note that multi-query retrieval incurs additional calls to your LLM.
The Ragdoll MultiQuery class is a custom LangChain retriever that works around a bug in native LangChain as of version 0.1.6.
retriever = ragdoll.get_mq_retriever(documents=split_docs)
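The extra LLM calls come from how multi-query retrieval works in general: the question is rephrased into several variants, each variant is sent to the base retriever, and the results are merged without duplicates. The sketch below illustrates that idea with stand-in functions. It is not the Ragdoll MultiQuery class; every name in it is hypothetical, and including the original question alongside the variants is a choice made for this sketch.

```python
def multi_query_retrieve(question, make_variants, retrieve):
    """Retrieve for the question and its variants, merging unique documents in order."""
    seen = set()
    merged = []
    for query in [question, *make_variants(question)]:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Stand-in for the LLM call that rephrases the question.
def fake_variants(question):
    return [question + " tutorial", question + " architecture"]


# Stand-in for the vector store lookup.
CORPUS = {
    "langchain": ["doc-a", "doc-b"],
    "langchain tutorial": ["doc-b", "doc-c"],
    "langchain architecture": ["doc-d"],
}


def fake_retrieve(query):
    return CORPUS.get(query, [])


print(multi_query_retrieve("langchain", fake_variants, fake_retrieve))
```

Here "doc-b" is returned by two queries but appears only once in the merged result.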
To use the Contextual Compression Retriever, you'll need a base retriever (either the standard or the multi-query one). Then select the pipeline options, which are all set to True by default but can be changed in the config params. By default, the contextual compressor applies this refinement process: embeddings_filter > splitter > redundant_filter > relevance_filter
cc_retriever = ragdoll.get_compression_retriever(retriever)
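The four-stage refinement order can be pictured as a chain of functions, each taking a list of texts and returning a shorter or finer-grained list. The sketch below is a toy illustration under loud assumptions: it uses word overlap in place of real embeddings, all function names and thresholds are hypothetical, and it does not reproduce RAGdoll's or LangChain's actual compressors.

```python
from functools import reduce


def word_overlap(a, b):
    """Jaccard similarity over lowercase word sets (a stand-in for embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def similarity_filter(query, threshold):
    """Build a stage that keeps texts whose overlap with the query meets the threshold."""
    return lambda docs: [d for d in docs if word_overlap(query, d) >= threshold]


def sentence_splitter(docs):
    """Split each text into sentences on full stops."""
    return [s.strip() for d in docs for s in d.split(".") if s.strip()]


def redundant_filter(docs, threshold=0.8):
    """Drop texts that are near-duplicates of one already kept."""
    kept = []
    for d in docs:
        if all(word_overlap(d, k) < threshold for k in kept):
            kept.append(d)
    return kept


def run_pipeline(docs, stages):
    """Apply each stage in order to the output of the previous one."""
    return reduce(lambda current, stage: stage(current), stages, docs)


query = "how does langchain work"
docs = [
    "LangChain is a framework. How does LangChain work with models.",
    "How does LangChain work with models.",
    "Pasta should be cooked in salted water.",
]
# Same order as above: coarse filter > splitter > redundant filter > relevance filter.
stages = [
    similarity_filter(query, 0.2),
    sentence_splitter,
    redundant_filter,
    similarity_filter(query, 0.5),
]
result = run_pipeline(docs, stages)
print(result)
```

The coarse filter drops the unrelated document, the splitter breaks the rest into sentences, the redundant filter removes the repeated sentence, and the final filter keeps only the sentence closest to the query.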
3. Q&A
Basic Q&A is straightforward. Simply pass your question and a retriever to the answer_me_this method:
response = ragdoll.answer_me_this(question, cc_retriever)
print(response)
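Conceptually, a retrieval-augmented answer comes from placing the retrieved text in the prompt ahead of the question. The sketch below shows one common way to assemble such a prompt. It is an assumption about the general technique, not the prompt that answer_me_this actually builds; build_prompt, its wording, and the max_chars budget are hypothetical.

```python
def build_prompt(question, docs, max_chars=2000):
    """Number the retrieved texts, stop at the character budget, and append the question."""
    context_parts = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        entry = f"[{i}] {doc}"
        # Stop adding context once the budget would be exceeded.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is LangChain?",
    ["LangChain is a framework.", "It connects LLMs to data."],
)
print(prompt)
```

The resulting string would then be sent to the LLM, which is the step that answer_me_this handles for you.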
📚 References
The following resources were used in the development of this project:
- Langchain: https://www.langchain.com/
- FAISS: https://github.com/facebookresearch/faiss
🤝 Contributions
This project is a work in progress and there's plenty of room for improvement. Contributions are always welcome! If you have any ideas or suggestions, feel free to open an issue or submit a pull request.
🛡️ Disclaimer
This project is an experimental application and is provided "as-is" without any warranty, express or implied. Code is shared for educational purposes under the MIT license.
Download files
File details
Details for the file python_ragdoll-1.2.0.tar.gz.
File metadata
- Download URL: python_ragdoll-1.2.0.tar.gz
- Upload date:
- Size: 21.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2d98ca6ea991eee432619196814cb74b4b4a220893a8df5fc858d33dc29c54a5 |
| MD5 | a84d4e6a13f41571f574677c8a35c91d |
| BLAKE2b-256 | 622310c1ec39fb6897e53bb899c34c4c34fe24f7719b47c30ea7bc62f602fedb |
File details
Details for the file python_ragdoll-1.2.0-py3-none-any.whl.
File metadata
- Download URL: python_ragdoll-1.2.0-py3-none-any.whl
- Upload date:
- Size: 24.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 488efc76fddd2d1a126373d0ec3cd0c113f1fa7f8ac3bd68f7de689cf6033ae4 |
| MD5 | 581d28986bb9c08119835308ae9558f6 |
| BLAKE2b-256 | 5430618fbcbd09d357c4cc551ececd856090c644360d8c7a0b2be0d859e02c95 |