A library for building Retrieval-Augmented Generation pipelines.

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

RAG Toolkit

The RAG Toolkit is a library designed to streamline the creation of Retrieval-Augmented Generation (RAG) pipelines. It provides utilities for document processing, vector-based retrieval, query routing, and integration with large language models (LLMs). This toolkit simplifies the development of RAG-based systems, enabling developers to focus on solving real-world problems.

Features

Data Loading: Extract and process data from PDF files.
Base Retrieval: Efficient retrieval setup using embeddings for document search.
Retrieval Strategies: Support for various retrieval strategies, including StepBack, Fusion, and more.
Generation Strategies: Flexible response generation methods, including Recursive and HyDE.
Query Routing: Route user queries dynamically based on routing logic or templates.
Customizable Templates: Predefined or user-defined templates for task-specific use cases.
Pipeline Integration: Combines retrieval and generation into a single, streamlined pipeline.
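
As a rough illustration of the query routing idea listed above, the sketch below picks a prompt template by keyword matching. It is not the toolkit's routing API: `ROUTES`, `KEYWORDS`, `DEFAULT_TEMPLATE`, and `route_query` are hypothetical names, and a real router could just as well ask an LLM to classify the query.

```python
# Hypothetical keyword-based router; not part of rag-toolkit's API.
ROUTES = {
    "math": "You are a math tutor. Answer step by step: {query}",
    "code": "You are a senior engineer. Answer with code: {query}",
}
DEFAULT_TEMPLATE = "Answer the question concisely: {query}"

# Keywords that trigger each route (illustrative values only).
KEYWORDS = {
    "math": ("integral", "equation", "derivative"),
    "code": ("python", "function", "bug"),
}


def route_query(query: str) -> str:
    """Return a filled-in prompt for the first route whose keyword appears in the query."""
    lowered = query.lower()
    for route, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return ROUTES[route].format(query=query)
    # No keyword matched: fall back to the generic template.
    return DEFAULT_TEMPLATE.format(query=query)


if __name__ == "__main__":
    print(route_query("How do I fix this Python bug?"))
```

Queries that match no keyword fall through to the default template, so the router always returns a usable prompt.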

Installation

To install the RAG Toolkit, clone the repository or install it directly from PyPI.

Clone the Repository

git clone https://github.com/youssef-yasser-ali/rag-toolkit.git
cd rag-toolkit
pip install .

Install via PyPI

pip install rag-toolkit

Configuration

Optionally, use config/config.py to manage model names and API keys.

# Example config
GENRATIVE_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"

def get_generator_api_key():
    return "your-generator-api-key"

def get_query_gen_api_key():
    return "your-query-generator-api-key"

def get_embedding_api_key():
    return "your-embedding-api-key"
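
The example config returns hard-coded placeholder strings. One possible variation, sketched below, reads the keys from environment variables so they stay out of version control. The environment variable names and the `_read_key` helper are assumptions made for illustration, not conventions defined by the toolkit.

```python
import os

# Model names copied from the example config above.
GENRATIVE_MODEL = "gpt-3.5-turbo"
EMBEDDING_MODEL = "text-embedding-ada-002"


def _read_key(env_var: str) -> str:
    """Return the value of env_var, raising if it is unset or empty."""
    value = os.environ.get(env_var, "")
    if not value:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return value


# The environment variable names below are hypothetical.
def get_generator_api_key() -> str:
    return _read_key("RAG_GENERATOR_API_KEY")


def get_query_gen_api_key() -> str:
    return _read_key("RAG_QUERY_GEN_API_KEY")


def get_embedding_api_key() -> str:
    return _read_key("RAG_EMBEDDING_API_KEY")
```

Failing with an explicit error when a variable is missing tends to surface configuration problems earlier than passing an empty key to a model client.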

Quickstart Guide

1. Initialize Models

from rag_toolkit.google_models import initialize_llm, initialize_embedding

# Your configuration
from config.config import get_generator_api_key, get_query_gen_api_key, get_embedding_api_key, GENRATIVE_MODEL, EMBEDDING_MODEL

# Initialize models
retrieval_llm = initialize_llm(model_name=GENRATIVE_MODEL, api_key=get_query_gen_api_key())
generation_llm = initialize_llm(model_name=GENRATIVE_MODEL, api_key=get_generator_api_key())
embedding_llm = initialize_embedding(model_name=EMBEDDING_MODEL, api_key=get_embedding_api_key())

2. Load and Process Documents

from rag_toolkit.data_loader import load_pdf_pages

file_path = './data/raw/your_data.pdf'
documents = load_pdf_pages(file_path=file_path, start_page=1, end_page=20)

3. Create a Vector Store Retriever

from rag_toolkit.vector_store import create_vector_store_retriever

retriever = create_vector_store_retriever(documents=documents, embeddings_model=embedding_llm)
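
To make the idea of embedding-based retrieval concrete, here is a minimal, library-free sketch of what a vector store retriever does conceptually: score each document vector against the query vector and return the closest ones. The toy two-dimensional vectors and the `cosine_similarity` and `top_k` helpers are hypothetical; the toolkit's retriever relies on a real embedding model and vector store rather than this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings standing in for real model output.
toy_index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.9, 0.1],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.0], toy_index))
```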

4. Retrieval Strategy Setup

After setting up the base retriever, you can layer a more advanced retrieval strategy on top of it. The retrieval strategy determines how the user's query is transformed or expanded before documents are retrieved, with the aim of improving retrieval accuracy.

Available Retrieval Strategies

Simple Retrieval: Basic retrieval using embeddings to find the most relevant documents based on similarity to the input query.
Multi-Query Generation Retrieval: Generates multiple variations of the original query to improve coverage and ensure more diverse results.
Fusion Generation Retrieval: Combines results from different retrieval methods to increase accuracy by fusing different document retrieval outputs.
Decomposition Retrieval: Breaks down a complex query into smaller, simpler sub-queries to improve the effectiveness of document retrieval.
Step-Back Retrieval: Paraphrases the query into a more general "step-back" question and retrieves documents for it, which provides broader background context.
HyDE Retrieval: Generates a hypothetical passage that answers the query, for example in the style of a scientific paper, and uses that passage instead of the raw query to retrieve similar documents.

from rag_toolkit.retriever import StepBackRetriever

retrieval_strategy = StepBackRetriever(model=retrieval_llm, base_retriever=retriever, template=None)
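
Fusion-style retrieval has to merge several ranked result lists into one. A common way to do that is reciprocal rank fusion (RRF), sketched below in plain Python. This is a generic illustration, not the toolkit's internal code: `reciprocal_rank_fusion` is a hypothetical helper, and the constant `k=60` is a conventional default rather than a value taken from the toolkit.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids into a single list.

    Each document earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1; documents are returned by descending total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    # Results for three query variations (toy document ids).
    results = [["a", "b", "c"], ["b", "a", "d"], ["b", "c"]]
    print(reciprocal_rank_fusion(results))
```

Documents that rank well across many lists rise to the top, even if no single list ranks them first.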

5. Generation Strategy Setup

Once the retrieval strategy is set up, configure the Generation Strategy. This step defines how the system generates context-aware responses based on the retrieved documents.

Available Generation Strategies

Here’s a concise summary of each generator:

SimpleGenerator: Generates answers using a basic retrieval approach and a single context.
MultiQueryGenerator: Uses multiple queries to retrieve diverse contexts, improving answer depth.
FusionGenerator: Combines multiple retrieval strategies, such as reciprocal rank fusion, to enhance answer quality.
RecursiveGenerator: Iteratively generates sub-questions and refines answers by using previous question-answer pairs and additional context.
IndividualGenerator: Synthesizes answers from individual question-answer pairs for more comprehensive responses.
StepBackGenerator: Considers both normal and "step-back" contexts to provide comprehensive, contextually relevant answers.
HyDEGenerator: Uses detailed context to improve the generation of answers, ensuring relevance and accuracy.

from rag_toolkit.generator import StepBackGenerator

generation_strategy = StepBackGenerator(model=generation_llm, template=None)
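
The step-back generator is described as using both the normal and the step-back context. The sketch below shows one way such a prompt could be assembled. The template text and the `build_step_back_prompt` helper are hypothetical; the toolkit's default template (used when `template=None`) may look quite different.

```python
# Hypothetical template; the toolkit's default template may differ.
STEP_BACK_TEMPLATE = (
    "Answer the question using the context below.\n\n"
    "Context for the original question:\n{normal_context}\n\n"
    "Context for the more general step-back question:\n{step_back_context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_step_back_prompt(question, normal_docs, step_back_docs):
    """Fill the template with both sets of retrieved passages."""
    return STEP_BACK_TEMPLATE.format(
        normal_context="\n".join(normal_docs),
        step_back_context="\n".join(step_back_docs),
        question=question,
    )


if __name__ == "__main__":
    prompt = build_step_back_prompt(
        "What's ML?",
        ["ML is a subfield of AI."],
        ["AI studies systems that perform tasks requiring intelligence."],
    )
    print(prompt)
```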

6. Define RAG Pipeline

Build your pipeline:

from rag_toolkit.pipeline import RagPipeline

rag_pipeline = RagPipeline(retrieval=retrieval_strategy, generator=generation_strategy)

7. Process Queries

query = "What's ML?"
result = rag_pipeline.process(query=query)
print(result)
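
Conceptually, processing a query means retrieving documents first and then passing them to the generator. The stand-in below illustrates that flow with stub functions so it runs without API keys. `TinyRagPipeline`, `fake_retrieve`, and `fake_generate` are hypothetical and are not the toolkit's `RagPipeline` implementation.

```python
class TinyRagPipeline:
    """Hypothetical stand-in that chains a retriever and a generator."""

    def __init__(self, retrieve, generate):
        self.retrieve = retrieve  # callable: query -> list of document strings
        self.generate = generate  # callable: (query, documents) -> answer string

    def process(self, query):
        docs = self.retrieve(query)
        return self.generate(query, docs)


def fake_retrieve(query):
    # Stub retriever: always returns the same single passage.
    return ["Machine learning (ML) studies algorithms that learn from data."]


def fake_generate(query, docs):
    # Stub generator: reports what it received instead of calling an LLM.
    return f"Q: {query} | based on {len(docs)} document(s)"


pipeline = TinyRagPipeline(fake_retrieve, fake_generate)

if __name__ == "__main__":
    print(pipeline.process("What's ML?"))
```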

Examples

The examples/ directory contains sample scripts to help you get started with the toolkit:

example_pipeline: A basic example of a RAG pipeline for question-answering.
routing_example: Example of routing queries based on the context.
customize_template: How to use custom templates for retrieval and generation.

Run these examples using:

python -m examples.example_pipeline

Dependencies

The RAG Toolkit requires the following Python libraries:

langchain
langsmith
chromadb
pydantic

Install the required dependencies with:

pip install -r requirements.txt

Contributing

We welcome contributions to the RAG Toolkit! To contribute:

Fork the repository.
Create a new branch for your feature or bug fix.
Commit your changes.
Submit a pull request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for more information.

Contact

For questions or support, feel free to reach out:

Email: yyasser849@gemail.com
GitHub: youssef-yasser-ali

Happy Coding!

Release history

0.1.1 (this version): Dec 16, 2024
0.1.0: Dec 16, 2024

Download files

Source Distribution

rag-toolkit-0.1.1.tar.gz (15.5 kB view details)

Uploaded Dec 16, 2024 Source

Built Distribution

rag_toolkit-0.1.1-py3-none-any.whl (15.6 kB view details)

Uploaded Dec 16, 2024 Python 3

File details

Details for the file rag-toolkit-0.1.1.tar.gz.

File metadata

Download URL: rag-toolkit-0.1.1.tar.gz
Upload date: Dec 16, 2024
Size: 15.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.16

File hashes

Hashes for rag-toolkit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`00a0003586d5e7a5f6388be77cc4b2c0d53609401e4b7892dcf2b746e3bcfafe`
MD5	`6fb36bb4551bb7b08e980027697f4740`
BLAKE2b-256	`3336f0a4872f8ffba2288b10cf4b0daac019a8535f1d6323c14c4967529910e2`
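
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do that; `sha256_of_file` and `verify_download` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Usage (assumes the archive was downloaded to the current directory):
# verify_download(
#     "rag-toolkit-0.1.1.tar.gz",
#     "00a0003586d5e7a5f6388be77cc4b2c0d53609401e4b7892dcf2b746e3bcfafe",
# )
```

The same helper applies to the wheel file when given the wheel's SHA256 digest.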

File details

Details for the file rag_toolkit-0.1.1-py3-none-any.whl.

File metadata

Download URL: rag_toolkit-0.1.1-py3-none-any.whl
Upload date: Dec 16, 2024
Size: 15.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.16

File hashes

Hashes for rag_toolkit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a73a08b53c83c56efb14e787bc16ce97bfa050a68b4d378622e567c8a00a4195`
MD5	`39dae7b4e88a9cc2828cb191bf4cb03f`
BLAKE2b-256	`a78f5b754962a686c7faf6b69c27ae8835e6d20f04c1e5a81022586597e15431`
