RAG Pipeline

A powerful Retrieval-Augmented Generation (RAG) pipeline implementation that supports multiple document types, embedding models, and LLM providers.

Features

  • Multiple Document Support

    • PDF documents
    • Text files
    • HTML content
  • Flexible LLM Integration

    • Ollama support (local models)
    • Hugging Face integration
    • Configurable model parameters
  • Advanced Document Processing

    • Metadata extraction from file paths
    • Configurable text chunking
    • Duplicate detection
    • Progress tracking
  • Vector Store Features

    • ChromaDB integration
    • Configurable embedding models
    • Multiple search strategies
    • Persistent storage
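
The "metadata extraction from file paths" feature can be pictured with a small, stdlib-only sketch. The field names and the `path_metadata` helper are illustrative assumptions, not the pipeline's actual API:

```python
from pathlib import Path

def path_metadata(path: str) -> dict:
    """Derive simple document metadata from a file path (illustrative only)."""
    p = Path(path)
    return {
        "source": str(p),                  # full path, useful for citations
        "filename": p.name,                # e.g. "article.pdf"
        "filetype": p.suffix.lstrip("."),  # e.g. "pdf"
        "parent": p.parent.name,           # immediate folder, often a year or issue
    }
```

Metadata like this is typically attached to each chunk so answers can cite their source document.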

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/rag-pipeline.git
cd rag-pipeline
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
# install from source
pip install -e .
# or install latest release from PyPI
pip install rag-agent
  4. Install Ollama (if using local models):
curl -fsSL https://ollama.com/install.sh | sh

Configuration

Create a .env file in the project root with your configuration:

# Document source settings
PIPELINE_SOURCES='{"/path/to/your/documents": ["pdf", "mhtml"]}'

# Text splitting settings
PIPELINE_CHUNK_SIZE=1000
PIPELINE_CHUNK_OVERLAP=200

# Vector store settings
PIPELINE_PERSIST_DIRECTORY=chroma_db
PIPELINE_COLLECTION_NAME=default_collection

# Embedding model settings
PIPELINE_EMBEDDING_MODEL=all-MiniLM-L6-v2
PIPELINE_EMBEDDING_MODEL_KWARGS={"device": "cuda"}

# LLM settings
PIPELINE_LLM_PROVIDER=ollama  # or 'huggingface'
PIPELINE_LLM_MODEL=mistral
PIPELINE_LLM_MODEL_KWARGS={"temperature": 0.3}
PIPELINE_LLM_API_KEY=your_api_key  # Required for Hugging Face

# Retrieval settings
PIPELINE_SEARCH_TYPE=similarity  # or 'mmr', 'similarity_score_threshold'
PIPELINE_K=5
PIPELINE_SCORE_THRESHOLD=0.5
PIPELINE_FETCH_K=20
PIPELINE_LAMBDA_MULT=0.5

Refer to the pipeline config documentation for more information.
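
The `mmr` search type trades relevance against redundancy: it fetches `PIPELINE_FETCH_K` candidates by query similarity, then greedily keeps `PIPELINE_K` of them, penalizing documents that resemble ones already selected by `1 - lambda_mult`. A minimal dependency-free sketch of the idea (not the ChromaDB/LangChain implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, docs, k=5, fetch_k=20, lambda_mult=0.5):
    """Greedy maximal-marginal-relevance selection over document vectors.

    Returns indices into `docs`, most relevant / least redundant first.
    """
    # Stage 1: shortlist the fetch_k most query-similar candidates.
    candidates = sorted(range(len(docs)),
                        key=lambda i: cosine(query, docs[i]),
                        reverse=True)[:fetch_k]
    selected = []
    # Stage 2: balance relevance against similarity to already-picked docs.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_mult=1.0` this degenerates to plain similarity ranking; lower values favor diversity among the returned chunks.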

Usage

Basic Usage

from rag_agent.pipeline import RAGPipeline, Settings

async def main():
    # Initialize with default settings
    async with RAGPipeline() as pipeline:
        # Load and process documents
        documents = await pipeline.load_documents()
        processed_docs = await pipeline.process_documents(documents)
        await pipeline.update_vectorstore(processed_docs)
        
        # Setup and run query
        await pipeline.setup_retrieval_chain()
        answer = await pipeline.run("Your question here")
        print(answer)

# Run the pipeline
import asyncio
asyncio.run(main())

Web UI

[UI screenshot]

The project includes a modern web interface built with Panel that provides an interactive way to use the RAG pipeline:

Features

  • Interactive Chat Interface

    • Real-time question answering
    • Context toggle for viewing retrieved documents
    • Chat history saving and export
    • Copy and clear functionality
  • Configuration Management

    • Live configuration updates
    • Document source management
    • Embedding model settings
    • Text splitting parameters
    • Vector store configuration
    • Retrieval strategy options
    • LLM provider settings
  • User Experience

    • Responsive layout with collapsible sidebar
    • Real-time notifications
    • Progress tracking
    • Error handling with detailed feedback

Installation

Install the UI dependencies:

pip install 'rag-agent[ui]'

Usage

Start the web interface:

python -m rag_agent.ui

Or specify a custom port:

python -m rag_agent.ui --port 8502

The interface will be available at http://localhost:8501 (or your specified port).

Custom Configuration

config = Settings(
    pipeline_source="/path/to/documents",
    pipeline_source_type="pdf",
    pipeline_llm_provider="ollama",
    pipeline_llm_model="mixtral",
    pipeline_llm_model_kwargs={"temperature": 0.3}
)

async with RAGPipeline(config) as pipeline:
    # ... rest of the code

Supported Models

All Ollama Models

  • mistral (7B parameters)
  • mixtral (8x7B parameters)
  • llama2 (7B parameters)
  • codellama (Code specialized)
  • neural-chat (Chat optimized)
  • dolphin-mixtral (Chat optimized)
  • And many more...

Hugging Face Models (Experimental)

  • mistralai/Mistral-7B-Instruct-v0.2
  • meta-llama/Llama-2-7b-chat-hf
  • malteos/gpt2-wechsel-german
  • And many more...

Document Processing

The pipeline supports various document types and processing options:

PDF Processing

  • Uses PyMuPDF for efficient PDF processing
  • Extracts text and metadata
  • Configurable processing modes

Text Processing

  • Configurable chunk sizes
  • Overlap control
  • Metadata extraction
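
Overlap means consecutive chunks share their boundary text, so a sentence cut at a chunk edge still appears whole in at least one chunk. A simplified character-based sketch of `PIPELINE_CHUNK_SIZE` / `PIPELINE_CHUNK_OVERLAP` (the pipeline itself presumably uses a smarter, separator-aware splitter):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each chunk advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults above, 2,500 characters become four chunks, each starting 800 characters after the previous one.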

HTML Processing

  • Clean HTML extraction
  • Structured content handling
  • Metadata preservation
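
"Clean HTML extraction" amounts to keeping visible text while discarding markup and script/style content. A stdlib-only sketch of the idea (the project more likely uses a dedicated HTML loader; this is purely illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for script/style nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```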

iX Archive Scraper

The project includes a specialized scraper for the iX magazine issue archive:

Features

  • Automated Article Download

    • Downloads articles in multiple formats (PDF, MHTML)
    • Preserves article metadata and structure
    • Handles authentication and session management
  • Parallel Processing

    • Concurrent article processing
    • Configurable thread pool
    • Progress tracking with tqdm
  • Robust Error Handling

    • Automatic retries for failed downloads
    • Graceful cleanup on interruption
    • Detailed logging
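
The retry-and-concurrency behaviour described above can be sketched with plain asyncio. `download_with_retry` is a hypothetical helper, not the scraper's real API; the actual implementation may back off between attempts:

```python
import asyncio

async def download_with_retry(fetch, url, attempts=5, sem=None, delay=0.0):
    """Run `fetch(url)` under a concurrency limit, retrying on failure.

    `fetch` is any coroutine function; the final failure is re-raised.
    """
    sem = sem or asyncio.Semaphore(10)  # mirrors IX_SCRAPER_MAX_CONCURRENT
    for attempt in range(1, attempts + 1):
        async with sem:
            try:
                return await fetch(url)
            except Exception:
                if attempt == attempts:
                    raise
        await asyncio.sleep(delay)  # simple fixed pause between attempts
```

Sharing one semaphore across many `download_with_retry` tasks caps how many downloads run at once, independent of how many articles are queued.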

Configuration

Add the following to your .env file to configure the iX scraper:

# IX Scraper settings
IX_SCRAPER_BASE_URL=https://www.heise.de
IX_SCRAPER_SIGN_IN_URL=https://www.heise.de/sso/login/
IX_SCRAPER_ARCHIVE_URL=https://www.heise.de/select/ix/archiv/
IX_SCRAPER_MAX_THREADS=10
IX_SCRAPER_MAX_CONCURRENT=10
IX_SCRAPER_TIMEOUT=30
IX_SCRAPER_RETRY_ATTEMPTS=5
IX_SCRAPER_OUTPUT_DIR=~/Downloads/ix
IX_SCRAPER_USERNAME=your_username
IX_SCRAPER_PASSWORD=your_password
IX_SCRAPER_OVERWRITE=false
IX_SCRAPER_EXPORT_FORMATS=["pdf"]

Refer to the iX scraper config documentation for more information.

Usage

# pip install -e .[scraper]
from rag_agent.scrappers.ix import IXScraper, Settings

async def main():
    # Initialize with default settings
    async with IXScraper() as scraper:
        # Run the scraper to export each article separately using export configuration
        await scraper.run()

# Run the scraper
import asyncio
asyncio.run(main())

or, to download complete issues instead of individual articles:

# pip install -e .[scraper]
from rag_agent.scrappers.ix import IXDownloader, Settings

async def main():
    # Initialize with default settings
    async with IXDownloader() as scraper:
        # Run the scraper to download the issues in PDF format
        await scraper.run()

# Run the scraper
import asyncio
asyncio.run(main())

Export Formats

The scraper supports multiple export formats:

  1. PDF Export

    • High-quality PDF output
    • Configurable page settings
    • Base64 encoded transfer
  2. MHTML Export

    • Preserves web page structure
    • Includes all resources
    • Suitable for archival

WebDriver Configuration

The scraper uses Selenium WebDriver with configurable options:

  • Headless mode support
  • Custom user agent
  • Resource optimization
  • Security settings

Vector Store

The pipeline uses ChromaDB for vector storage with features like:

  • Configurable embedding models
  • Multiple search strategies
  • Persistent storage
  • Duplicate detection
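
Duplicate detection can be as simple as hashing each chunk's content and skipping hashes already present in the store. A minimal sketch (the pipeline's actual scheme, e.g. hashing content plus metadata, may differ):

```python
import hashlib

def dedupe(chunks: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct chunk, by SHA-256."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Storing the digests alongside the vectors lets re-ingestion of the same documents skip chunks that are already indexed.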

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • LangChain for the RAG framework
  • ChromaDB for vector storage
  • Ollama for local LLM support
  • Hugging Face for model hosting
  • iX publishers
