
Retrieve PDF file context for your LLMs


RAGPDF

A Python package for Retrieval-Augmented Generation (RAG) using PDFs. RAGPDF makes it easy to extract, embed, and query content from PDF documents using modern language models.

Features

  • Easy to Use: Simple API for adding PDFs and querying their content
  • PDF Processing: Automatic text extraction and chunking from PDF documents
  • Vector Search: Fast similarity search using FAISS
  • Async Support: Built with asyncio for high performance
  • LLM Integration: Seamless integration with various LLM providers through litellm
  • Configurable: Flexible configuration for embedding and LLM models
  • Persistent Storage: Optional FAISS index persistence
  • Context Inspection: Access and analyze intermediate context for better control

Installation

pip install ragpdf

Quick Start

import asyncio
from ragpdf import RAGPDF, EmbeddingConfig, LLMConfig

# Configure your models
embedding_config = EmbeddingConfig(
    model="text-embedding-ada-002",  # OpenAI embedding model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1"  # Optional: default OpenAI base URL
)

llm_config = LLMConfig(
    model="gpt-3.5-turbo",  # OpenAI chat model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1",  # Optional: default OpenAI base URL
    temperature=0.7
)

# Create RAGPDF instance
rag = RAGPDF(embedding_config, llm_config)

async def main():
    # Add a PDF
    await rag.add("document.pdf")
    
    # Get and inspect context
    context = await rag.context("What is this document about?")
    
    # View context in different formats
    print("\nFormatted context:")
    print(context.to_string())  # Human-readable format
    
    print("\nJSON format for detailed inspection:")
    print(context.to_json())    # Structured format for analysis
    
    # Use the context for chat
    response = await rag.chat("Summarize the key points")
    print("\nAI Response:")
    print(response)

if __name__ == "__main__":
    asyncio.run(main())

Context Inspection

RAGPDF provides powerful context inspection capabilities, allowing you to examine and validate the intermediate context used for RAG. This is particularly useful during development and debugging.

RAGContext Class

class RAGContext:
    """Context information for RAG operations."""
    query: str           # Original query
    chunks: List[DocumentChunk]  # Retrieved text chunks
    files: List[str]     # Source PDF files
    total_chunks: int    # Total chunks found
    
    def to_string(self) -> str:
        """Convert context to human-readable format."""
        # Example output:
        # Query: What is the main topic?
        # Found 3 relevant chunks from 2 files:
        # document1.pdf, document2.pdf
        #
        # From document1.pdf (page 1):
        # [chunk content...]
    
    def to_json(self) -> str:
        """Convert context to JSON for detailed analysis."""
        # Returns structured JSON with all context details

Development Workflow

async def development_workflow():
    rag = RAGPDF(embedding_config, llm_config)
    await rag.add("document.pdf")
    
    # 1. Inspect retrieved context
    context = await rag.context("What is the main topic?")
    
    # Check which files were used
    print(f"Retrieved chunks from: {context.files}")
    
    # Examine individual chunks
    for chunk in context.chunks:
        print(f"\nFrom {chunk.file}" + 
              (f" (page {chunk.page})" if chunk.page else ""))
        print(chunk.content)
    
    # 2. Validate context quality
    if not any("relevant keyword" in chunk.content 
               for chunk in context.chunks):
        print("Warning: Expected content not found in context")
    
    # 3. Generate response with validated context
    response = await rag.chat("What is the main topic?")
    print("\nAI Response:", response)

Context Analysis Examples

async def analyze_context():
    rag = RAGPDF(embedding_config, llm_config)
    
    # Add multiple PDFs
    for pdf in ["doc1.pdf", "doc2.pdf"]:
        await rag.add(pdf)
    
    # Get context for analysis
    context = await rag.context("What are the key findings?")
    
    # 1. Source distribution analysis
    file_distribution = {}
    for chunk in context.chunks:
        file_distribution[chunk.file] = file_distribution.get(chunk.file, 0) + 1
    
    print("\nChunk distribution across files:")
    for file, count in file_distribution.items():
        print(f"{file}: {count} chunks")
    
    # 2. Content relevance check
    query_terms = set(context.query.lower().split())
    relevant_chunks = []
    
    for chunk in context.chunks:
        chunk_terms = set(chunk.content.lower().split())
        overlap = len(query_terms & chunk_terms)
        relevant_chunks.append({
            'file': chunk.file,
            'page': chunk.page,
            'term_overlap': overlap
        })
    
    print("\nChunk relevance analysis:")
    for chunk in sorted(relevant_chunks, 
                       key=lambda x: x['term_overlap'], 
                       reverse=True):
        print(f"File: {chunk['file']}, "
              f"Page: {chunk['page']}, "
              f"Term overlap: {chunk['term_overlap']}")
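The term-overlap heuristic above can be factored into a reusable, self-contained helper. Note this is a crude lexical proxy for relevance (embedding similarity is more robust), and it is not part of the RAGPDF API:

```python
from typing import List, Tuple

def rank_by_term_overlap(query: str, chunks: List[str]) -> List[Tuple[int, str]]:
    """Rank chunk texts by how many unique query terms they share.

    Returns (overlap, chunk) pairs sorted best-first.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk.lower().split())), chunk)
        for chunk in chunks
    ]
    # Sort by overlap count, highest first
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```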

Model Configuration

RAGPDF uses litellm under the hood, making it compatible with any LLM provider supported by litellm. The model name and configuration must follow litellm's format.

OpenAI

# OpenAI API
config = LLMConfig(
    model="gpt-3.5-turbo",
    api_key="your-openai-key",
    api_base="https://api.openai.com/v1"  # Default OpenAI base URL
)

# Azure OpenAI
config = LLMConfig(
    model="azure/gpt-35-turbo",  # Prefix with 'azure/'
    api_key="your-azure-key",
    api_base="https://your-endpoint.openai.azure.com"
)

Anthropic

config = LLMConfig(
    model="claude-2",
    api_key="your-anthropic-key",
    api_base="https://api.anthropic.com"  # Default Anthropic base URL
)

Google

config = LLMConfig(
    model="gemini/gemini-pro",  # Prefix with 'gemini/'
    api_key="your-google-key",
    api_base="https://generativelanguage.googleapis.com"
)

Ollama

config = LLMConfig(
    model="ollama/llama2",  # Prefix with 'ollama/'
    api_base="http://localhost:11434"  # Local Ollama server
)

Custom Endpoints

# Self-hosted LLM API
config = LLMConfig(
    model="your-model-name",
    api_base="http://your-custom-endpoint:8000/v1",
    api_key="optional-key"  # Optional for self-hosted
)

Environment Variables

RAGPDF supports configuration through environment variables. The api_base is optional and defaults to the provider's standard endpoint:

# OpenAI
EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_API_KEY=your-openai-key
EMBEDDING_BASE_URL=https://api.openai.com/v1

LLM_MODEL=gpt-3.5-turbo
LLM_API_KEY=your-openai-key
LLM_BASE_URL=https://api.openai.com/v1

# Azure OpenAI
LLM_MODEL=azure/gpt-35-turbo
LLM_API_KEY=your-azure-key
LLM_BASE_URL=https://your-endpoint.openai.azure.com

# Anthropic
LLM_MODEL=claude-2
LLM_API_KEY=your-anthropic-key
LLM_BASE_URL=https://api.anthropic.com

# Google
LLM_MODEL=gemini/gemini-pro
LLM_API_KEY=your-google-key
LLM_BASE_URL=https://generativelanguage.googleapis.com

# Ollama
LLM_MODEL=ollama/llama2
LLM_BASE_URL=http://localhost:11434
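Assuming RAGPDF reads the variable names listed above, the same values can also be collected from Python into a plain config dict before constructing the instance. The helper below is an illustration, not the library's own loader:

```python
import os

def env_config(prefix: str) -> dict:
    """Collect <PREFIX>_MODEL / <PREFIX>_API_KEY / <PREFIX>_BASE_URL into a
    config dict, skipping unset variables so library defaults still apply."""
    mapping = {
        "model": f"{prefix}_MODEL",
        "api_key": f"{prefix}_API_KEY",
        "api_base": f"{prefix}_BASE_URL",
    }
    return {key: os.environ[var] for key, var in mapping.items() if var in os.environ}
```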

API Reference

RAGPDF Class

class RAGPDF:
    def __init__(self, 
                 embedding_config: Union[Dict[str, Any], EmbeddingConfig],
                 llm_config: Optional[Union[Dict[str, Any], LLMConfig]] = None,
                 index_path: Optional[str] = None):
        """Initialize RAGPDF with embedding and LLM configurations."""

    async def add(self, pdf_path: str) -> None:
        """Add a PDF document to the system."""

    async def context(self, query: str, k: int = 5) -> RAGContext:
        """Get relevant context for a query."""

    async def chat(self, prompt: str, k: int = 5, stream: bool = False) -> Union[str, AsyncIterator[str]]:
        """Generate a response using the LLM based on context."""
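Per the signature above, `chat(..., stream=True)` returns an async iterator of text fragments rather than a full string. Consuming it looks like the sketch below, where a stub generator stands in for the real call (the actual fragment boundaries are provider-dependent):

```python
import asyncio
from typing import AsyncIterator

async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for `await rag.chat(prompt, stream=True)`."""
    for fragment in ["The document ", "covers RAG ", "with PDFs."]:
        yield fragment

async def consume(stream: AsyncIterator[str]) -> str:
    parts = []
    async for fragment in stream:
        print(fragment, end="", flush=True)  # render fragments as they arrive
        parts.append(fragment)
    return "".join(parts)

full = asyncio.run(consume(fake_stream()))
```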

Configuration Models

class BaseConfig:
    """Base configuration for API models."""
    model: str           # Model name (litellm compatible)
    api_key: str = ""   # API key (optional)
    api_base: Optional[str] = None  # API base URL (optional)

class EmbeddingConfig(BaseConfig):
    """Configuration for embedding model."""
    pass

class LLMConfig(BaseConfig):
    """Configuration for language model."""
    temperature: float = 0.7  # Response temperature (optional)
    max_tokens: Optional[int] = None  # Maximum response length (optional)
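Per the `__init__` signature in the API reference, plain dicts are accepted in place of the config objects. A minimal dict equivalent of the Quick Start configuration, assuming the keys mirror the model fields above:

```python
embedding_config = {
    "model": "text-embedding-ada-002",
    "api_key": "your-api-key",
}
llm_config = {
    "model": "gpt-3.5-turbo",
    "api_key": "your-api-key",
    "temperature": 0.7,
}
# rag = RAGPDF(embedding_config, llm_config)
```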

Examples

Using Different LLM Providers

# OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="text-embedding-ada-002",
        api_key="your-openai-key"
    ),
    llm_config=LLMConfig(
        model="gpt-3.5-turbo",
        api_key="your-openai-key"
    )
)

# Ollama (local)
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="ollama/nomic-embed-text",
        api_base="http://localhost:11434"
    ),
    llm_config=LLMConfig(
        model="ollama/llama2",
        api_base="http://localhost:11434"
    )
)

# Azure OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="azure/text-embedding-ada-002",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    ),
    llm_config=LLMConfig(
        model="azure/gpt-35-turbo",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    )
)

Persistent Storage

# Initialize with index storage
rag = RAGPDF(
    embedding_config=embedding_config,
    llm_config=llm_config,
    index_path="data/faiss_index.bin"
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
