
Retrieve PDF file context for your LLMs


RAGPDF

A Python package for Retrieval-Augmented Generation (RAG) using PDFs. RAGPDF makes it easy to extract, embed, and query content from PDF documents using modern language models.

Features

  • Easy to Use: Simple API for adding PDFs and querying their content
  • PDF Processing: Automatic text extraction and chunking from PDF documents
  • Vector Search: Fast similarity search using FAISS
  • Async Support: Built with asyncio for high performance
  • LLM Integration: Seamless integration with various LLM providers through litellm
  • Configurable: Flexible configuration for embedding and LLM models
  • Persistent Storage: Optional FAISS index persistence
  • Context Inspection: Access and analyze intermediate context for better control

Installation

pip install ragpdf

Quick Start

import asyncio
from ragpdf import RAGPDF, EmbeddingConfig, LLMConfig

# Configure your models
embedding_config = EmbeddingConfig(
    model="text-embedding-ada-002",  # OpenAI embedding model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1"  # Optional: default OpenAI base URL
)

llm_config = LLMConfig(
    model="gpt-3.5-turbo",  # OpenAI chat model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1",  # Optional: default OpenAI base URL
    temperature=0.7
)

# Create RAGPDF instance
rag = RAGPDF(embedding_config, llm_config)

async def main():
    # Add a PDF
    await rag.add("document.pdf")
    
    # Get and inspect context
    context = await rag.context("What is this document about?")
    
    # View context in different formats
    print("\nFormatted context:")
    print(context.to_string())  # Human-readable format
    
    print("\nJSON format for detailed inspection:")
    print(context.to_json())    # Structured format for analysis
    
    # Use the context for chat
    response = await rag.chat("Summarize the key points")
    print("\nAI Response:")
    print(response)

if __name__ == "__main__":
    asyncio.run(main())

Context Inspection

RAGPDF provides powerful context inspection capabilities, allowing you to examine and validate the intermediate context used for RAG. This is particularly useful during development and debugging.

RAGContext Class

class RAGContext:
    """Context information for RAG operations."""
    query: str           # Original query
    chunks: List[DocumentChunk]  # Retrieved text chunks
    files: List[str]     # Source PDF files
    total_chunks: int    # Total chunks found
    
    def to_string(self) -> str:
        """Convert context to human-readable format."""
        # Example output:
        # Query: What is the main topic?
        # Found 3 relevant chunks from 2 files:
        # document1.pdf, document2.pdf
        #
        # From document1.pdf (page 1):
        # [chunk content...]
    
    def to_json(self) -> str:
        """Convert context to JSON for detailed analysis."""
        # Returns structured JSON with all context details

Development Workflow

async def development_workflow():
    rag = RAGPDF(embedding_config, llm_config)
    await rag.add("document.pdf")
    
    # 1. Inspect retrieved context
    context = await rag.context("What is the main topic?")
    
    # Check which files were used
    print(f"Retrieved chunks from: {context.files}")
    
    # Examine individual chunks
    for chunk in context.chunks:
        print(f"\nFrom {chunk.file}" + 
              (f" (page {chunk.page})" if chunk.page else ""))
        print(chunk.content)
    
    # 2. Validate context quality
    if not any("relevant keyword" in chunk.content 
               for chunk in context.chunks):
        print("Warning: Expected content not found in context")
    
    # 3. Generate response with validated context
    response = await rag.chat("What is the main topic?")
    print("\nAI Response:", response)

Context Analysis Examples

async def analyze_context():
    rag = RAGPDF(embedding_config, llm_config)
    
    # Add multiple PDFs
    for pdf in ["doc1.pdf", "doc2.pdf"]:
        await rag.add(pdf)
    
    # Get context for analysis
    context = await rag.context("What are the key findings?")
    
    # 1. Source distribution analysis
    file_distribution = {}
    for chunk in context.chunks:
        file_distribution[chunk.file] = file_distribution.get(chunk.file, 0) + 1
    
    print("\nChunk distribution across files:")
    for file, count in file_distribution.items():
        print(f"{file}: {count} chunks")
    
    # 2. Content relevance check
    query_terms = set(context.query.lower().split())
    relevant_chunks = []
    
    for chunk in context.chunks:
        chunk_terms = set(chunk.content.lower().split())
        overlap = len(query_terms & chunk_terms)
        relevant_chunks.append({
            'file': chunk.file,
            'page': chunk.page,
            'term_overlap': overlap
        })
    
    print("\nChunk relevance analysis:")
    for chunk in sorted(relevant_chunks, 
                       key=lambda x: x['term_overlap'], 
                       reverse=True):
        print(f"File: {chunk['file']}, "
              f"Page: {chunk['page']}, "
              f"Term overlap: {chunk['term_overlap']}")
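The term-overlap heuristic above can be factored into a reusable, self-contained helper. Note this is a crude lexical proxy for relevance (embedding similarity is more robust), and it is not part of the RAGPDF API:

```python
from typing import List, Tuple

def rank_by_term_overlap(query: str, chunks: List[str]) -> List[Tuple[int, str]]:
    """Rank chunk texts by how many unique query terms they share.

    Returns (overlap, chunk) pairs sorted best-first.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(chunk.lower().split())), chunk)
        for chunk in chunks
    ]
    # Sort by overlap count, highest first
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```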

Model Configuration

RAGPDF uses litellm under the hood, making it compatible with any LLM provider supported by litellm. The model name and configuration must follow litellm's format.

OpenAI

# OpenAI API
config = LLMConfig(
    model="gpt-3.5-turbo",
    api_key="your-openai-key",
    api_base="https://api.openai.com/v1"  # Default OpenAI base URL
)

# Azure OpenAI
config = LLMConfig(
    model="azure/gpt-35-turbo",  # Prefix with 'azure/'
    api_key="your-azure-key",
    api_base="https://your-endpoint.openai.azure.com"
)

Anthropic

config = LLMConfig(
    model="claude-2",
    api_key="your-anthropic-key",
    api_base="https://api.anthropic.com"  # Default Anthropic base URL
)

Google

config = LLMConfig(
    model="gemini/gemini-pro",  # Prefix with 'gemini/'
    api_key="your-google-key",
    api_base="https://generativelanguage.googleapis.com"
)

Ollama

config = LLMConfig(
    model="ollama/llama2",  # Prefix with 'ollama/'
    api_base="http://localhost:11434"  # Local Ollama server
)

Custom Endpoints

# Self-hosted LLM API
config = LLMConfig(
    model="your-model-name",
    api_base="http://your-custom-endpoint:8000/v1",
    api_key="optional-key"  # Optional for self-hosted
)

Environment Variables

RAGPDF supports configuration through environment variables. The api_base is optional and defaults to the provider's standard endpoint:

# OpenAI
EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_API_KEY=your-openai-key
EMBEDDING_BASE_URL=https://api.openai.com/v1

LLM_MODEL=gpt-3.5-turbo
LLM_API_KEY=your-openai-key
LLM_BASE_URL=https://api.openai.com/v1

# Azure OpenAI
LLM_MODEL=azure/gpt-35-turbo
LLM_API_KEY=your-azure-key
LLM_BASE_URL=https://your-endpoint.openai.azure.com

# Anthropic
LLM_MODEL=claude-2
LLM_API_KEY=your-anthropic-key
LLM_BASE_URL=https://api.anthropic.com

# Google
LLM_MODEL=gemini/gemini-pro
LLM_API_KEY=your-google-key
LLM_BASE_URL=https://generativelanguage.googleapis.com

# Ollama
LLM_MODEL=ollama/llama2
LLM_BASE_URL=http://localhost:11434
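Assuming RAGPDF reads the variable names listed above, the same values can also be collected from Python into a plain config dict before constructing the instance. The helper below is an illustration, not the library's own loader:

```python
import os

def env_config(prefix: str) -> dict:
    """Collect <PREFIX>_MODEL / <PREFIX>_API_KEY / <PREFIX>_BASE_URL into a
    config dict, skipping unset variables so library defaults still apply."""
    mapping = {
        "model": f"{prefix}_MODEL",
        "api_key": f"{prefix}_API_KEY",
        "api_base": f"{prefix}_BASE_URL",
    }
    return {key: os.environ[var] for key, var in mapping.items() if var in os.environ}
```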

API Reference

RAGPDF Class

class RAGPDF:
    def __init__(self, 
                 embedding_config: Union[Dict[str, Any], EmbeddingConfig],
                 llm_config: Optional[Union[Dict[str, Any], LLMConfig]] = None,
                 index_path: Optional[str] = None):
        """Initialize RAGPDF with embedding and LLM configurations."""

    async def add(self, pdf_path: str) -> None:
        """Add a PDF document to the system."""

    async def context(self, query: str, k: int = 5) -> RAGContext:
        """Get relevant context for a query."""

    async def chat(self, prompt: str, k: int = 5, stream: bool = False) -> Union[str, AsyncIterator[str]]:
        """Generate a response using the LLM based on context."""
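Per the signature above, `chat(..., stream=True)` returns an async iterator of text fragments rather than a full string. Consuming it looks like the sketch below, where a stub generator stands in for the real call (the actual fragment boundaries are provider-dependent):

```python
import asyncio
from typing import AsyncIterator

async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for `await rag.chat(prompt, stream=True)`."""
    for fragment in ["The document ", "covers RAG ", "with PDFs."]:
        yield fragment

async def consume(stream: AsyncIterator[str]) -> str:
    parts = []
    async for fragment in stream:
        print(fragment, end="", flush=True)  # render fragments as they arrive
        parts.append(fragment)
    return "".join(parts)

full = asyncio.run(consume(fake_stream()))
```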

Configuration Models

class BaseConfig:
    """Base configuration for API models."""
    model: str           # Model name (litellm compatible)
    api_key: str = ""   # API key (optional)
    api_base: Optional[str] = None  # API base URL (optional)

class EmbeddingConfig(BaseConfig):
    """Configuration for embedding model."""
    pass

class LLMConfig(BaseConfig):
    """Configuration for language model."""
    temperature: float = 0.7  # Response temperature (optional)
    max_tokens: Optional[int] = None  # Maximum response length (optional)
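Per the `__init__` signature in the API reference, plain dicts are accepted in place of the config objects. A minimal dict equivalent of the Quick Start configuration, assuming the keys mirror the model fields above:

```python
embedding_config = {
    "model": "text-embedding-ada-002",
    "api_key": "your-api-key",
}
llm_config = {
    "model": "gpt-3.5-turbo",
    "api_key": "your-api-key",
    "temperature": 0.7,
}
# rag = RAGPDF(embedding_config, llm_config)
```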

Examples

Using Different LLM Providers

# OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="text-embedding-ada-002",
        api_key="your-openai-key"
    ),
    llm_config=LLMConfig(
        model="gpt-3.5-turbo",
        api_key="your-openai-key"
    )
)

# Ollama (local)
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="ollama/nomic-embed-text",
        api_base="http://localhost:11434"
    ),
    llm_config=LLMConfig(
        model="ollama/llama2",
        api_base="http://localhost:11434"
    )
)

# Azure OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="azure/text-embedding-ada-002",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    ),
    llm_config=LLMConfig(
        model="azure/gpt-35-turbo",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    )
)

Persistent Storage

# Initialize with index storage
rag = RAGPDF(
    embedding_config=embedding_config,
    llm_config=llm_config,
    index_path="data/faiss_index.bin"
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
