Retrieve PDF file context for your LLMs
RAGPDF
A Python package for Retrieval-Augmented Generation (RAG) using PDFs. RAGPDF makes it easy to extract, embed, and query content from PDF documents using modern language models.
Features
- Easy to Use: Simple API for adding PDFs and querying their content
- PDF Processing: Automatic text extraction and chunking from PDF documents
- Vector Search: Fast similarity search using FAISS
- Async Support: Built with asyncio for high performance
- LLM Integration: Seamless integration with various LLM providers through litellm
- Configurable: Flexible configuration for embedding and LLM models
- Persistent Storage: Optional FAISS index persistence
- Context Inspection: Access and analyze intermediate context for better control
Installation
pip install ragpdf
Quick Start
import asyncio
from ragpdf import RAGPDF, EmbeddingConfig, LLMConfig

# Configure your models
embedding_config = EmbeddingConfig(
    model="text-embedding-ada-002",  # OpenAI embedding model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1"  # Optional: default OpenAI base URL
)

llm_config = LLMConfig(
    model="gpt-3.5-turbo",  # OpenAI chat model
    api_key="your-api-key",
    api_base="https://api.openai.com/v1",  # Optional: default OpenAI base URL
    temperature=0.7
)

# Create RAGPDF instance
rag = RAGPDF(embedding_config, llm_config)

async def main():
    # Add a PDF
    await rag.add("document.pdf")

    # Get and inspect context
    context = await rag.context("What is this document about?")

    # View context in different formats
    print("\nFormatted context:")
    print(context.to_string())  # Human-readable format

    print("\nJSON format for detailed inspection:")
    print(context.to_json())  # Structured format for analysis

    # Use the context for chat
    response = await rag.chat("Summarize the key points")
    print("\nAI Response:")
    print(response)

if __name__ == "__main__":
    asyncio.run(main())
Context Inspection
RAGPDF provides powerful context inspection capabilities, allowing you to examine and validate the intermediate context used for RAG. This is particularly useful during development and debugging.
RAGContext Class
class RAGContext:
    """Context information for RAG operations."""
    query: str                   # Original query
    chunks: List[DocumentChunk]  # Retrieved text chunks
    files: List[str]             # Source PDF files
    total_chunks: int            # Total chunks found

    def to_string(self) -> str:
        """Convert context to human-readable format."""
        # Example output:
        # Query: What is the main topic?
        # Found 3 relevant chunks from 2 files:
        # document1.pdf, document2.pdf
        #
        # From document1.pdf (page 1):
        # [chunk content...]

    def to_json(self) -> str:
        """Convert context to JSON for detailed analysis."""
        # Returns structured JSON with all context details
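Because to_json() returns a plain JSON string, the context can be handed to standard tooling for offline analysis. A minimal sketch, assuming the JSON keys mirror the RAGContext fields above:

import json

async def inspect_as_json():
    context = await rag.context("What is the main topic?")
    data = json.loads(context.to_json())  # parse the structured context

    # Assumption: key names mirror the RAGContext fields shown above
    print(f"Query: {data['query']}")
    print(f"Total chunks: {data['total_chunks']}")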
Development Workflow
async def development_workflow():
    rag = RAGPDF(embedding_config, llm_config)
    await rag.add("document.pdf")

    # 1. Inspect retrieved context
    context = await rag.context("What is the main topic?")

    # Check which files were used
    print(f"Retrieved chunks from: {context.files}")

    # Examine individual chunks
    for chunk in context.chunks:
        print(f"\nFrom {chunk.file}" +
              (f" (page {chunk.page})" if chunk.page else ""))
        print(chunk.content)

    # 2. Validate context quality
    if not any("relevant keyword" in chunk.content
               for chunk in context.chunks):
        print("Warning: Expected content not found in context")

    # 3. Generate response with validated context
    response = await rag.chat("What is the main topic?")
    print("\nAI Response:", response)
Context Analysis Examples
async def analyze_context():
    rag = RAGPDF(embedding_config, llm_config)

    # Add multiple PDFs
    for pdf in ["doc1.pdf", "doc2.pdf"]:
        await rag.add(pdf)

    # Get context for analysis
    context = await rag.context("What are the key findings?")

    # 1. Source distribution analysis
    file_distribution = {}
    for chunk in context.chunks:
        file_distribution[chunk.file] = file_distribution.get(chunk.file, 0) + 1

    print("\nChunk distribution across files:")
    for file, count in file_distribution.items():
        print(f"{file}: {count} chunks")

    # 2. Content relevance check
    query_terms = set(context.query.lower().split())
    relevant_chunks = []
    for chunk in context.chunks:
        chunk_terms = set(chunk.content.lower().split())
        overlap = len(query_terms & chunk_terms)
        relevant_chunks.append({
            'file': chunk.file,
            'page': chunk.page,
            'term_overlap': overlap
        })

    print("\nChunk relevance analysis:")
    for chunk in sorted(relevant_chunks,
                        key=lambda x: x['term_overlap'],
                        reverse=True):
        print(f"File: {chunk['file']}, "
              f"Page: {chunk['page']}, "
              f"Term overlap: {chunk['term_overlap']}")
Model Configuration
RAGPDF uses litellm under the hood, making it compatible with any LLM provider supported by litellm. The model name and configuration must follow litellm's format.
OpenAI
# OpenAI API
config = LLMConfig(
    model="gpt-3.5-turbo",
    api_key="your-openai-key",
    api_base="https://api.openai.com/v1"  # Default OpenAI base URL
)

# Azure OpenAI
config = LLMConfig(
    model="azure/gpt-35-turbo",  # Prefix with 'azure/'
    api_key="your-azure-key",
    api_base="https://your-endpoint.openai.azure.com"
)
Anthropic
config = LLMConfig(
    model="claude-2",
    api_key="your-anthropic-key",
    api_base="https://api.anthropic.com"  # Default Anthropic base URL
)
Google
config = LLMConfig(
    model="gemini/gemini-pro",  # Prefix with 'gemini/'
    api_key="your-google-key",
    api_base="https://generativelanguage.googleapis.com"
)
Ollama
config = LLMConfig(
    model="ollama/llama2",  # Prefix with 'ollama/'
    api_base="http://localhost:11434"  # Local Ollama server
)
Custom Endpoints
# Self-hosted LLM API
config = LLMConfig(
    model="your-model-name",
    api_base="http://your-custom-endpoint:8000/v1",
    api_key="optional-key"  # Optional for self-hosted deployments
)
Environment Variables
RAGPDF also supports configuration through environment variables. The *_BASE_URL variables are optional and default to each provider's standard endpoint:
# OpenAI
EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_API_KEY=your-openai-key
EMBEDDING_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-3.5-turbo
LLM_API_KEY=your-openai-key
LLM_BASE_URL=https://api.openai.com/v1
# Azure OpenAI
LLM_MODEL=azure/gpt-35-turbo
LLM_API_KEY=your-azure-key
LLM_BASE_URL=https://your-endpoint.openai.azure.com
# Anthropic
LLM_MODEL=claude-2
LLM_API_KEY=your-anthropic-key
LLM_BASE_URL=https://api.anthropic.com
# Google
LLM_MODEL=gemini/gemini-pro
LLM_API_KEY=your-google-key
LLM_BASE_URL=https://generativelanguage.googleapis.com
# Ollama
LLM_MODEL=ollama/llama2
LLM_BASE_URL=http://localhost:11434
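To keep credentials out of source code, the same values can be read from the environment and passed to the config classes explicitly. A minimal sketch using os.environ with the variable names listed above:

import os
from ragpdf import RAGPDF, EmbeddingConfig, LLMConfig

embedding_config = EmbeddingConfig(
    model=os.environ["EMBEDDING_MODEL"],
    api_key=os.environ.get("EMBEDDING_API_KEY", ""),
    api_base=os.environ.get("EMBEDDING_BASE_URL")  # optional; None uses the provider default
)

llm_config = LLMConfig(
    model=os.environ["LLM_MODEL"],
    api_key=os.environ.get("LLM_API_KEY", ""),
    api_base=os.environ.get("LLM_BASE_URL")
)

rag = RAGPDF(embedding_config, llm_config)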
API Reference
RAGPDF Class
class RAGPDF:
    def __init__(self,
                 embedding_config: Union[Dict[str, Any], EmbeddingConfig],
                 llm_config: Optional[Union[Dict[str, Any], LLMConfig]] = None,
                 index_path: Optional[str] = None):
        """Initialize RAGPDF with embedding and LLM configurations."""

    async def add(self, pdf_path: str) -> None:
        """Add a PDF document to the system."""

    async def context(self, query: str, k: int = 5) -> RAGContext:
        """Get relevant context for a query."""

    async def chat(self, prompt: str, k: int = 5, stream: bool = False) -> Union[str, AsyncIterator[str]]:
        """Generate a response using the LLM based on context."""
Configuration Models
class BaseConfig:
    """Base configuration for API models."""
    model: str                      # Model name (litellm compatible)
    api_key: str = ""               # API key (optional)
    api_base: Optional[str] = None  # API base URL (optional)

class EmbeddingConfig(BaseConfig):
    """Configuration for embedding model."""
    pass

class LLMConfig(BaseConfig):
    """Configuration for language model."""
    temperature: float = 0.7          # Response temperature (optional)
    max_tokens: Optional[int] = None  # Maximum response length (optional)
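Because __init__ accepts plain dictionaries as well as config objects, the same setup can be written without importing the config classes. The dictionary keys are assumed to mirror the config fields above:

rag = RAGPDF(
    embedding_config={
        "model": "text-embedding-ada-002",
        "api_key": "your-openai-key"
    },
    llm_config={
        "model": "gpt-3.5-turbo",
        "api_key": "your-openai-key",
        "temperature": 0.7
    }
)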
Examples
Using Different LLM Providers
# OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="text-embedding-ada-002",
        api_key="your-openai-key"
    ),
    llm_config=LLMConfig(
        model="gpt-3.5-turbo",
        api_key="your-openai-key"
    )
)

# Ollama (local)
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="ollama/nomic-embed-text",
        api_base="http://localhost:11434"
    ),
    llm_config=LLMConfig(
        model="ollama/llama2",
        api_base="http://localhost:11434"
    )
)

# Azure OpenAI
rag = RAGPDF(
    embedding_config=EmbeddingConfig(
        model="azure/text-embedding-ada-002",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    ),
    llm_config=LLMConfig(
        model="azure/gpt-35-turbo",
        api_key="your-azure-key",
        api_base="https://your-endpoint.openai.azure.com"
    )
)
Persistent Storage
# Initialize with index storage
rag = RAGPDF(
    embedding_config=embedding_config,
    llm_config=llm_config,
    index_path="data/faiss_index.bin"
)
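A sketch of using persistence across runs. Whether an existing index at index_path is reused rather than rebuilt is an assumption; re-ingestion is skipped here only when the file already exists:

import os

index_path = "data/faiss_index.bin"
rag = RAGPDF(embedding_config, llm_config, index_path=index_path)

async def build_or_reuse():
    # Assumption: an index already saved at index_path is loaded on
    # startup, so documents only need to be ingested once
    if not os.path.exists(index_path):
        await rag.add("document.pdf")
    print(await rag.chat("What is this document about?"))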
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.