LlamaSearch PDF
A comprehensive PDF processing toolkit for document workflows.
Features
- OCR (Optical Character Recognition) - Extract text from scanned PDFs and images with support for multiple OCR engines:
  - Tesseract OCR
  - OCRmyPDF
  - Hugging Face Models
- PDF Manipulation - Core utilities for working with PDF files:
  - Merging multiple PDFs
  - Splitting PDFs
  - Converting images to PDFs
  - Optimizing PDF file size
- Text Extraction - Extract and process text from PDF documents:
  - Direct text extraction from PDF content streams
  - OCR fallback for scanned documents
  - Text normalization and processing options
- Metadata Management - Extract, update, and manage PDF document metadata:
  - Standard document information (title, author, etc.)
  - XMP metadata extraction
  - Text-based metadata detection
- Search and Indexing - Create searchable indices for PDF documents:
  - Full-text search with TF-IDF ranking
  - Page-level search results with context snippets
  - Search across multiple documents
  - Save and load search indices
- Batch Processing - Process multiple files efficiently:
  - Multi-threaded processing
  - Directory processing with filtering
  - Progress tracking and logging
Installation
pip install llamasearch-pdf
Optional extras enable additional OCR engines and features (the quotes keep shells such as zsh from interpreting the square brackets):
# For OCRmyPDF support
pip install "llamasearch-pdf[ocrmypdf]"
# For Hugging Face OCR support
pip install "llamasearch-pdf[huggingface]"
# For search example functionality
pip install "llamasearch-pdf[search]"
# For all features
pip install "llamasearch-pdf[all]"
Command-Line Usage
The package provides a command-line interface for common operations:
OCR Operations
# List available OCR engines
llamasearch-pdf ocr list-engines
# OCR a single PDF file
llamasearch-pdf ocr file document.pdf --output document_ocr.pdf
# Extract text from a PDF via OCR
llamasearch-pdf ocr file document.pdf --format text --output document_text.txt
# Process a directory of PDFs
llamasearch-pdf ocr directory ./documents --output ./documents_ocr --format pdf --recursive
Text Extraction
# Extract text from a PDF
llamasearch-pdf text extract document.pdf --output document_text.txt
# Extract text with layout preservation
llamasearch-pdf text extract document.pdf --preserve-layout --output document_text.txt
# Batch extract text from a directory
llamasearch-pdf text batch ./documents --output ./extracted_text --recursive
Metadata Operations
# Extract metadata from a PDF
llamasearch-pdf metadata extract document.pdf
# Extract detailed metadata including XMP
llamasearch-pdf metadata extract document.pdf --detailed
# Update metadata in a PDF
llamasearch-pdf metadata update document.pdf --title "New Title" --author "New Author"
# Extract metadata from text content
llamasearch-pdf metadata text document.pdf
# Batch extract metadata from a directory
llamasearch-pdf metadata batch ./documents --output ./metadata --recursive
Search Operations
# Create a search index for a directory of PDFs
llamasearch-pdf search create-index ./documents --output-path ./search_index.pkl
# Search using a saved index
llamasearch-pdf search search "quantum computing" --index-path ./search_index.pkl
# Search specific PDF files directly
llamasearch-pdf search search "neural networks" --files doc1.pdf doc2.pdf doc3.pdf
# Search with custom options and JSON output
llamasearch-pdf search search "machine learning" --index-path ./search_index.pkl --max-results 20 --min-score 0.5 --json
Python API Usage
OCR Operations
from llamasearch_pdf.ocr import ocr_pdf, ocr_image, process_directory
# OCR a PDF file
ocr_pdf('document.pdf', 'document_ocr.pdf')
# Extract text from an image
text = ocr_image('scan.jpg')
# Process a directory of PDFs and images
results = process_directory('documents/', 'documents_ocr/', output_format='text')
Text Extraction
from llamasearch_pdf.core import extract_text, TextExtractor
# Simple text extraction
text_dict = extract_text('document.pdf', preserve_layout=True)
# Advanced extraction with custom options
extractor = TextExtractor(
    preserve_layout=True,
    remove_hyphenation=True,
    normalize_whitespace=True,
    fallback_to_ocr=True
)
text_dict = extractor.extract_text_from_pdf('document.pdf')
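The fallback_to_ocr option suggests a per-page decision: keep the directly extracted text when a page has a usable text layer, and run OCR only when it does not. The sketch below illustrates that idea in plain Python. It is not the library's implementation: the function name extract_with_fallback, the min_chars threshold, and the two extractor callables are hypothetical and exist only for this example.

```python
from typing import Callable, Dict


def extract_with_fallback(
    page_count: int,
    direct_extract: Callable[[int], str],
    ocr_extract: Callable[[int], str],
    min_chars: int = 20,
) -> Dict[int, str]:
    """Return {page_number: text}, using OCR only for pages with little direct text.

    All names are hypothetical; min_chars is an assumed heuristic threshold.
    """
    pages: Dict[int, str] = {}
    for page_number in range(1, page_count + 1):
        text = direct_extract(page_number).strip()
        if len(text) < min_chars:
            # Probably a scanned page without a text layer: fall back to OCR.
            text = ocr_extract(page_number).strip()
        pages[page_number] = text
    return pages


# Stand-in extractors so the sketch runs without any PDF library installed.
_direct = {1: "This page has a real text layer with plenty of characters.", 2: ""}
_ocr = {1: "unused", 2: "Text recovered from the scanned page by OCR."}

fallback_result = extract_with_fallback(2, _direct.__getitem__, _ocr.__getitem__)
for number, page_text in fallback_result.items():
    print(number, page_text)
```

Deciding per page rather than per document keeps OCR, which is the slow step, limited to the pages that need it in mixed documents.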
Metadata Operations
from llamasearch_pdf.core import extract_metadata, extract_text, update_metadata, MetadataManager
# Extract metadata
metadata = extract_metadata('document.pdf')
# Update metadata
new_metadata = {'/Title': 'New Document Title', '/Author': 'Document Author'}
update_metadata('document.pdf', 'updated_document.pdf', new_metadata)
# Advanced metadata operations
manager = MetadataManager()
metadata = manager.extract_metadata('document.pdf')
manager.print_metadata_summary(metadata)
# Extract metadata from text content
text_dict = extract_text('document.pdf')
text_content = "\n".join(text_dict.values())
text_metadata = manager.extract_text_metadata(text_content)
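Text-based metadata detection generally means scanning the extracted text for recognizable patterns. The sketch below shows one way this could work using regular expressions. The patterns, the detect_text_metadata name, and the returned keys are assumptions made for illustration; the rules used by extract_text_metadata are not documented here and may differ.

```python
import re
from typing import Dict, Optional

# Assumed patterns for illustration only.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def detect_text_metadata(text: str) -> Dict[str, Optional[str]]:
    """Guess a few metadata fields from raw text (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    doi = DOI_PATTERN.search(text)
    email = EMAIL_PATTERN.search(text)
    year = YEAR_PATTERN.search(text)
    return {
        # Heuristic: treat the first non-empty line as the title.
        "title": lines[0] if lines else None,
        # Drop sentence punctuation that trails the DOI.
        "doi": doi.group(0).rstrip(".,;") if doi else None,
        "email": email.group(0) if email else None,
        "year": year.group(0) if year else None,
    }


sample = """A Survey of PDF Text Extraction
Jane Doe (jane.doe@example.org)
Published 2021. DOI: 10.1234/example.5678.
"""
detected = detect_text_metadata(sample)
for field, value in detected.items():
    print(f"{field}: {value}")
```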
Search Operations
from llamasearch_pdf.search import create_index, search_pdfs, SearchIndex
# Create a search index
index = create_index(case_sensitive=False)
# Add documents to the index
index.add_document('document1.pdf')
index.add_document('document2.pdf')
# Save the index for later use
index.save('search_index.pkl')
# Load a previously saved index
loaded_index = create_index(index_path='search_index.pkl')
# Search the index
results = loaded_index.search('quantum computing', max_results=10)
# Display search results
for result in results:
    print(f"Document: {result.document_path}, Page: {result.page_number}")
    print(f"Score: {result.score}")
    print(f"Context: {result.snippet}")
    print()
# Quick search without creating an explicit index
results = search_pdfs('neural networks', ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'])
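To make the TF-IDF ranking listed under Features more concrete, the sketch below scores individual pages against a query using a common smoothed TF-IDF variant. It is a generic illustration, not the package's scoring code: the tokenizer, the exact weighting, and the names tokenize, rank_pages, and toy_pages are all assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

PageKey = Tuple[str, int]  # (document path, page number)


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_pages(query: str, pages: Dict[PageKey, str]) -> List[Tuple[PageKey, float]]:
    """Rank pages by the summed TF-IDF weight of the query terms they contain."""
    tokenized = {key: tokenize(text) for key, text in pages.items()}
    n_pages = len(tokenized)
    # Document frequency: the number of pages that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    query_terms = tokenize(query)
    scores: List[Tuple[PageKey, float]] = []
    for key, tokens in tokenized.items():
        if not tokens:
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Smoothed inverse document frequency.
                idf = math.log((1 + n_pages) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores.append((key, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


toy_pages = {
    ("doc1.pdf", 1): "Quantum computing uses qubits. Quantum gates act on qubits.",
    ("doc1.pdf", 2): "Classical computing uses bits and logic gates.",
    ("doc2.pdf", 1): "Neural networks learn from data.",
}
ranking = rank_pages("quantum computing", toy_pages)
for (path, page), score in ranking:
    print(f"{path} page {page}: {score:.3f}")
```

Rarer terms receive a higher IDF, so the page that mentions "quantum" ranks above the page that only shares the more common word "computing".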
PDF Processing
from llamasearch_pdf.core.processor import PDFProcessor
# Initialize the processor
processor = PDFProcessor()
# Convert images to PDFs
pdf_files = processor.batch_convert_images(['image1.jpg', 'image2.png'])
# Merge PDFs
merged_pdf = processor.merge_pdfs(pdf_files, 'merged.pdf')
# Process a directory with PDFs and images
processor.process_directory('documents/', merge=True, optimize=True)
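The batch features described earlier (multi-threaded processing, directory filtering, progress tracking) follow a familiar pattern that can be sketched with the standard library alone. The code below is an illustration of that pattern under stated assumptions, not the behavior of PDFProcessor: process_files, its parameters, and the stand-in worker are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_files(
    directory: Path,
    worker: Callable[[Path], str],
    suffixes: Iterable[str] = (".pdf",),
    recursive: bool = False,
    max_workers: int = 4,
) -> Dict[Path, str]:
    """Run worker on every matching file in a thread pool (hypothetical helper)."""
    pattern = "**/*" if recursive else "*"
    wanted = {suffix.lower() for suffix in suffixes}
    files = sorted(
        path for path in directory.glob(pattern)
        if path.is_file() and path.suffix.lower() in wanted
    )
    results: Dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in files}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                results[path] = f"error: {exc}"
            print(f"[{done}/{len(files)}] {path.name}")
    return results


# Demo on throwaway files; the worker just reports each file's size.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "b.PDF", "notes.txt", "sub/c.pdf"):
        (root / name).write_bytes(b"%PDF-1.4\n")
    flat = process_files(root, lambda p: f"{p.stat().st_size} bytes")
    deep = process_files(root, lambda p: f"{p.stat().st_size} bytes", recursive=True)
    flat_names = sorted(p.name for p in flat)
    deep_names = sorted(p.name for p in deep)
```

Threads suit this workload when each worker spends most of its time in file I/O or in an external OCR process; CPU-bound pure-Python work would benefit more from a process pool.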
Requirements
- Python 3.8+
- For Tesseract OCR: Tesseract must be installed on your system
- For OCRmyPDF: OCRmyPDF must be installed on your system
- For image processing: Poppler must be installed for PDF-to-image conversion
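Because these are system-level dependencies rather than Python packages, it can help to confirm they are on PATH before running OCR jobs. The check below is a small sketch using only the standard library. The executable names are the usual ones (pdftoppm ships with Poppler), but which Poppler utility this package actually invokes is an assumption, and find_tools is a hypothetical helper.

```python
import shutil
from typing import Dict, Optional

# Usual executable names; adjust if your installation differs.
REQUIRED_TOOLS = {
    "Tesseract OCR": "tesseract",
    "OCRmyPDF": "ocrmypdf",
    "Poppler": "pdftoppm",
}


def find_tools(tools: Optional[Dict[str, str]] = None) -> Dict[str, Optional[str]]:
    """Map each tool label to its executable path, or None if not on PATH."""
    tools = REQUIRED_TOOLS if tools is None else tools
    return {label: shutil.which(executable) for label, executable in tools.items()}


tool_paths = find_tools()
for label, path in tool_paths.items():
    print(f"{label}: {path or 'not found on PATH'}")
```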
License
MIT
File details
Details for the file llamasearch_pdf_llamasearch-0.1.0.tar.gz.
File metadata
- Download URL: llamasearch_pdf_llamasearch-0.1.0.tar.gz
- Upload date:
- Size: 65.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e6b75d84d08167a9db715b61a5e333c2f4adf814356dad832c9e83e22e3b60d |
| MD5 | 3affb2493573a5936202163eb20e9fc6 |
| BLAKE2b-256 | a40d6bb992931485f2637c93dbd71d645c80896cee21f502ec05ed2e8fbfbce5 |
File details
Details for the file llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl
- Upload date:
- Size: 51.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 61904c4802b17ef5b37042a7abe687c92d4b20996a00c764a95f6772b56d250b |
| MD5 | a6192273d7f2267efc5a79238c3d7f4c |
| BLAKE2b-256 | 3bbcdbf61824d157cc699b40317a5c2fd9392a54531228ff7319b4deb30c0062 |
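The SHA256 digests listed for each file can be used to check a manually downloaded archive. The sketch below computes a file's SHA256 in chunks and compares it with an expected value. The helper names sha256_of and matches_published_hash are hypothetical, and the demo hashes a throwaway file rather than a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "sample.bin"
    sample_file.write_bytes(b"hello")
    sample_ok = matches_published_hash(sample_file, hashlib.sha256(b"hello").hexdigest())
    sample_bad = matches_published_hash(sample_file, "0" * 64)

print(sample_ok, sample_bad)
```

For a real check, pass the path of the downloaded file together with the SHA256 value listed for that file.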