LlamaSearch PDF

A comprehensive PDF processing toolkit for document processing workflows.

Features

- OCR (Optical Character Recognition) - Extract text from scanned PDFs and images with support for multiple OCR engines:
  - Tesseract OCR
  - OCRmyPDF
  - Hugging Face Models
- PDF Manipulation - Core utilities for working with PDF files:
  - Merging multiple PDFs
  - Splitting PDFs
  - Converting images to PDFs
  - Optimizing PDF file size
- Text Extraction - Extract and process text from PDF documents:
  - Direct text extraction from PDF content streams
  - OCR fallback for scanned documents
  - Text normalization and processing options
- Metadata Management - Extract, update, and manage PDF document metadata:
  - Standard document information (title, author, etc.)
  - XMP metadata extraction
  - Text-based metadata detection
- Search and Indexing - Create searchable indices for PDF documents:
  - Full-text search with TF-IDF ranking
  - Page-level search results with context snippets
  - Search across multiple documents
  - Save and load search indices
- Batch Processing - Process multiple files efficiently:
  - Multi-threaded processing
  - Directory processing with filtering
  - Progress tracking and logging
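
The batch-processing feature (multi-threaded work, directory filtering, progress logging) can be pictured with a small standard-library sketch. This is not the package's implementation: `find_files`, `process_batch`, and the `worker` callback are hypothetical names introduced only for illustration.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger("batch_sketch")


def find_files(root, extensions=(".pdf",), recursive=True):
    """Return files under root whose suffix matches, sorted for a stable order."""
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in extensions}
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in wanted)


def process_batch(paths, worker, max_workers=4):
    """Run worker(path) in a thread pool and return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[path] = exc
                logger.warning("Failed on %s: %s", path, exc)
            logger.info("Progress: %d/%d", done, len(futures))
    return results, errors
```

Collecting failures in a separate dictionary, rather than raising on the first error, lets a long run finish and report which files need attention.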

Installation

pip install llamasearch-pdf

For optional features, install the matching extra. The quotes keep shells such as zsh from interpreting the square brackets:

# For OCRmyPDF support
pip install "llamasearch-pdf[ocrmypdf]"

# For Hugging Face OCR support
pip install "llamasearch-pdf[huggingface]"

# For the search examples
pip install "llamasearch-pdf[search]"

# For all features
pip install "llamasearch-pdf[all]"

Command-Line Usage

The package provides a command-line interface for common operations:

OCR Operations

# List available OCR engines
llamasearch-pdf ocr list-engines

# OCR a single PDF file
llamasearch-pdf ocr file document.pdf --output document_ocr.pdf

# Extract text from a PDF via OCR
llamasearch-pdf ocr file document.pdf --format text --output document_text.txt

# Process a directory of PDFs
llamasearch-pdf ocr directory ./documents --output ./documents_ocr --format pdf --recursive

Text Extraction

# Extract text from a PDF
llamasearch-pdf text extract document.pdf --output document_text.txt

# Extract text with layout preservation
llamasearch-pdf text extract document.pdf --preserve-layout --output document_text.txt

# Batch extract text from a directory
llamasearch-pdf text batch ./documents --output ./extracted_text --recursive

Metadata Operations

# Extract metadata from a PDF
llamasearch-pdf metadata extract document.pdf

# Extract detailed metadata including XMP
llamasearch-pdf metadata extract document.pdf --detailed

# Update metadata in a PDF
llamasearch-pdf metadata update document.pdf --title "New Title" --author "New Author"

# Extract metadata from text content
llamasearch-pdf metadata text document.pdf

# Batch extract metadata from a directory
llamasearch-pdf metadata batch ./documents --output ./metadata --recursive

Search Operations

# Create a search index for a directory of PDFs
llamasearch-pdf search create-index ./documents --output-path ./search_index.pkl

# Search using a saved index
llamasearch-pdf search search "quantum computing" --index-path ./search_index.pkl

# Search specific PDF files directly
llamasearch-pdf search search "neural networks" --files doc1.pdf doc2.pdf doc3.pdf

# Search with custom options and JSON output
llamasearch-pdf search search "machine learning" --index-path ./search_index.pkl --max-results 20 --min-score 0.5 --json
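
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below only uses the flags shown in the search examples above; `build_search_command` and `run_search` are hypothetical helpers, and the shape of the command's output (including what `--json` prints) is not documented here, so the raw text is returned unparsed.

```python
import subprocess


def build_search_command(query, index_path=None, files=None,
                         max_results=None, min_score=None, as_json=False):
    """Assemble the argument list for `llamasearch-pdf search search`."""
    cmd = ["llamasearch-pdf", "search", "search", query]
    if index_path is not None:
        cmd += ["--index-path", str(index_path)]
    if files:
        cmd += ["--files", *map(str, files)]
    if max_results is not None:
        cmd += ["--max-results", str(max_results)]
    if min_score is not None:
        cmd += ["--min-score", str(min_score)]
    if as_json:
        cmd.append("--json")
    return cmd


def run_search(query, **options):
    """Run the CLI and return its raw stdout (the package must be installed)."""
    cmd = build_search_command(query, **options)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing a list to `subprocess.run` avoids shell quoting problems with multi-word queries such as "machine learning".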

Python API Usage

OCR Operations

from llamasearch_pdf.ocr import ocr_pdf, ocr_image, process_directory

# OCR a PDF file
ocr_pdf('document.pdf', 'document_ocr.pdf')

# Extract text from an image
text = ocr_image('scan.jpg')

# Process a directory of PDFs and images
results = process_directory('documents/', 'documents_ocr/', output_format='text')

Text Extraction

from llamasearch_pdf.core import extract_text, TextExtractor

# Simple text extraction
text_dict = extract_text('document.pdf', preserve_layout=True)

# Advanced extraction with custom options
extractor = TextExtractor(
    preserve_layout=True,
    remove_hyphenation=True,
    normalize_whitespace=True,
    fallback_to_ocr=True
)
text_dict = extractor.extract_text_from_pdf('document.pdf')
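
To give a sense of what options like `remove_hyphenation` and `normalize_whitespace` typically mean, here is a minimal regex-based sketch. It is an assumption about the general technique, not the behavior of `TextExtractor`; the two functions below merely reuse the option names for readability.

```python
import re

# A word split as "exam-\nple": hyphen at end of line, lowercase continuation.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(?=[a-z])")
_SPACES = re.compile(r"[ \t]+")
_BLANK_LINES = re.compile(r"\n{3,}")


def remove_hyphenation(text):
    """Join words that were hyphenated across a line break."""
    return _HYPHEN_BREAK.sub(r"\1", text)


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs, trim each line, and cap blank lines at one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _SPACES.sub(" ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    return _BLANK_LINES.sub("\n\n", text).strip()
```

A known limitation of this heuristic: a genuine compound that happens to break at its hyphen (for example "well-" followed by "known" on the next line) is also joined, so the rule trades a few false merges for cleaner running text.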

Metadata Operations

from llamasearch_pdf.core import extract_metadata, extract_text, update_metadata, MetadataManager

# Extract metadata
metadata = extract_metadata('document.pdf')

# Update metadata
new_metadata = {'/Title': 'New Document Title', '/Author': 'Document Author'}
update_metadata('document.pdf', 'updated_document.pdf', new_metadata)

# Advanced metadata operations
manager = MetadataManager()
metadata = manager.extract_metadata('document.pdf')
manager.print_metadata_summary(metadata)

# Extract metadata from text content
text_dict = extract_text('document.pdf')
text_content = "\n".join(text_dict.values())
text_metadata = manager.extract_text_metadata(text_content)
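
The update example above uses slash-prefixed keys such as `/Title` and `/Author`, which are the names used in a PDF's document information dictionary. A small helper can map friendlier keyword names to those keys; `INFO_KEYS` and `to_info_dict` are hypothetical and not part of the package.

```python
# Standard text entries of the PDF document information dictionary.
INFO_KEYS = {
    "title": "/Title",
    "author": "/Author",
    "subject": "/Subject",
    "keywords": "/Keywords",
    "creator": "/Creator",
    "producer": "/Producer",
}


def to_info_dict(**fields):
    """Map friendly names (title=..., author=...) to slash-prefixed PDF keys."""
    unknown = sorted(set(fields) - set(INFO_KEYS))
    if unknown:
        raise KeyError(f"Unsupported metadata field(s): {', '.join(unknown)}")
    # Skip fields explicitly set to None so they are left untouched.
    return {INFO_KEYS[name]: str(value)
            for name, value in fields.items() if value is not None}
```

The resulting dictionary has the same shape as `new_metadata` in the example above, so it could presumably be passed to `update_metadata` in its place.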

Search Operations

from llamasearch_pdf.search import create_index, search_pdfs, SearchIndex

# Create a search index
index = create_index(case_sensitive=False)

# Add documents to the index
index.add_document('document1.pdf')
index.add_document('document2.pdf')

# Save the index for later use
index.save('search_index.pkl')

# Load a previously saved index
loaded_index = create_index(index_path='search_index.pkl')

# Search the index
results = loaded_index.search('quantum computing', max_results=10)

# Display search results
for result in results:
    print(f"Document: {result.document_path}, Page: {result.page_number}")
    print(f"Score: {result.score}")
    print(f"Context: {result.snippet}")
    print()

# Quick search without creating an explicit index
results = search_pdfs('neural networks', ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'])
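
For readers unfamiliar with TF-IDF ranking, the following self-contained sketch scores pages against a query. It illustrates the general technique with a smoothed IDF term and is not the package's `SearchIndex`; `TinyIndex` and `tokenize` are hypothetical names, and the page texts are supplied by hand rather than extracted from PDFs.

```python
import math
import re
from collections import Counter

_TOKEN = re.compile(r"\w+")


def tokenize(text, case_sensitive=False):
    """Split text into word tokens, lowercased unless case_sensitive is set."""
    return _TOKEN.findall(text if case_sensitive else text.lower())


class TinyIndex:
    """Page-level TF-IDF index over (document, page_number, text) entries."""

    def __init__(self):
        self.pages = []            # (document, page_number) per indexed page
        self.term_counts = []      # one Counter of term frequencies per page
        self.page_freq = Counter() # number of pages containing each term

    def add_page(self, document, page_number, text):
        counts = Counter(tokenize(text))
        self.pages.append((document, page_number))
        self.term_counts.append(counts)
        self.page_freq.update(counts.keys())

    def search(self, query, max_results=10):
        """Return (score, document, page_number) tuples, best match first."""
        n_pages = len(self.pages)
        terms = set(tokenize(query))
        scored = []
        for (document, page_number), counts in zip(self.pages, self.term_counts):
            total = sum(counts.values()) or 1
            score = 0.0
            for term in terms:
                if counts[term]:
                    tf = counts[term] / total
                    idf = math.log((1 + n_pages) / (1 + self.page_freq[term])) + 1
                    score += tf * idf
            if score > 0:
                scored.append((score, document, page_number))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:max_results]
```

Terms that appear on few pages receive a higher IDF weight, so a page matching a rare query term outranks one that matches only a common term.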

PDF Processing

from llamasearch_pdf.core.processor import PDFProcessor

# Initialize the processor
processor = PDFProcessor()

# Convert images to PDFs
pdf_files = processor.batch_convert_images(['image1.jpg', 'image2.png'])

# Merge PDFs
merged_pdf = processor.merge_pdfs(pdf_files, 'merged.pdf')

# Process a directory with PDFs and images
processor.process_directory('documents/', merge=True, optimize=True)

Requirements

- Python 3.8+
- Tesseract OCR: Tesseract must be installed on your system
- OCRmyPDF: OCRmyPDF must be installed on your system
- Image processing: Poppler must be installed for PDF-to-image conversion
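
Before running OCR workflows it can be useful to check whether the external tools listed above are on the `PATH`. The sketch below assumes the usual executable names (`tesseract`, `ocrmypdf`, and Poppler's `pdftoppm`); the package itself may locate these tools differently, and `check_system_tools` and `missing_tools` are hypothetical helpers.

```python
import shutil

# Executable names are the common defaults; adjust if your platform differs.
SYSTEM_TOOLS = {
    "Tesseract OCR": "tesseract",
    "OCRmyPDF": "ocrmypdf",
    "Poppler": "pdftoppm",
}


def check_system_tools(tools=SYSTEM_TOOLS):
    """Return {label: resolved path or None} for each external tool."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


def missing_tools(tools=SYSTEM_TOOLS):
    """List the labels of tools that could not be found on the PATH."""
    return [label for label, path in check_system_tools(tools).items()
            if path is None]
```

Calling `missing_tools()` returns an empty list when everything is installed, which makes it easy to print a helpful message before a long batch job starts.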

License

MIT

Project details

Version 0.1.0 (Apr 4, 2025)

Download files

Source Distribution

llamasearch_pdf_llamasearch-0.1.0.tar.gz (65.4 kB), uploaded Apr 4, 2025, Source

Built Distribution

llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl (51.6 kB), uploaded Apr 4, 2025, Python 3

File details

Details for the file llamasearch_pdf_llamasearch-0.1.0.tar.gz.

File metadata

Download URL: llamasearch_pdf_llamasearch-0.1.0.tar.gz
Upload date: Apr 4, 2025
Size: 65.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for llamasearch_pdf_llamasearch-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4e6b75d84d08167a9db715b61a5e333c2f4adf814356dad832c9e83e22e3b60d`
MD5	`3affb2493573a5936202163eb20e9fc6`
BLAKE2b-256	`a40d6bb992931485f2637c93dbd71d645c80896cee21f502ec05ed2e8fbfbce5`

File details

Details for the file llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl.

File metadata

Download URL: llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl
Upload date: Apr 4, 2025
Size: 51.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for llamasearch_pdf_llamasearch-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61904c4802b17ef5b37042a7abe687c92d4b20996a00c764a95f6772b56d250b`
MD5	`a6192273d7f2267efc5a79238c3d7f4c`
BLAKE2b-256	`3bbcdbf61824d157cc699b40317a5c2fd9392a54531228ff7319b4deb30c0062`
