Document preprocessing library for PDF ingestion, rendering, enhancement, and classification

document-preprocessor

A production-ready document preprocessing library for PDF ingestion, rendering, enhancement, and classification. It covers the full preprocessing lifecycle that a PDF goes through before OCR, vision analysis, extraction, and AI processing.

Overview

document-preprocessor is a core component within a larger Document Intelligence platform. It handles:

PDF ingestion and page splitting
High-resolution page rendering
Image enhancement (deskewing, contrast, noise reduction, upscaling)
Page classification for routing
Document complexity analysis
Content deduplication
Complete preprocessing orchestration

Architecture

The library follows Clean Architecture principles with SOLID design, Domain-Driven Design, dependency injection, and async-first APIs.

Core Components

PdfSplitter - Splits PDFs into document-core Page objects
PageRenderer - Renders PDF pages to high-resolution images
ImageEnhancer - Enhances images with deskewing, contrast, noise reduction, and optional upscaling
PageClassifier - Classifies pages for routing (planogram, table, cover, appendix, unknown)
ComplexityAnalyzer - Analyzes document complexity to recommend processing mode
ContentDeduplicator - Detects and removes duplicate pages
PreprocessorPipeline - Orchestrates the complete preprocessing workflow

Processing Flow

PDF Input
    ↓
Phase 1: Split PDF → Pages
    ↓
Phase 2: Deduplicate Pages
    ↓
Phase 3: Render Pages → Images
    ↓
Phase 4: Enhance Images
    ↓
Phase 5: Classify Pages
    ↓
Phase 6: Compute Complexity
    ↓
Phase 7: Cache (optional)
    ↓
PreprocessResult
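
The ordering of the phases can be illustrated with a small, self-contained sketch. It is not the library's implementation: `run_phases` and `render_and_enhance` are hypothetical names, pages are plain strings, and every phase is a stub that only shows where each step sits in the flow.

```python
import asyncio


async def render_and_enhance(index: int, text: str) -> str:
    """Stand-in for phases 3 and 4; returns a fake image path."""
    await asyncio.sleep(0)  # placeholder for real rendering and enhancement work
    return f"page-{index}.png"


async def run_phases(raw_pages: list[str]) -> dict:
    # Phase 1 (split) is stubbed: the input is already a list of page texts.
    # Phase 2: drop exact duplicates, keeping the first occurrence in order.
    pages = list(dict.fromkeys(raw_pages))
    # Phases 3-4: render and enhance all remaining pages concurrently.
    images = await asyncio.gather(
        *(render_and_enhance(i, text) for i, text in enumerate(pages, start=1))
    )
    # Phase 5: trivial placeholder classification (first page is a cover).
    labels = ["cover" if i == 0 else "unknown" for i in range(len(pages))]
    # Phase 6: placeholder complexity, here just the page count capped at 100.
    complexity = float(min(100, len(pages)))
    return {
        "pages": pages,
        "images": list(images),
        "labels": labels,
        "complexity": complexity,
    }
```

Deduplicating before rendering, as in the flow above, means duplicate pages never reach the more expensive rendering and enhancement phases.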

Installation

Requirements

Python >= 3.11
document-core >= 0.1.0
PyMuPDF >= 1.24
pdf2image >= 1.17
opencv-python-headless >= 4.9
Pillow >= 10.0
pydantic >= 2.0

Optional Dependencies

realesrgan - For AI-based image upscaling (install with pip install "pepsico-document-preprocessor[upscale]")

Install from Source

cd document-preprocessor
pip install -e .

Install with Optional Dependencies

pip install -e ".[upscale,dev]"

Configuration

from document_preprocessor import PreprocessorConfig

config = PreprocessorConfig(
    render_dpi=300,
    image_format="png",
    temp_directory="/tmp/document-preprocessor",
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    enable_deduplication=True,
    cache_enabled=True,
    classification_confidence_threshold=0.80,
    complexity_simple_threshold=25,
    complexity_standard_threshold=60,
    max_workers=8,
)

# Or load from environment variables
config = PreprocessorConfig.from_env()

Environment Variables

PREPROCESSOR_RENDER_DPI - Rendering DPI (default: 300)
PREPROCESSOR_IMAGE_FORMAT - Output image format (default: png)
PREPROCESSOR_TEMP_DIR - Temporary directory (default: /tmp/document-preprocessor)
PREPROCESSOR_PARALLEL_RENDER - Enable parallel rendering (default: true)
PREPROCESSOR_PARALLEL_ENHANCE - Enable parallel enhancement (default: true)
PREPROCESSOR_MAX_WORKERS - Maximum worker threads (default: 8)
PREPROCESSOR_ENABLE_DEDUP - Enable deduplication (default: true)
PREPROCESSOR_CACHE_ENABLED - Enable caching (default: true)
PREPROCESSOR_CLASSIFICATION_THRESHOLD - Classification confidence threshold (default: 0.80)
PREPROCESSOR_COMPLEXITY_SIMPLE - Simple complexity threshold (default: 25)
PREPROCESSOR_COMPLEXITY_STANDARD - Standard complexity threshold (default: 60)
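
As a rough illustration of how variables like these can be mapped onto typed settings, here is a minimal sketch using only the standard library. `EnvSettings` and `env_bool` are hypothetical names, only a subset of the variables is covered, and the accepted boolean spellings are an assumption; the actual behaviour of `PreprocessorConfig.from_env()` may differ.

```python
import os
from dataclasses import dataclass

# Assumption: these spellings count as "true"; the library may accept others.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment (case-insensitive)."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in TRUE_VALUES


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical stand-in for PreprocessorConfig; only some fields shown."""

    render_dpi: int
    image_format: str
    max_workers: int
    enable_deduplication: bool
    classification_confidence_threshold: float

    @classmethod
    def from_env(cls) -> "EnvSettings":
        env = os.environ
        # Defaults mirror the values documented in the list above.
        return cls(
            render_dpi=int(env.get("PREPROCESSOR_RENDER_DPI", "300")),
            image_format=env.get("PREPROCESSOR_IMAGE_FORMAT", "png"),
            max_workers=int(env.get("PREPROCESSOR_MAX_WORKERS", "8")),
            enable_deduplication=env_bool("PREPROCESSOR_ENABLE_DEDUP", True),
            classification_confidence_threshold=float(
                env.get("PREPROCESSOR_CLASSIFICATION_THRESHOLD", "0.80")
            ),
        )
```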

Usage

Basic Pipeline Usage

import asyncio
from document_preprocessor import (
    PreprocessorPipeline,
    PdfSplitter,
    PageRenderer,
    ImageEnhancer,
    PageClassifier,
    ComplexityAnalyzer,
    PreprocessorConfig,
)

async def process_pdf(pdf_path: str):
    # Initialize components
    config = PreprocessorConfig()
    splitter = PdfSplitter()
    renderer = PageRenderer(dpi=config.render_dpi, image_format=config.image_format)
    enhancer = ImageEnhancer(temp_directory=config.temp_directory)
    classifier = PageClassifier(confidence_threshold=config.classification_confidence_threshold)
    analyzer = ComplexityAnalyzer(
        simple_threshold=config.complexity_simple_threshold,
        standard_threshold=config.complexity_standard_threshold,
    )
    
    # Create pipeline
    pipeline = PreprocessorPipeline(
        splitter=splitter,
        renderer=renderer,
        enhancer=enhancer,
        classifier=classifier,
        analyzer=analyzer,
        config=config,
    )
    
    # Process PDF
    result = await pipeline.process(pdf_path)
    
    # Access results
    print(f"Processed {len(result.document.pages)} pages")
    print(f"Complexity: {result.complexity.overall_score:.1f}")
    print(f"Recommended mode: {result.complexity.recommended_mode}")
    print(f"Reasoning: {result.complexity.reasoning}")
    
    return result

# Run pipeline
result = asyncio.run(process_pdf("document.pdf"))

Individual Component Usage

PDF Splitting

from document_preprocessor import PdfSplitter

splitter = PdfSplitter()
pages = splitter.split("document.pdf")

# Process in batches
batches = splitter.split_to_batches("document.pdf", batch_size=10)

Page Rendering

from document_preprocessor import PageRenderer

renderer = PageRenderer(dpi=300, image_format="png")
# `page` is one of the Page objects produced by PdfSplitter.split()
image_path = renderer.render(page)

# Batch rendering
image_paths = renderer.render_batch(pages, parallel=True)

Image Enhancement

from document_preprocessor import ImageEnhancer, EnhancerConfig

config = EnhancerConfig(
    enable_deskew=True,
    enable_contrast=True,
    enable_upscale=True,
    enable_binarization=True,
)

enhancer = ImageEnhancer(config=config)
enhanced_path = enhancer.enhance(image_path, current_dpi=150)

Page Classification

from document_preprocessor import PageClassifier

classifier = PageClassifier(confidence_threshold=0.80)
page_type = classifier.classify(page)

# Batch classification
classifications = classifier.classify_batch(pages)

Complexity Analysis

from document_preprocessor import ComplexityAnalyzer

analyzer = ComplexityAnalyzer(simple_threshold=25, standard_threshold=60)
complexity = analyzer.score_document(pages)

print(f"Overall score: {complexity.overall_score}")
print(f"Recommended mode: {complexity.recommended_mode}")
print(f"Reasoning: {complexity.reasoning}")

Content Deduplication

from document_preprocessor import ContentDeduplicator

deduplicator = ContentDeduplicator()

# Find duplicates
duplicates = deduplicator.find_duplicates(pages)

# Remove duplicates
deduplicated_pages = deduplicator.remove_duplicates(pages)
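
This README does not say how duplicates are detected. One common approach is to hash a normalized form of each page's text and compare the digests. The sketch below shows that idea on plain strings; `content_key`, `find_duplicate_indices`, and `drop_duplicates` are hypothetical helpers and not part of `ContentDeduplicator`.

```python
import hashlib


def content_key(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_indices(page_texts: list[str]) -> dict[int, int]:
    """Map each duplicate page index to the index of its first occurrence."""
    first_seen: dict[str, int] = {}
    duplicates: dict[int, int] = {}
    for index, text in enumerate(page_texts):
        key = content_key(text)
        if key in first_seen:
            duplicates[index] = first_seen[key]
        else:
            first_seen[key] = index
    return duplicates


def drop_duplicates(page_texts: list[str]) -> list[str]:
    """Keep only the first occurrence of each page, preserving order."""
    duplicates = find_duplicate_indices(page_texts)
    return [text for i, text in enumerate(page_texts) if i not in duplicates]
```

Exact hashing only catches pages whose normalized text is identical; near-duplicates (for example, scans of the same page) would need a similarity measure instead.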

Page Classification Rules

The classifier uses heuristic rules to categorize pages:

PLANOGRAM: image_area_ratio > 0.60
TABLE: detected_table_regions > 2 and image_area_ratio < 0.30
COVER: page_number == 1 and raw_char_count < 500
APPENDIX: Detected via keyword analysis (appendix, glossary, references, notes)
UNKNOWN: Fallback for unclassified pages
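
The rules above can be sketched as a plain function. This is an illustration under assumptions, not the library's code: `PageKind` and `PageFeatures` are hypothetical stand-ins for the real `PageType` and page metadata, the rules are assumed to be evaluated in the order listed, and the keyword check is a simple substring match.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    """Hypothetical stand-in for the library's PageType enum."""

    PLANOGRAM = "planogram"
    TABLE = "table"
    COVER = "cover"
    APPENDIX = "appendix"
    UNKNOWN = "unknown"


@dataclass
class PageFeatures:
    """Hypothetical container for the signals used by the rules."""

    page_number: int
    image_area_ratio: float
    detected_table_regions: int
    raw_char_count: int
    text: str = ""


APPENDIX_KEYWORDS = ("appendix", "glossary", "references", "notes")


def classify(features: PageFeatures) -> PageKind:
    # Assumption: the first matching rule wins, in the order documented above.
    if features.image_area_ratio > 0.60:
        return PageKind.PLANOGRAM
    if features.detected_table_regions > 2 and features.image_area_ratio < 0.30:
        return PageKind.TABLE
    if features.page_number == 1 and features.raw_char_count < 500:
        return PageKind.COVER
    lowered = features.text.lower()
    if any(keyword in lowered for keyword in APPENDIX_KEYWORDS):
        return PageKind.APPENDIX
    return PageKind.UNKNOWN
```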

Complexity Scoring

Complexity is scored on a scale of 0-100 across three dimensions. The point values listed under each dimension add up to 100:

Layout Score

Image density (40 points)
Shelf regions (30 points)
Region count (20 points)
Mixed layout penalty (10 points)

OCR Score

Small text ratio (40 points)
Rotation (30 points)
Dense annotations (30 points)

Structure Score

Table regions (50 points)
Nested layouts (30 points)
Page position (20 points)
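
The point allocations can be read as weights in a weighted sum. The sketch below assumes each factor is first reduced to a signal strength between 0 and 1 and then scaled by its points; that normalization, the dictionary keys, and the `dimension_score` helper are all assumptions made for illustration. How the library measures each factor and combines the three dimensions into an overall score is not described here.

```python
# Point allocations copied from the lists above; key names are hypothetical.
LAYOUT_WEIGHTS = {
    "image_density": 40,
    "shelf_regions": 30,
    "region_count": 20,
    "mixed_layout": 10,
}
OCR_WEIGHTS = {"small_text_ratio": 40, "rotation": 30, "dense_annotations": 30}
STRUCTURE_WEIGHTS = {"table_regions": 50, "nested_layouts": 30, "page_position": 20}


def dimension_score(weights: dict[str, int], signals: dict[str, float]) -> float:
    """Weighted sum of signals, each clamped to the range [0, 1]."""
    total = 0.0
    for name, points in weights.items():
        strength = min(1.0, max(0.0, signals.get(name, 0.0)))
        total += points * strength
    return total
```

For example, a page with half the maximum image density and the maximum shelf-region signal would get a layout score of 40 * 0.5 + 30 * 1.0 = 50 under these assumptions.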

Mode Selection

FAST: Overall score < 25
BALANCED: 25 <= overall score < 60
HIGH_ACCURACY: Overall score >= 60
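
The thresholds translate directly into a small function. `select_mode` is a hypothetical helper that returns the mode name as a string; its default thresholds match the documented defaults for `complexity_simple_threshold` and `complexity_standard_threshold`.

```python
def select_mode(
    overall_score: float,
    simple_threshold: float = 25,
    standard_threshold: float = 60,
) -> str:
    """Map an overall complexity score (0-100) to a processing mode name."""
    if not 0 <= overall_score <= 100:
        raise ValueError("overall_score must be between 0 and 100")
    if overall_score < simple_threshold:
        return "FAST"
    if overall_score < standard_threshold:
        return "BALANCED"
    return "HIGH_ACCURACY"
```

Note that the boundaries are half-open: a score of exactly 25 is BALANCED and a score of exactly 60 is HIGH_ACCURACY.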

Performance Tuning

Parallel Processing

Enable parallel rendering and enhancement for large documents:

config = PreprocessorConfig(
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    max_workers=16,  # Adjust based on CPU cores
)
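
One way to pick a sensible `max_workers` value is to cap the requested number at the machine's core count. The sketch below shows that idea with the standard library's `ThreadPoolExecutor`; `pick_worker_count` and `process_pages` are hypothetical helpers, and capping at the core count is a conservative assumption that suits CPU-heavy work such as rendering.

```python
import os
from collections.abc import Callable, Iterable
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(requested: int) -> int:
    """Clamp the requested worker count to the range [1, CPU cores]."""
    cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


def process_pages(
    page_numbers: Iterable[int],
    work: Callable[[int], object],
    max_workers: int,
) -> list[object]:
    """Run `work` on every page in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=pick_worker_count(max_workers)) as pool:
        return list(pool.map(work, page_numbers))
```

`pool.map` returns results in the order of the inputs, so page order is preserved even when pages finish out of order.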

Memory Management

For very large PDFs (1000+ pages):

Process in batches using split_to_batches()
Make sure the temp directory has enough free disk space for the rendered images
Monitor memory usage
Use cache to avoid reprocessing
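
The batching idea can be sketched independently of the library. `iter_batches` below is a hypothetical generator that yields fixed-size chunks lazily, so only one batch of items needs to be held at a time; it illustrates the pattern and is not the implementation of `split_to_batches()`.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A caller would process and release each batch before requesting the next one, which keeps peak memory proportional to the batch size rather than to the document length.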

Upscaling

Enable AI-based upscaling for low-DPI documents:

config = EnhancerConfig(
    enable_upscale=True,
    upscale_threshold_dpi=150,
    upscale_factor=2,
)

# Install optional dependency
pip install realesrgan

Extending Classification

Custom CNN Classifier

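# Assumption: Page and PageType are provided by document-core;
# adjust this import path to match your installation.
from document_core import Page, PageType
from document_preprocessor import PageClassifier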

def custom_cnn_classifier(page: Page) -> PageType:
    # Your CNN logic here
    return PageType.PLANOGRAM

classifier = PageClassifier(
    confidence_threshold=0.80,
    classifier_model=custom_cnn_classifier,
)

Custom Classification Rules

Extend the PageClassifier class to add custom rules:

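# Assumption: Page and PageType are provided by document-core;
# adjust this import path to match your installation.
from document_core import Page, PageType
from document_preprocessor.classifier import PageClassifier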

class CustomClassifier(PageClassifier):
    def _classify_heuristic(self, page: Page) -> PageType:
        # Add your custom logic
        if page.metadata.image_area_ratio > 0.80:
            return PageType.PLANOGRAM
        
        return super()._classify_heuristic(page)

Troubleshooting

PDF Splitting Errors

Error: PdfSplitError: Failed to open PDF

Solution: Ensure the PDF file exists and is not corrupted. Verify file permissions.

Rendering Errors

Error: RenderingError: Failed to render page

Solution: Check that PyMuPDF is installed correctly. Verify the PDF is not password-protected.

Enhancement Errors

Error: EnhancementError: Image enhancement failed

Solution: Ensure OpenCV is installed. Check that the image file exists and is readable.

Memory Issues

Error: High memory usage with large PDFs

Solution:

Process in batches
Reduce DPI
Enable deduplication to reduce page count
Increase system memory or use a machine with more RAM

Real-ESRGAN Issues

Error: Real-ESRGAN not available

Solution: Install the optional dependency:

pip install realesrgan

If Real-ESRGAN still cannot be loaded, the library falls back to interpolation-based upscaling.
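
A fallback like this is usually built by checking whether the optional module can be imported. The sketch below shows the general pattern with the standard library; `choose_upscaler` is a hypothetical helper that only reports which backend would be used, and the library's own detection logic may differ.

```python
import importlib.util


def choose_upscaler(module_name: str = "realesrgan") -> str:
    """Return the optional module's name if importable, else 'interpolation'."""
    if importlib.util.find_spec(module_name) is not None:
        return module_name
    return "interpolation"
```

Calling `choose_upscaler()` in the target environment is a quick way to check which path would be taken before processing a large batch.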

Development Guide

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_preprocessor --cov-report=html

# Run specific test file
pytest tests/test_splitter.py

Code Style

# Format code with black
black document_preprocessor/

# Lint with ruff
ruff check document_preprocessor/

# Type check with mypy
mypy document_preprocessor/

Project Structure

document-preprocessor/
├── pyproject.toml
├── README.md
├── document_preprocessor/
│   ├── __init__.py
│   ├── config.py
│   ├── models.py
│   ├── exceptions.py
│   ├── splitter.py
│   ├── renderer.py
│   ├── enhancer.py
│   ├── classifier.py
│   ├── complexity.py
│   ├── dedup.py
│   └── pipeline.py
├── tests/
│   ├── test_splitter.py
│   ├── test_renderer.py
│   ├── test_enhancer.py
│   ├── test_classifier.py
│   ├── test_complexity.py
│   ├── test_dedup.py
│   └── test_pipeline.py
└── docs/

Design Principles

SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
Clean Architecture - Separation of concerns, dependency injection
Domain-Driven Design - Rich domain models, ubiquitous language
Async-First - Non-blocking operations for high throughput
High Performance - ThreadPoolExecutor for parallel processing, lazy image loading
Memory Efficient - Streaming operations, temporary file cleanup
Type Safety - Complete type hints, mypy validation
Pydantic Validation - Strict data validation with ConfigDict
Extensible - Plugin architecture for custom classifiers and enhancers
Testable - Dependency injection, unit tests for all components
Production Observability - Structured logging, error tracking

Dependencies

Internal

document-core - Shared models, enums, interfaces, and utilities

External

PyMuPDF - PDF processing and rendering
pdf2image - PDF to image conversion
opencv-python-headless - Image enhancement
Pillow - Image I/O
pydantic - Data validation

Optional

realesrgan - AI-based image upscaling

License

MIT License - PepsiCo

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.

Project details


This version

0.1.0

Jun 15, 2026

Download files


Source Distribution

pepsico_document_preprocessor-0.1.0.tar.gz (27.4 kB view details)

Uploaded Jun 15, 2026 Source

Built Distribution


pepsico_document_preprocessor-0.1.0-py3-none-any.whl (25.4 kB view details)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file pepsico_document_preprocessor-0.1.0.tar.gz.

File metadata

Download URL: pepsico_document_preprocessor-0.1.0.tar.gz
Upload date: Jun 15, 2026
Size: 27.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for pepsico_document_preprocessor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b528aaebdea4746a98d361cc58dc0754abecde769bfd6402a1ddfe4a4acc7723`
MD5	`097b0c6e59663f0939b9b9cf9c77a6ef`
BLAKE2b-256	`32188e03385baddd65f8d3256fa5315cec9801ee087a2d4a524d6e7cad08f046`


File details

Details for the file pepsico_document_preprocessor-0.1.0-py3-none-any.whl.

File metadata

Download URL: pepsico_document_preprocessor-0.1.0-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for pepsico_document_preprocessor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`141acf04abea6324da5259894fc065a3aea20ac488e224b63d4d98270416aecd`
MD5	`c2376c4a43d1eaf5f30dda09e525f861`
BLAKE2b-256	`9cc2e2f25e1c48bc21b58c83f1872b0726c89a6c668dbf3603657fbd2272b54b`


pepsico-document-preprocessor 0.1.0
