Document preprocessing library for PDF ingestion, rendering, enhancement, and classification

document-preprocessor

A production-ready document preprocessing library for PDF ingestion, rendering, enhancement, and classification. This library provides the complete preprocessing lifecycle for PDF documents before OCR, vision analysis, extraction, and AI processing.

Overview

document-preprocessor is a core component within a larger Document Intelligence platform. It handles:

  • PDF ingestion and page splitting
  • High-resolution page rendering
  • Image enhancement (deskewing, contrast, noise reduction, upscaling)
  • Page classification for routing
  • Document complexity analysis
  • Content deduplication
  • Complete preprocessing orchestration

Architecture

The library follows Clean Architecture principles with SOLID design, Domain-Driven Design, dependency injection, and async-first APIs.

Core Components

  • PdfSplitter - Splits PDFs into document-core Page objects
  • PageRenderer - Renders PDF pages to high-resolution images
  • ImageEnhancer - Enhances images with deskewing, contrast, noise reduction, and optional upscaling
  • PageClassifier - Classifies pages for routing (planogram, table, cover, appendix, unknown)
  • ComplexityAnalyzer - Analyzes document complexity to recommend processing mode
  • ContentDeduplicator - Detects and removes duplicate pages
  • PreprocessorPipeline - Orchestrates the complete preprocessing workflow

Processing Flow

PDF Input
    ↓
Phase 1: Split PDF → Pages
    ↓
Phase 2: Deduplicate Pages
    ↓
Phase 3: Render Pages → Images
    ↓
Phase 4: Enhance Images
    ↓
Phase 5: Classify Pages
    ↓
Phase 6: Compute Complexity
    ↓
Phase 7: Cache (optional)
    ↓
PreprocessResult
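
The flow above can be pictured as an ordered list of phases, each consuming the previous phase's output. The following is a minimal sketch of that idea only: `run_phases` and the lambda phases are hypothetical stand-ins, not the library's PreprocessorPipeline, and running each blocking step in a worker thread is an assumption about how an async-first pipeline might be built.

```python
import asyncio
from typing import Any, Callable, Sequence

# A phase is a (name, function) pair; the function takes the previous output.
Phase = tuple[str, Callable[[Any], Any]]


async def run_phases(initial: Any, phases: Sequence[Phase]) -> tuple[Any, list[str]]:
    """Run phases in order and return the final state plus the completed phase names."""
    state = initial
    completed: list[str] = []
    for name, step in phases:
        # Assumption: each step is blocking, so it runs in a worker thread
        # to keep the event loop responsive.
        state = await asyncio.to_thread(step, state)
        completed.append(name)
    return state, completed


# Hypothetical phases standing in for split, deduplicate, and render.
demo_phases: list[Phase] = [
    ("split", lambda pdf: ["p1", "p1", "p2"]),
    ("deduplicate", lambda pages: list(dict.fromkeys(pages))),
    ("render", lambda pages: [f"{page}.png" for page in pages]),
]

final_state, finished = asyncio.run(run_phases("document.pdf", demo_phases))
```

Keeping phases as plain callables makes it easy to drop an optional phase, such as caching, without changing the loop.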

Installation

Requirements

  • Python >= 3.11
  • document-core >= 0.1.0
  • PyMuPDF >= 1.24
  • pdf2image >= 1.17
  • opencv-python-headless >= 4.9
  • Pillow >= 10.0
  • pydantic >= 2.0

Optional Dependencies

  • realesrgan - For AI-based image upscaling (install with pip install "document-preprocessor[upscale]"; the quotes keep shells such as zsh from expanding the brackets)

Install from Source

cd document-preprocessor
pip install -e .

Install with Optional Dependencies

pip install -e ".[upscale,dev]"

Configuration

from document_preprocessor import PreprocessorConfig

config = PreprocessorConfig(
    render_dpi=300,
    image_format="png",
    temp_directory="/tmp/document-preprocessor",
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    enable_deduplication=True,
    cache_enabled=True,
    classification_confidence_threshold=0.80,
    complexity_simple_threshold=25,
    complexity_standard_threshold=60,
    max_workers=8,
)

# Or load from environment variables
config = PreprocessorConfig.from_env()

Environment Variables

  • PREPROCESSOR_RENDER_DPI - Rendering DPI (default: 300)
  • PREPROCESSOR_IMAGE_FORMAT - Output image format (default: png)
  • PREPROCESSOR_TEMP_DIR - Temporary directory (default: /tmp/document-preprocessor)
  • PREPROCESSOR_PARALLEL_RENDER - Enable parallel rendering (default: true)
  • PREPROCESSOR_PARALLEL_ENHANCE - Enable parallel enhancement (default: true)
  • PREPROCESSOR_MAX_WORKERS - Maximum worker threads (default: 8)
  • PREPROCESSOR_ENABLE_DEDUP - Enable deduplication (default: true)
  • PREPROCESSOR_CACHE_ENABLED - Enable caching (default: true)
  • PREPROCESSOR_CLASSIFICATION_THRESHOLD - Classification confidence threshold (default: 0.80)
  • PREPROCESSOR_COMPLEXITY_SIMPLE - Simple complexity threshold (default: 25)
  • PREPROCESSOR_COMPLEXITY_STANDARD - Standard complexity threshold (default: 60)
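
To make the variable-to-field mapping concrete, here is a rough sketch of environment-based loading with the documented defaults. `EnvConfigSketch` and `_env_bool` are hypothetical names, only a subset of the variables is covered, and the accepted boolean spellings are an assumption; the real PreprocessorConfig.from_env() may parse values differently.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag; assumes 1/true/yes/on (any case) mean True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EnvConfigSketch:
    """Hypothetical stand-in for PreprocessorConfig, not the library's class."""

    render_dpi: int = 300
    image_format: str = "png"
    max_workers: int = 8
    enable_deduplication: bool = True
    classification_confidence_threshold: float = 0.80

    @classmethod
    def from_env(cls) -> "EnvConfigSketch":
        env = os.environ
        # Each lookup falls back to the default listed in the table above.
        return cls(
            render_dpi=int(env.get("PREPROCESSOR_RENDER_DPI", "300")),
            image_format=env.get("PREPROCESSOR_IMAGE_FORMAT", "png"),
            max_workers=int(env.get("PREPROCESSOR_MAX_WORKERS", "8")),
            enable_deduplication=_env_bool("PREPROCESSOR_ENABLE_DEDUP", True),
            classification_confidence_threshold=float(
                env.get("PREPROCESSOR_CLASSIFICATION_THRESHOLD", "0.80")
            ),
        )
```

Calling `EnvConfigSketch.from_env()` with none of the variables set yields the defaults; malformed numeric values raise ValueError, which surfaces misconfiguration early.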

Usage

Basic Pipeline Usage

import asyncio
from document_preprocessor import (
    PreprocessorPipeline,
    PdfSplitter,
    PageRenderer,
    ImageEnhancer,
    PageClassifier,
    ComplexityAnalyzer,
    PreprocessorConfig,
)

async def process_pdf(pdf_path: str):
    # Initialize components
    config = PreprocessorConfig()
    splitter = PdfSplitter()
    renderer = PageRenderer(dpi=config.render_dpi, image_format=config.image_format)
    enhancer = ImageEnhancer(temp_directory=config.temp_directory)
    classifier = PageClassifier(confidence_threshold=config.classification_confidence_threshold)
    analyzer = ComplexityAnalyzer(
        simple_threshold=config.complexity_simple_threshold,
        standard_threshold=config.complexity_standard_threshold,
    )
    
    # Create pipeline
    pipeline = PreprocessorPipeline(
        splitter=splitter,
        renderer=renderer,
        enhancer=enhancer,
        classifier=classifier,
        analyzer=analyzer,
        config=config,
    )
    
    # Process PDF
    result = await pipeline.process(pdf_path)
    
    # Access results
    print(f"Processed {len(result.document.pages)} pages")
    print(f"Complexity: {result.complexity.overall_score:.1f}")
    print(f"Recommended mode: {result.complexity.recommended_mode}")
    print(f"Reasoning: {result.complexity.reasoning}")
    
    return result

# Run pipeline
result = asyncio.run(process_pdf("document.pdf"))

Individual Component Usage

PDF Splitting

from document_preprocessor import PdfSplitter

splitter = PdfSplitter()
pages = splitter.split("document.pdf")

# Process in batches
batches = splitter.split_to_batches("document.pdf", batch_size=10)

Page Rendering

from document_preprocessor import PageRenderer

renderer = PageRenderer(dpi=300, image_format="png")
# `page` is a single Page, e.g. pages[0] from the PDF Splitting example
image_path = renderer.render(page)

# Batch rendering
image_paths = renderer.render_batch(pages, parallel=True)

Image Enhancement

from document_preprocessor import ImageEnhancer, EnhancerConfig

config = EnhancerConfig(
    enable_deskew=True,
    enable_contrast=True,
    enable_upscale=True,
    enable_binarization=True,
)

enhancer = ImageEnhancer(config=config)
# `image_path` is a rendered page image, e.g. the value returned by PageRenderer.render()
enhanced_path = enhancer.enhance(image_path, current_dpi=150)

Page Classification

from document_preprocessor import PageClassifier

classifier = PageClassifier(confidence_threshold=0.80)
page_type = classifier.classify(page)

# Batch classification
classifications = classifier.classify_batch(pages)

Complexity Analysis

from document_preprocessor import ComplexityAnalyzer

analyzer = ComplexityAnalyzer(simple_threshold=25, standard_threshold=60)
complexity = analyzer.score_document(pages)

print(f"Overall score: {complexity.overall_score}")
print(f"Recommended mode: {complexity.recommended_mode}")
print(f"Reasoning: {complexity.reasoning}")

Content Deduplication

from document_preprocessor import ContentDeduplicator

deduplicator = ContentDeduplicator()

# Find duplicates
duplicates = deduplicator.find_duplicates(pages)

# Remove duplicates
deduplicated_pages = deduplicator.remove_duplicates(pages)
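
The README does not say how duplicates are detected. As one possible approach, the sketch below fingerprints each page's whitespace-normalized, lowercased text with SHA-256 and keeps the first occurrence. The function names mirror the methods above for readability, but they operate on plain strings and are hypothetical, not the ContentDeduplicator implementation.

```python
import hashlib


def _fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(page_texts: list[str]) -> dict[int, int]:
    """Map the index of each duplicate page to the index of its first occurrence."""
    first_seen: dict[str, int] = {}
    duplicates: dict[int, int] = {}
    for index, text in enumerate(page_texts):
        key = _fingerprint(text)
        if key in first_seen:
            duplicates[index] = first_seen[key]
        else:
            first_seen[key] = index
    return duplicates


def remove_duplicates(page_texts: list[str]) -> list[str]:
    """Drop later copies while preserving the original page order."""
    duplicates = find_duplicates(page_texts)
    return [text for index, text in enumerate(page_texts) if index not in duplicates]
```

Exact hashing only catches pages whose normalized text is identical; near-duplicates (for example, scans of the same page) would need a perceptual or similarity-based comparison instead.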

Page Classification Rules

The classifier uses heuristic rules to categorize pages:

  • PLANOGRAM: image_area_ratio > 0.60
  • TABLE: detected_table_regions > 2 and image_area_ratio < 0.30
  • COVER: page_number == 1 and raw_char_count < 500
  • APPENDIX: Detected via keyword analysis (appendix, glossary, references, notes)
  • UNKNOWN: Fallback for unclassified pages
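
The rules above can be sketched as a first-match-wins function. The thresholds and keywords come from the list; the order in which rules are checked, the `PageFeatures` container, and the plain substring keyword match are assumptions made for illustration, and `PageTypeSketch` is a stand-in for the real PageType enum.

```python
from dataclasses import dataclass
from enum import Enum


class PageTypeSketch(Enum):
    """Stand-in for the library's PageType enum."""

    PLANOGRAM = "planogram"
    TABLE = "table"
    COVER = "cover"
    APPENDIX = "appendix"
    UNKNOWN = "unknown"


@dataclass
class PageFeatures:
    """Hypothetical container for the signals the rules refer to."""

    page_number: int
    image_area_ratio: float
    detected_table_regions: int
    raw_char_count: int
    text: str = ""


APPENDIX_KEYWORDS = ("appendix", "glossary", "references", "notes")


def classify_heuristic(features: PageFeatures) -> PageTypeSketch:
    # Assumption: rules are evaluated in the listed order and the first match wins.
    if features.image_area_ratio > 0.60:
        return PageTypeSketch.PLANOGRAM
    if features.detected_table_regions > 2 and features.image_area_ratio < 0.30:
        return PageTypeSketch.TABLE
    if features.page_number == 1 and features.raw_char_count < 500:
        return PageTypeSketch.COVER
    lowered = features.text.lower()
    if any(keyword in lowered for keyword in APPENDIX_KEYWORDS):
        return PageTypeSketch.APPENDIX
    return PageTypeSketch.UNKNOWN
```

A bare substring match on a word like "notes" would fire on many ordinary pages, so a practical keyword rule would likely restrict itself to headings or the first lines of a page.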

Complexity Scoring

Complexity is scored on a scale of 0-100 across three dimensions. The factor weights listed for each dimension sum to 100:

Layout Score

  • Image density (40 points)
  • Shelf regions (30 points)
  • Region count (20 points)
  • Mixed layout penalty (10 points)

OCR Score

  • Small text ratio (40 points)
  • Rotation (30 points)
  • Dense annotations (30 points)

Structure Score

  • Table regions (50 points)
  • Nested layouts (30 points)
  • Page position (20 points)
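
One way to read these point tables is as weights on per-factor signals. The sketch below assumes each factor is first normalized to a signal in [0, 1] and then multiplied by its point value; that normalization, the dictionary keys, and `dimension_score` itself are illustrative assumptions, and how the three dimensions combine into the overall score is not specified here.

```python
# Point values copied from the tables above; the key names are hypothetical.
LAYOUT_WEIGHTS = {"image_density": 40, "shelf_regions": 30, "region_count": 20, "mixed_layout": 10}
OCR_WEIGHTS = {"small_text_ratio": 40, "rotation": 30, "dense_annotations": 30}
STRUCTURE_WEIGHTS = {"table_regions": 50, "nested_layouts": 30, "page_position": 20}


def dimension_score(signals: dict[str, float], weights: dict[str, int]) -> float:
    """Sum points * signal for one dimension, clamping each signal to [0, 1]."""
    total = 0.0
    for name, points in weights.items():
        signal = min(max(signals.get(name, 0.0), 0.0), 1.0)
        total += points * signal
    return total


# Example: an image-heavy page with a moderate number of shelf regions.
layout = dimension_score({"image_density": 1.0, "shelf_regions": 0.5}, LAYOUT_WEIGHTS)
```

Because the weights in each dimension sum to 100, a page that saturates every signal scores exactly 100 for that dimension.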

Mode Selection

  • FAST: Overall score < 25
  • BALANCED: 25 <= Overall score < 60
  • HIGH_ACCURACY: Overall score >= 60
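
The mode boundaries amount to two threshold comparisons. This is a small sketch using the default thresholds 25 and 60 (matching complexity_simple_threshold and complexity_standard_threshold); `recommend_mode` is a hypothetical helper that returns plain strings rather than the library's mode type.

```python
def recommend_mode(
    overall_score: float,
    simple_threshold: float = 25,
    standard_threshold: float = 60,
) -> str:
    """Map an overall complexity score (0-100) to a processing mode name."""
    if overall_score < simple_threshold:
        return "FAST"
    if overall_score < standard_threshold:
        return "BALANCED"
    return "HIGH_ACCURACY"
```

Scores exactly on a boundary fall into the higher mode: 25 maps to BALANCED and 60 to HIGH_ACCURACY.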

Performance Tuning

Parallel Processing

Enable parallel rendering and enhancement for large documents:

config = PreprocessorConfig(
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    max_workers=16,  # Adjust based on CPU cores
)
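
For intuition about what max_workers controls, here is a minimal sketch of thread-pool fan-out using the standard library. `render_all` and `fake_render` are hypothetical; the library's own render_batch may schedule work differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def render_all(
    pages: Sequence[int],
    render_one: Callable[[int], str],
    max_workers: int = 8,
) -> list[str]:
    """Apply render_one to every page in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(render_one, pages))


def fake_render(page_number: int) -> str:
    """Placeholder for real rendering work; returns the would-be output path."""
    return f"page_{page_number}.png"


# A starting point for tuning (an assumption, not a library default):
# cap the pool at the number of available CPU cores.
suggested_workers = min(16, os.cpu_count() or 1)
paths = render_all([1, 2, 3], fake_render, max_workers=suggested_workers)
```

Threads help most when the per-page work releases the GIL (native rendering and image libraries typically do); beyond the core count, extra workers mainly add memory pressure.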

Memory Management

For very large PDFs (1000+ pages):

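  • Process in batches using split_to_batches() so only one batch of pages is in flight at a time
  • Make sure the temp directory has enough free disk space for the rendered and enhanced images
  • Monitor memory usage while tuning batch size and max_workers
  • Enable the cache to avoid reprocessing documents that were already handled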
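
The batching advice boils down to never materializing more than one batch at a time. The generator below is a generic sketch of that pattern; `iter_batches` is a hypothetical helper for illustration, not how split_to_batches() is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch lazily; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Example: five page numbers in batches of two.
batches_demo = list(iter_batches(range(5), 2))
```

Processing each batch fully (render, enhance, classify) before pulling the next one keeps peak memory proportional to the batch size rather than the document size.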

Upscaling

Enable AI-based upscaling for low-DPI documents:

config = EnhancerConfig(
    enable_upscale=True,
    upscale_threshold_dpi=150,
    upscale_factor=2,
)

AI-based upscaling requires the optional dependency, installed from the shell:

pip install realesrgan

Extending Classification

Custom CNN Classifier

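from document_preprocessor import PageClassifier
# Assumed import path: Page and PageType are document-core types; adjust if they live elsewhere
from document_core import Page, PageType

def custom_cnn_classifier(page: Page) -> PageType:
    # Your CNN logic here
    return PageType.PLANOGRAM

classifier = PageClassifier(
    confidence_threshold=0.80,
    classifier_model=custom_cnn_classifier,
)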

Custom Classification Rules

Extend the PageClassifier class to add custom rules:

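from document_preprocessor.classifier import PageClassifier
# Assumed import path: Page and PageType are document-core types; adjust if they live elsewhere
from document_core import Page, PageType

class CustomClassifier(PageClassifier):
    def _classify_heuristic(self, page: Page) -> PageType:
        # Custom rule checked first: treat very image-heavy pages as planograms
        if page.metadata.image_area_ratio > 0.80:
            return PageType.PLANOGRAM

        # Otherwise fall back to the built-in heuristic rules
        return super()._classify_heuristic(page)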

Troubleshooting

PDF Splitting Errors

Error: PdfSplitError: Failed to open PDF

Solution: Ensure the PDF file exists and is not corrupted. Verify file permissions.

Rendering Errors

Error: RenderingError: Failed to render page

Solution: Check that PyMuPDF is installed correctly. Verify the PDF is not password-protected.

Enhancement Errors

Error: EnhancementError: Image enhancement failed

Solution: Ensure OpenCV is installed. Check that the image file exists and is readable.

Memory Issues

Symptom: High memory usage with large PDFs

Solution:

  • Process in batches
  • Reduce DPI
  • Enable deduplication to reduce page count
  • Increase system memory or use a machine with more RAM

Real-ESRGAN Issues

Error: Real-ESRGAN not available

Solution: Install the optional dependency:

pip install realesrgan

If Real-ESRGAN still cannot be loaded, the library falls back to interpolation-based upscaling.
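
The fallback behaviour follows a common optional-dependency pattern: probe for the module and pick a backend accordingly. The sketch below shows the pattern only; `pick_upscale_backend` and the backend labels are hypothetical and do not reflect the library's internals.

```python
import importlib.util


def pick_upscale_backend(module_name: str = "realesrgan") -> str:
    """Return "ai" if the optional module is importable, else "interpolation"."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return "ai"
    return "interpolation"


backend = pick_upscale_backend()
```

Probing with find_spec avoids importing a heavy package just to learn whether it exists; the actual import can then be deferred until the first image that needs upscaling.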

Development Guide

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=document_preprocessor --cov-report=html

# Run specific test file
pytest tests/test_splitter.py

Code Style

# Format code with black
black document_preprocessor/

# Lint with ruff
ruff check document_preprocessor/

# Type check with mypy
mypy document_preprocessor/

Project Structure

document-preprocessor/
├── pyproject.toml
├── README.md
├── document_preprocessor/
│   ├── __init__.py
│   ├── config.py
│   ├── models.py
│   ├── exceptions.py
│   ├── splitter.py
│   ├── renderer.py
│   ├── enhancer.py
│   ├── classifier.py
│   ├── complexity.py
│   ├── dedup.py
│   └── pipeline.py
├── tests/
│   ├── test_splitter.py
│   ├── test_renderer.py
│   ├── test_enhancer.py
│   ├── test_classifier.py
│   ├── test_complexity.py
│   ├── test_dedup.py
│   └── test_pipeline.py
└── docs/

Design Principles

  • SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
  • Clean Architecture - Separation of concerns, dependency injection
  • Domain-Driven Design - Rich domain models, ubiquitous language
  • Async-First - Non-blocking operations for high throughput
  • High Performance - ThreadPoolExecutor for parallel processing, lazy image loading
  • Memory Efficient - Streaming operations, temporary file cleanup
  • Type Safety - Complete type hints, mypy validation
  • Pydantic Validation - Strict data validation with ConfigDict
  • Extensible - Plugin architecture for custom classifiers and enhancers
  • Testable - Dependency injection, unit tests for all components
  • Production Observability - Structured logging, error tracking

Dependencies

Internal

  • document-core - Shared models, enums, interfaces, and utilities

External

  • PyMuPDF - PDF processing and rendering
  • pdf2image - PDF to image conversion
  • opencv-python-headless - Image enhancement
  • Pillow - Image I/O
  • pydantic - Data validation

Optional

  • realesrgan - AI-based image upscaling

License

MIT License - PepsiCo

Support

For issues, questions, or contributions, please contact the PepsiCo AI Team.
