Document preprocessing library for PDF ingestion, rendering, enhancement, and classification
document-preprocessor
A production-ready document preprocessing library for PDF ingestion, rendering, enhancement, and classification. This library provides the complete preprocessing lifecycle for PDF documents before OCR, vision analysis, extraction, and AI processing.
Overview
document-preprocessor is a core component within a larger Document Intelligence platform. It handles:
- PDF ingestion and page splitting
- High-resolution page rendering
- Image enhancement (deskewing, contrast, noise reduction, upscaling)
- Page classification for routing
- Document complexity analysis
- Content deduplication
- Complete preprocessing orchestration
Architecture
The library follows Clean Architecture principles with SOLID design, Domain-Driven Design, dependency injection, and async-first APIs.
Core Components
- PdfSplitter - Splits PDFs into document-core Page objects
- PageRenderer - Renders PDF pages to high-resolution images
- ImageEnhancer - Enhances images with deskewing, contrast, noise reduction, and optional upscaling
- PageClassifier - Classifies pages for routing (planogram, table, cover, appendix, unknown)
- ComplexityAnalyzer - Analyzes document complexity to recommend a processing mode
- ContentDeduplicator - Detects and removes duplicate pages
- PreprocessorPipeline - Orchestrates the complete preprocessing workflow
Processing Flow
PDF Input
↓
Phase 1: Split PDF → Pages
↓
Phase 2: Deduplicate Pages
↓
Phase 3: Render Pages → Images
↓
Phase 4: Enhance Images
↓
Phase 5: Classify Pages
↓
Phase 6: Compute Complexity
↓
Phase 7: Cache (optional)
↓
PreprocessResult
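The phase ordering above can be sketched with stand-in phases. This is an illustration only, not the library's PreprocessorPipeline: SketchResult, run_phases, and the string "pages" are hypothetical names invented for the example, and phases 3 to 6 merely record the order in which they would run.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for PreprocessResult; not the library's model."""
    pages: list[str]
    log: list[str] = field(default_factory=list)


async def run_phases(pages: list[str], enable_dedup: bool = True) -> SketchResult:
    result = SketchResult(pages=list(pages))
    # Phase 1 (split) is assumed to have produced `pages` already.
    result.log.append("split")
    if enable_dedup:
        # Phase 2: drop repeated pages while keeping first-seen order.
        result.pages = list(dict.fromkeys(result.pages))
        result.log.append("deduplicate")
    # Phases 3-6 are placeholders that only record their order.
    for phase in ("render", "enhance", "classify", "complexity"):
        await asyncio.sleep(0)  # yield to the event loop, as an async phase would
        result.log.append(phase)
    return result


demo = asyncio.run(run_phases(["p1", "p2", "p1"]))
```

Because deduplication runs before rendering in this flow, duplicate pages are dropped before any rendering or enhancement work is spent on them.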
Installation
Requirements
- Python >= 3.11
- document-core >= 0.1.0
- PyMuPDF >= 1.24
- pdf2image >= 1.17
- opencv-python-headless >= 4.9
- Pillow >= 10.0
- pydantic >= 2.0
Optional Dependencies
- realesrgan - For AI-based image upscaling (install with pip install document-preprocessor[upscale])
Install from Source
cd document-preprocessor
pip install -e .
Install with Optional Dependencies
pip install -e ".[upscale,dev]"
Configuration
from document_preprocessor import PreprocessorConfig
config = PreprocessorConfig(
    render_dpi=300,
    image_format="png",
    temp_directory="/tmp/document-preprocessor",
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    enable_deduplication=True,
    cache_enabled=True,
    classification_confidence_threshold=0.80,
    complexity_simple_threshold=25,
    complexity_standard_threshold=60,
    max_workers=8,
)
# Or load from environment variables
config = PreprocessorConfig.from_env()
Environment Variables
- PREPROCESSOR_RENDER_DPI - Rendering DPI (default: 300)
- PREPROCESSOR_IMAGE_FORMAT - Output image format (default: png)
- PREPROCESSOR_TEMP_DIR - Temporary directory (default: /tmp/document-preprocessor)
- PREPROCESSOR_PARALLEL_RENDER - Enable parallel rendering (default: true)
- PREPROCESSOR_PARALLEL_ENHANCE - Enable parallel enhancement (default: true)
- PREPROCESSOR_MAX_WORKERS - Maximum worker threads (default: 8)
- PREPROCESSOR_ENABLE_DEDUP - Enable deduplication (default: true)
- PREPROCESSOR_CACHE_ENABLED - Enable caching (default: true)
- PREPROCESSOR_CLASSIFICATION_THRESHOLD - Classification confidence threshold (default: 0.80)
- PREPROCESSOR_COMPLEXITY_SIMPLE - Simple complexity threshold (default: 25)
- PREPROCESSOR_COMPLEXITY_STANDARD - Standard complexity threshold (default: 60)
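As a rough sketch of how variables like these can map onto typed settings, the example below reads a few of them with the defaults listed above. EnvSettings and _env_bool are hypothetical names; this is not how PreprocessorConfig.from_env() is implemented, and the set of accepted boolean spellings is an assumption.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical mirror of a few PreprocessorConfig fields."""
    render_dpi: int = 300
    image_format: str = "png"
    parallel_render: bool = True
    max_workers: int = 8
    classification_threshold: float = 0.80

    @classmethod
    def from_env(cls) -> "EnvSettings":
        # Each lookup falls back to the default documented in the list above.
        return cls(
            render_dpi=int(os.environ.get("PREPROCESSOR_RENDER_DPI", "300")),
            image_format=os.environ.get("PREPROCESSOR_IMAGE_FORMAT", "png"),
            parallel_render=_env_bool("PREPROCESSOR_PARALLEL_RENDER", True),
            max_workers=int(os.environ.get("PREPROCESSOR_MAX_WORKERS", "8")),
            classification_threshold=float(
                os.environ.get("PREPROCESSOR_CLASSIFICATION_THRESHOLD", "0.80")
            ),
        )
```

Unset variables fall back to the documented defaults, so exporting only PREPROCESSOR_RENDER_DPI=200 changes the DPI and nothing else.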
Usage
Basic Pipeline Usage
import asyncio
from document_preprocessor import (
    PreprocessorPipeline,
    PdfSplitter,
    PageRenderer,
    ImageEnhancer,
    PageClassifier,
    ComplexityAnalyzer,
    PreprocessorConfig,
)
async def process_pdf(pdf_path: str):
    # Initialize components
    config = PreprocessorConfig()
    splitter = PdfSplitter()
    renderer = PageRenderer(dpi=config.render_dpi, image_format=config.image_format)
    enhancer = ImageEnhancer(temp_directory=config.temp_directory)
    classifier = PageClassifier(confidence_threshold=config.classification_confidence_threshold)
    analyzer = ComplexityAnalyzer(
        simple_threshold=config.complexity_simple_threshold,
        standard_threshold=config.complexity_standard_threshold,
    )
    # Create pipeline
    pipeline = PreprocessorPipeline(
        splitter=splitter,
        renderer=renderer,
        enhancer=enhancer,
        classifier=classifier,
        analyzer=analyzer,
        config=config,
    )
    # Process PDF
    result = await pipeline.process(pdf_path)
    # Access results
    print(f"Processed {len(result.document.pages)} pages")
    print(f"Complexity: {result.complexity.overall_score:.1f}")
    print(f"Recommended mode: {result.complexity.recommended_mode}")
    print(f"Reasoning: {result.complexity.reasoning}")
    return result
# Run pipeline
result = asyncio.run(process_pdf("document.pdf"))
Individual Component Usage
PDF Splitting
from document_preprocessor import PdfSplitter
splitter = PdfSplitter()
pages = splitter.split("document.pdf")
# Process in batches
batches = splitter.split_to_batches("document.pdf", batch_size=10)
Page Rendering
from document_preprocessor import PageRenderer
renderer = PageRenderer(dpi=300, image_format="png")
image_path = renderer.render(page)
# Batch rendering
image_paths = renderer.render_batch(pages, parallel=True)
Image Enhancement
from document_preprocessor import ImageEnhancer, EnhancerConfig
config = EnhancerConfig(
    enable_deskew=True,
    enable_contrast=True,
    enable_upscale=True,
    enable_binarization=True,
)
enhancer = ImageEnhancer(config=config)
enhanced_path = enhancer.enhance(image_path, current_dpi=150)
Page Classification
from document_preprocessor import PageClassifier
classifier = PageClassifier(confidence_threshold=0.80)
page_type = classifier.classify(page)
# Batch classification
classifications = classifier.classify_batch(pages)
Complexity Analysis
from document_preprocessor import ComplexityAnalyzer
analyzer = ComplexityAnalyzer(simple_threshold=25, standard_threshold=60)
complexity = analyzer.score_document(pages)
print(f"Overall score: {complexity.overall_score}")
print(f"Recommended mode: {complexity.recommended_mode}")
print(f"Reasoning: {complexity.reasoning}")
Content Deduplication
from document_preprocessor import ContentDeduplicator
deduplicator = ContentDeduplicator()
# Find duplicates
duplicates = deduplicator.find_duplicates(pages)
# Remove duplicates
deduplicated_pages = deduplicator.remove_duplicates(pages)
Page Classification Rules
The classifier uses heuristic rules to categorize pages:
- PLANOGRAM: image_area_ratio > 0.60
- TABLE: detected_table_regions > 2 and image_area_ratio < 0.30
- COVER: page_number == 1 and raw_char_count < 500
- APPENDIX: Detected via keyword analysis (appendix, glossary, references, notes)
- UNKNOWN: Fallback for unclassified pages
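The rules above can be written as a small function. This is a sketch under stated assumptions rather than the library's classifier: SketchPageType, PageFeatures, and classify are hypothetical names, the rules are checked in the order they are listed (the README does not state a precedence), and the keyword check is a plain substring match.

```python
from dataclasses import dataclass
from enum import Enum


class SketchPageType(Enum):
    """Hypothetical stand-in for the library's PageType enum."""
    PLANOGRAM = "planogram"
    TABLE = "table"
    COVER = "cover"
    APPENDIX = "appendix"
    UNKNOWN = "unknown"


@dataclass
class PageFeatures:
    """Hypothetical container for the features named in the rules."""
    page_number: int
    image_area_ratio: float
    detected_table_regions: int
    raw_char_count: int
    text: str = ""


APPENDIX_KEYWORDS = ("appendix", "glossary", "references", "notes")


def classify(features: PageFeatures) -> SketchPageType:
    # Rules are evaluated top to bottom; the first match wins (assumed order).
    if features.image_area_ratio > 0.60:
        return SketchPageType.PLANOGRAM
    if features.detected_table_regions > 2 and features.image_area_ratio < 0.30:
        return SketchPageType.TABLE
    if features.page_number == 1 and features.raw_char_count < 500:
        return SketchPageType.COVER
    lowered = features.text.lower()
    if any(keyword in lowered for keyword in APPENDIX_KEYWORDS):
        return SketchPageType.APPENDIX
    return SketchPageType.UNKNOWN
```

With this ordering, an image-heavy first page is classified as PLANOGRAM rather than COVER; a different precedence would change that outcome.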
Complexity Scoring
Complexity is scored on a scale of 0-100 across three dimensions. The point budgets listed under each dimension sum to 100:
Layout Score
- Image density (40 points)
- Shelf regions (30 points)
- Region count (20 points)
- Mixed layout penalty (10 points)
OCR Score
- Small text ratio (40 points)
- Rotation (30 points)
- Dense annotations (30 points)
Structure Score
- Table regions (50 points)
- Nested layouts (30 points)
- Page position (20 points)
Mode Selection
- FAST: Overall score < 25
- BALANCED: 25 <= Overall score < 60
- HIGH_ACCURACY: Overall score >= 60
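The threshold logic above amounts to two comparisons. The sketch below uses the default thresholds from the configuration section (25 and 60); recommend_mode is a hypothetical helper that returns plain strings, whereas the library may well use an enum.

```python
def recommend_mode(
    overall_score: float,
    simple_threshold: float = 25,
    standard_threshold: float = 60,
) -> str:
    """Map an overall complexity score (0-100) to a processing mode name."""
    if not 0 <= overall_score <= 100:
        raise ValueError(f"score must be within 0-100, got {overall_score}")
    if overall_score < simple_threshold:
        return "FAST"
    if overall_score < standard_threshold:
        return "BALANCED"
    return "HIGH_ACCURACY"
```

Both boundaries belong to the higher mode: a score of exactly 25 is BALANCED, and exactly 60 is HIGH_ACCURACY.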
Performance Tuning
Parallel Processing
Enable parallel rendering and enhancement for large documents:
config = PreprocessorConfig(
    enable_parallel_rendering=True,
    enable_parallel_enhancement=True,
    max_workers=16,  # Adjust based on CPU cores
)
Memory Management
For very large PDFs (1000+ pages):
- Process in batches using split_to_batches()
- Increase temp directory size
- Monitor memory usage
- Use cache to avoid reprocessing
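The batching idea can be sketched generically: handle a fixed number of pages at a time so that only one batch is held in memory. iter_batches is a hypothetical helper that shows the principle behind split_to_batches(); the real method's return type is not documented here.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example: 7 page numbers in batches of 3; the last batch is shorter.
page_batches = list(iter_batches(list(range(1, 8)), 3))
```

Because the function is a generator, a caller that processes and discards each batch before requesting the next keeps peak memory proportional to batch_size rather than to the page count.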
Upscaling
Enable AI-based upscaling for low-DPI documents:
config = EnhancerConfig(
    enable_upscale=True,
    upscale_threshold_dpi=150,
    upscale_factor=2,
)
# Install optional dependency
pip install realesrgan
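One plausible reading of these settings is that an image is upscaled only when its DPI falls below the threshold. The sketch below encodes that reading; plan_upscale is a hypothetical helper, and the strict "below" comparison is an assumption, since the README does not say how a DPI equal to the threshold is treated.

```python
def plan_upscale(
    current_dpi: int,
    threshold_dpi: int = 150,
    factor: int = 2,
    enabled: bool = True,
) -> int:
    """Return the effective DPI after an optional upscale step."""
    # Assumption: only images strictly below the threshold are upscaled.
    if enabled and current_dpi < threshold_dpi:
        return current_dpi * factor
    return current_dpi
```

Under this reading, a 96 DPI scan becomes 192 DPI with a factor of 2, while a 300 DPI render is left untouched.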
Extending Classification
Custom CNN Classifier
from document_core import Page, PageType  # assumed import path for the document-core models
from document_preprocessor import PageClassifier
def custom_cnn_classifier(page: Page) -> PageType:
    # Your CNN logic here
    return PageType.PLANOGRAM
classifier = PageClassifier(
    confidence_threshold=0.80,
    classifier_model=custom_cnn_classifier,
)
Custom Classification Rules
Extend the PageClassifier class to add custom rules:
from document_core import Page, PageType  # assumed import path for the document-core models
from document_preprocessor.classifier import PageClassifier
class CustomClassifier(PageClassifier):
    def _classify_heuristic(self, page: Page) -> PageType:
        # Add your custom logic
        if page.metadata.image_area_ratio > 0.80:
            return PageType.PLANOGRAM
        # Fall back to the built-in rules
        return super()._classify_heuristic(page)
Troubleshooting
PDF Splitting Errors
Error: PdfSplitError: Failed to open PDF
Solution: Ensure the PDF file exists and is not corrupted. Verify file permissions.
Rendering Errors
Error: RenderingError: Failed to render page
Solution: Check that PyMuPDF is installed correctly. Verify the PDF is not password-protected.
Enhancement Errors
Error: EnhancementError: Image enhancement failed
Solution: Ensure OpenCV is installed. Check that the image file exists and is readable.
Memory Issues
Error: High memory usage with large PDFs
Solution:
- Process in batches
- Reduce DPI
- Enable deduplication to reduce page count
- Increase system memory or use a machine with more RAM
Real-ESRGAN Issues
Error: Real-ESRGAN not available
Solution: Install the optional dependency:
pip install realesrgan
If issues persist, the library will gracefully fall back to interpolation-based upscaling.
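A common way to implement this kind of fallback is to check whether the optional module can be imported before choosing a backend. The sketch below shows that pattern with the standard library only; optional_module_available and choose_upscaler are hypothetical helpers, and the library's own detection logic may differ.

```python
import importlib.util


def optional_module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_upscaler() -> str:
    """Pick a backend label depending on whether realesrgan is installed."""
    if optional_module_available("realesrgan"):
        return "realesrgan"
    return "interpolation"
```

Checking with find_spec avoids importing a heavy dependency just to learn whether it exists; the actual import can then be deferred until upscaling is requested.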
Development Guide
Running Tests
# Run all tests
pytest tests/
# Run with coverage
pytest tests/ --cov=document_preprocessor --cov-report=html
# Run specific test file
pytest tests/test_splitter.py
Code Style
# Format code with black
black document_preprocessor/
# Lint with ruff
ruff check document_preprocessor/
# Type check with mypy
mypy document_preprocessor/
Project Structure
document-preprocessor/
├── pyproject.toml
├── README.md
├── document_preprocessor/
│ ├── __init__.py
│ ├── config.py
│ ├── models.py
│ ├── exceptions.py
│ ├── splitter.py
│ ├── renderer.py
│ ├── enhancer.py
│ ├── classifier.py
│ ├── complexity.py
│ ├── dedup.py
│ └── pipeline.py
├── tests/
│ ├── test_splitter.py
│ ├── test_renderer.py
│ ├── test_enhancer.py
│ ├── test_classifier.py
│ ├── test_complexity.py
│ ├── test_dedup.py
│ └── test_pipeline.py
└── docs/
Design Principles
- SOLID - Single responsibility, open/closed, Liskov substitution, interface segregation, dependency inversion
- Clean Architecture - Separation of concerns, dependency injection
- Domain-Driven Design - Rich domain models, ubiquitous language
- Async-First - Non-blocking operations for high throughput
- High Performance - ThreadPoolExecutor for parallel processing, lazy image loading
- Memory Efficient - Streaming operations, temporary file cleanup
- Type Safety - Complete type hints, mypy validation
- Pydantic Validation - Strict data validation with ConfigDict
- Extensible - Plugin architecture for custom classifiers and enhancers
- Testable - Dependency injection, unit tests for all components
- Production Observability - Structured logging, error tracking
Dependencies
Internal
- document-core - Shared models, enums, interfaces, and utilities
External
- PyMuPDF - PDF processing and rendering
- pdf2image - PDF to image conversion
- opencv-python-headless - Image enhancement
- Pillow - Image I/O
- pydantic - Data validation
Optional
- realesrgan - AI-based image upscaling
License
MIT License - PepsiCo
Support
For issues, questions, or contributions, please contact the PepsiCo AI Team.