PDF OCR Confidence
Optimize your Docling pipeline at scale — Pre-filter PDFs to skip unnecessary OCR processing and route documents to the right backend.
What It Does
Analyzes PDF documents before expensive OCR processing to:
- ✅ Skip Docling entirely for native text PDFs (10x faster)
- ✅ Route to optimal OCR backend (Tesseract vs EasyOCR)
- ✅ Batch triage 10K+ PDFs into priority queues
- ✅ Estimate processing time/cost before committing resources
Why Use This?
Problem: Docling initialization takes 2-3 seconds per PDF. For native text documents, that's pure overhead.
Solution: Pre-analyze PDFs in milliseconds, extract native text directly, and only use Docling when necessary.
Performance Gains
Processing 1000 mixed-quality PDFs:
| Method | Time | Cost |
|---|---|---|
| Without pre-filtering | 50 min | 100% |
| With pre-filtering | 23 min | 46% (54% savings) ✅ |
Assumes 60% native text, 30% high-quality scans, 10% low-quality scans
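Those numbers fall out of a simple back-of-envelope model. The per-step costs below are illustrative assumptions taken from the Performance table further down, not measurements:

```python
DOCLING_INIT_S = 3.0     # assumed per-PDF Docling startup/OCR overhead (upper end of 2-3 s)
ANALYSIS_S = 0.15        # assumed pre-analysis cost per PDF
NATIVE_EXTRACT_S = 0.1   # assumed direct text extraction per native-text PDF

n_pdfs = 1000
native = int(n_pdfs * 0.6)   # 60% native text
scanned = n_pdfs - native    # the rest need OCR

without = n_pdfs * DOCLING_INIT_S
with_filter = n_pdfs * ANALYSIS_S + native * NATIVE_EXTRACT_S + scanned * DOCLING_INIT_S

print(f"without pre-filtering: {without / 60:.1f} min")      # 50.0 min
print(f"with pre-filtering:    {with_filter / 60:.1f} min")  # 23.5 min
```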
Installation
```bash
pip install pdf-ocr-confidence
```
Quick Start: Docling Integration
1. Fast-path for Native Text PDFs
```python
from pdf_ocr_confidence import should_use_docling, extract_native_text

if not should_use_docling("report.pdf"):
    # Skip Docling - extract text directly (10x faster)
    text = extract_native_text("report.pdf")
else:
    # Use Docling for scanned/low-quality PDFs
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert("report.pdf")
    text = result.document.export_to_markdown()
```
2. Batch Triage & Optimal Routing
```python
from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Analyze each PDF
strategy = get_docling_strategy("invoice.pdf")

if not strategy["use_docling"]:
    # Native text - skip Docling
    text = extract_native_text("invoice.pdf")
else:
    if strategy["ocr_backend"] == "tesseract":
        # High quality - use fast OCR
        ocr_options = TesseractOcrOptions()
    else:
        # Low quality - use robust OCR
        ocr_options = EasyOcrOptions()
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
            )
        }
    )
    text = converter.convert("invoice.pdf").document.export_to_markdown()
```
3. Cost Estimation Before Processing
```python
from pdf_ocr_confidence import estimate_processing_time

estimate = estimate_processing_time("large_doc.pdf")
print(f"Expected time: {estimate['time_seconds']}s")
print(f"Recommended backend: {estimate['recommended_backend']}")
print(f"Page count: {estimate['page_count']}")
```
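For batch planning, the per-file estimates can be aggregated before committing a queue. A minimal sketch, assuming the dict fields shown above (the demo values are made up):

```python
def plan_batch(estimates):
    """Sum expected time and count recommended backends across estimate dicts."""
    total = sum(e["time_seconds"] for e in estimates)
    by_backend = {}
    for e in estimates:
        by_backend[e["recommended_backend"]] = by_backend.get(e["recommended_backend"], 0) + 1
    return total, by_backend

# Hypothetical estimates for three PDFs
demo = [
    {"time_seconds": 0.2, "recommended_backend": "native", "page_count": 3},
    {"time_seconds": 12.0, "recommended_backend": "tesseract", "page_count": 10},
    {"time_seconds": 45.0, "recommended_backend": "easyocr", "page_count": 8},
]
total, backends = plan_batch(demo)
print(f"Expected batch time: {total:.1f}s, backends: {backends}")
```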
Advanced Usage
Standalone Analysis (Without Docling)
```python
from pdf_ocr_confidence import analyze_pdf

result = analyze_pdf("document.pdf")
print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")

# Per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f}")
```
Custom Thresholds
```python
from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=5,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)
```
How It Works
- Native Text Detection: checks whether the PDF has an extractable text layer
- Image Quality Analysis:
  - DPI/resolution check
  - Blur detection (Laplacian variance)
  - Contrast analysis (histogram)
  - Edge density (text clarity)
- Confidence Scoring: weighted combination of the metrics above
- Routing Recommendation: native text / Tesseract / EasyOCR
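The blur metric can be sketched in a few lines. This is an illustrative reimplementation, not the library's internals; it assumes a grayscale page rendering as a NumPy array (e.g. from PyMuPDF's `get_pixmap`):

```python
import numpy as np

# 3x3 Laplacian kernel: responds to sharp intensity changes (text edges)
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def laplacian_variance(gray):
    """Variance of the Laplacian response; low values suggest a blurry scan."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

# Sharp synthetic "text" vs a box-blurred copy standing in for a bad scan
rng = np.random.default_rng(0)
sharp = (rng.random((64, 64)) > 0.5).astype(np.float64) * 255
blurred = sum(np.roll(np.roll(sharp, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9

print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```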
Real-World Example
```python
from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter

def process_pdf_batch(pdf_paths):
    native_count = 0
    docling_count = 0
    for pdf_path in pdf_paths:
        strategy = get_docling_strategy(pdf_path)
        if not strategy["use_docling"]:
            # Fast path: skip Docling
            text = extract_native_text(pdf_path)
            native_count += 1
        else:
            # Use Docling with the recommended backend
            result = DocumentConverter().convert(pdf_path)
            text = result.document.export_to_markdown()
            docling_count += 1
    print(f"Processed {native_count} PDFs without Docling (fast)")
    print(f"Processed {docling_count} PDFs with Docling (OCR)")
```
Result: If 60% of your PDFs have native text, you save 60% of Docling initialization overhead.
Examples
See examples/docling_integration.py for:
- Complete pipeline integration
- Batch processing with queues
- Cost estimation
- Priority-based routing
Run it:
```bash
python examples/docling_integration.py your_document.pdf
```
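The queue-based pattern in that example boils down to bucketing strategy dicts into priority queues. A sketch with made-up queue names and demo strategies (`get_docling_strategy` would supply the real ones):

```python
def triage(strategies):
    """Bucket files into processing queues by pre-analysis strategy (illustrative buckets)."""
    queues = {"native": [], "fast_ocr": [], "slow_ocr": []}
    for path, s in strategies.items():
        if not s["use_docling"]:
            queues["native"].append(path)    # extract text directly, no OCR
        elif s["ocr_backend"] == "tesseract":
            queues["fast_ocr"].append(path)  # high-quality scans
        else:
            queues["slow_ocr"].append(path)  # low-quality scans -> EasyOCR
    return queues

demo = {
    "a.pdf": {"use_docling": False, "ocr_backend": None},
    "b.pdf": {"use_docling": True, "ocr_backend": "tesseract"},
    "c.pdf": {"use_docling": True, "ocr_backend": "easyocr"},
}
print(triage(demo))
```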
Use Cases
✅ Large-scale document processing (thousands of PDFs)
✅ Mixed-quality document pipelines (invoices, reports, scans)
✅ Cost optimization for cloud OCR services
✅ Pre-filtering before Docling/Tesseract/EasyOCR
✅ Queue-based batch processing
❌ Small batches (< 100 PDFs) — overhead not worth it
❌ All scanned documents — if everything needs OCR, skip pre-analysis
❌ Single OCR backend — if you only use Tesseract, limited benefit
Performance
| Operation | Time (avg) |
|---|---|
| Native text detection | 0.05s per page |
| Quality analysis | 0.1s per page |
| Docling initialization | 2-3s per PDF |
Breakeven point: If >30% of your PDFs have native text, pre-filtering is worth it.
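That threshold follows from the table above: with the default 5-page sample, analysis costs about 5 × 0.15 s per PDF, while each skipped Docling run saves roughly 2.5 s (illustrative arithmetic, not a measurement):

```python
analysis_cost = 5 * (0.05 + 0.1)  # 5 sampled pages x (text detection + quality analysis)
docling_saving = 2.5              # midpoint of the 2-3 s initialization saved per native PDF

breakeven_fraction = analysis_cost / docling_saving
print(f"breakeven native-text fraction: {breakeven_fraction:.0%}")  # 30%
```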
Dependencies
- PyMuPDF (fitz) - PDF text extraction and rendering
- Pillow - Image processing
- numpy - Numerical operations
- opencv-python - Advanced image analysis
Roadmap
- Batch API for parallel analysis
- Rotation/skew detection
- Language detection for OCR model selection
- Table/form detection heuristics
- Integration examples for AWS Textract, Google Vision
Contributing
PRs welcome! Focus areas:
- Better quality heuristics
- Docling integration patterns
- Performance benchmarks
- Real-world use cases
License
MIT
Links
- PyPI: https://pypi.org/project/pdf-ocr-confidence/
- GitHub: (coming soon)
- Docling: https://github.com/DS4SD/docling