
PDF OCR Confidence

Optimize your Docling pipeline at scale — Pre-filter PDFs to skip unnecessary OCR processing and route documents to the right backend.

What It Does

Analyzes PDF documents before expensive OCR processing to:

  • Skip Docling entirely for native text PDFs (10x faster)
  • Route to optimal OCR backend (Tesseract vs EasyOCR)
  • Batch triage 10K+ PDFs into priority queues
  • Estimate processing time/cost before committing resources

Why Use This?

Problem: Docling initialization takes 2-3 seconds per PDF. For native text documents, that's pure overhead.

Solution: Pre-analyze PDFs in milliseconds, extract native text directly, and only use Docling when necessary.

Performance Gains

Processing 1000 mixed-quality PDFs:

Method                  Time    Cost
Without pre-filtering   50 min  100% (baseline)
With pre-filtering      23 min  46% (54% faster ✅)

Assumes 60% native text, 30% high-quality scans, 10% low-quality scans
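A back-of-envelope check reproduces the table's numbers. The constants below are assumptions derived from the table (Docling averaging ~3 s per PDF, i.e. 50 min / 1000 PDFs, and ~0.15 s of pre-analysis per PDF), not independent measurements:

```python
# Back-of-envelope check of the benchmark above.
# Assumptions: ~3 s Docling time per PDF, ~0.15 s pre-analysis
# per PDF, and a 60% native-text share.
TOTAL_PDFS = 1000
DOCLING_S = 3.0
ANALYZE_S = 0.15
NATIVE_SHARE = 0.60

baseline_min = TOTAL_PDFS * DOCLING_S / 60                 # every PDF through Docling
filtered_min = (TOTAL_PDFS * ANALYZE_S                     # analyze everything
                + TOTAL_PDFS * (1 - NATIVE_SHARE) * DOCLING_S) / 60  # OCR only the scans

print(f"baseline: {baseline_min:.0f} min, pre-filtered: {filtered_min:.1f} min")
```

With these assumptions the pre-filtered total lands at ~22.5 min, matching the ~23 min / 54% figure above.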

Installation

pip install pdf-ocr-confidence

Quick Start: Docling Integration

1. Fast-path for Native Text PDFs

from pdf_ocr_confidence import should_use_docling, extract_native_text

if not should_use_docling("report.pdf"):
    # Skip Docling - extract text directly (10x faster)
    text = extract_native_text("report.pdf")
else:
    # Use Docling for scanned/low-quality PDFs
    from docling.document_converter import DocumentConverter
    doc = DocumentConverter().convert("report.pdf")
    text = doc.export_to_markdown()

2. Batch Triage & Optimal Routing

from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PipelineOptions

# Analyze each PDF
strategy = get_docling_strategy("invoice.pdf")

if not strategy["use_docling"]:
    # Native text - skip Docling
    text = extract_native_text("invoice.pdf")
elif strategy["ocr_backend"] == "tesseract":
    # High quality - use fast OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="tesseract"
        )
    ).convert("invoice.pdf")
else:
    # Low quality - use robust OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="easyocr"
        )
    ).convert("invoice.pdf")

3. Cost Estimation Before Processing

from pdf_ocr_confidence import estimate_processing_time

estimate = estimate_processing_time("large_doc.pdf")
print(f"Expected time: {estimate['time_seconds']}s")
print(f"Recommended backend: {estimate['recommended_backend']}")
print(f"Page count: {estimate['page_count']}")

Advanced Usage

Standalone Analysis (Without Docling)

from pdf_ocr_confidence import analyze_pdf

result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")

# Per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f}")

Custom Thresholds

from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=5,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)

How It Works

  1. Native Text Detection: Checks if PDF has extractable text layer
  2. Image Quality Analysis:
    • DPI/Resolution check
    • Blur detection (Laplacian variance)
    • Contrast analysis (histogram)
    • Edge density (text clarity)
  3. Confidence Scoring: Weighted combination of metrics
  4. Routing Recommendation: Native text / Tesseract / EasyOCR
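The image-quality metrics in step 2 can be sketched with plain NumPy. The 4-neighbour Laplacian kernel, the 500.0 "sharp text" normalizer, and the weights below are illustrative assumptions, not the library's actual values:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur metric: variance of the 4-neighbour Laplacian (low = blurry)."""
    g = gray.astype(float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def contrast_spread(gray: np.ndarray) -> float:
    """Contrast metric: intensity standard deviation, scaled to [0, 1]."""
    return min(float(gray.std()) / 127.5, 1.0)

def page_confidence(gray: np.ndarray,
                    w_blur: float = 0.6, w_contrast: float = 0.4) -> float:
    """Weighted combination of the metrics into one [0, 1] score.

    The 500.0 normalizer ("variance of a sharp text page") and the
    weights are assumptions for illustration only.
    """
    blur = min(laplacian_variance(gray) / 500.0, 1.0)
    return w_blur * blur + w_contrast * contrast_spread(gray)

# A flat grey page scores near 0; a high-contrast pattern scores high.
flat = np.full((64, 64), 128, dtype=np.uint8)
checker = (np.indices((64, 64)).sum(axis=0) % 2 * 255).astype(np.uint8)
print(page_confidence(flat), page_confidence(checker))
```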

Real-World Example

from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter

def process_pdf_batch(pdf_paths):
    native_count = 0
    docling_count = 0
    
    for pdf_path in pdf_paths:
        strategy = get_docling_strategy(pdf_path)
        
        if not strategy["use_docling"]:
            # Fast path: skip Docling
            text = extract_native_text(pdf_path)
            native_count += 1
        else:
            # Use Docling (configure the OCR backend from strategy as in example 2)
            doc = DocumentConverter().convert(pdf_path)
            text = doc.export_to_markdown()
            docling_count += 1
    
    print(f"Processed {native_count} PDFs without Docling (fast)")
    print(f"Processed {docling_count} PDFs with Docling (OCR)")

Result: If 60% of your PDFs have native text, you save 60% of Docling initialization overhead.
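A batch API is still on the roadmap, but the analysis phase parallelizes easily. A minimal sketch using the standard library's ThreadPoolExecutor; the strategy function is injected (in real use you would pass get_docling_strategy) so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def triage(pdf_paths, strategy_fn, max_workers=8):
    """Partition PDFs into routing queues, analyzing them in parallel.

    strategy_fn (e.g. pdf_ocr_confidence.get_docling_strategy) must
    return a dict with 'use_docling' and 'ocr_backend' keys, matching
    the shape used in the examples above.
    """
    queues = {"native": [], "tesseract": [], "easyocr": []}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, s in zip(pdf_paths, pool.map(strategy_fn, pdf_paths)):
            key = "native" if not s["use_docling"] else s["ocr_backend"]
            queues[key].append(path)
    return queues

# Demo with a stand-in strategy; real code would pass get_docling_strategy.
fake = lambda p: {"use_docling": p.endswith(".scan.pdf"),
                  "ocr_backend": "tesseract"}
print(triage(["a.pdf", "b.scan.pdf", "c.pdf"], fake))
```

Threads are a reasonable fit here because the per-PDF analysis is short and I/O-bound; swap in ProcessPoolExecutor if profiling shows it is CPU-bound.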

Examples

See examples/docling_integration.py for:

  • Complete pipeline integration
  • Batch processing with queues
  • Cost estimation
  • Priority-based routing

Run it:

python examples/docling_integration.py your_document.pdf

Use Cases

Good fits:

  • Large-scale document processing (thousands of PDFs)
  • Mixed-quality document pipelines (invoices, reports, scans)
  • Cost optimization for cloud OCR services
  • Pre-filtering before Docling/Tesseract/EasyOCR
  • Queue-based batch processing

Poor fits:

  • Small batches (< 100 PDFs): the analysis overhead isn't worth it
  • All scanned documents: if everything needs OCR, skip pre-analysis
  • Single OCR backend: if you only ever use Tesseract, routing adds little

Performance

Operation               Time (avg)
Native text detection   0.05 s per page
Quality analysis        0.1 s per page
Docling initialization  2-3 s per PDF

Breakeven point: If >30% of your PDFs have native text, pre-filtering is worth it.
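That rule of thumb follows from the timings above. A hedged sketch, assuming 5 sampled pages per PDF, quality analysis running only on PDFs without a native text layer, and ~2.5 s of Docling time saved per native-text PDF (all assumptions, chosen from the table's ranges):

```python
# Breakeven sketch using the per-page timings from the table above.
SAMPLE_PAGES = 5
DETECT_S = 0.05 * SAMPLE_PAGES    # native-text check, paid for every PDF
QUALITY_S = 0.10 * SAMPLE_PAGES   # quality analysis, scanned PDFs only
DOCLING_S = 2.5                   # init time saved per native-text PDF

def net_saving(p_native: float) -> float:
    """Expected seconds saved per PDF at a given native-text share."""
    return p_native * DOCLING_S - (DETECT_S + (1 - p_native) * QUALITY_S)

# Solving net_saving(p) == 0 gives the breakeven native-text share:
breakeven = (DETECT_S + QUALITY_S) / (DOCLING_S + QUALITY_S)
print(f"breakeven at ~{breakeven:.0%} native-text PDFs")
```

With these constants the breakeven lands at ~25%, consistent with the ~30% rule of thumb given the 2-3 s spread in Docling initialization time.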

Dependencies

  • PyMuPDF (fitz) - PDF text extraction and rendering
  • Pillow - Image processing
  • numpy - Numerical operations
  • opencv-python - Advanced image analysis

Roadmap

  • Batch API for parallel analysis
  • Rotation/skew detection
  • Language detection for OCR model selection
  • Table/form detection heuristics
  • Integration examples for AWS Textract, Google Vision

Contributing

PRs welcome! Focus areas:

  • Better quality heuristics
  • Docling integration patterns
  • Performance benchmarks
  • Real-world use cases

License

MIT
