
PDF OCR Confidence

Optimize your Docling pipeline at scale — Pre-filter PDFs to skip unnecessary OCR processing and route documents to the right backend.

What It Does

Analyzes PDF documents before expensive OCR processing to:

  • Skip Docling entirely for native text PDFs (10x faster)
  • Route to optimal OCR backend (Tesseract vs EasyOCR)
  • Batch triage 10K+ PDFs into priority queues
  • Estimate processing time/cost before committing resources

Why Use This?

Problem: Docling initialization takes 2-3 seconds per PDF. For native text documents, that's pure overhead.

Solution: Pre-analyze PDFs in milliseconds, extract native text directly, and only use Docling when necessary.

Performance Gains

Processing 1000 mixed-quality PDFs:

Method                  Time    Cost
Without pre-filtering   50 min  100% (baseline)
With pre-filtering      23 min  46% (54% faster ✅)

Assumes 60% native text, 30% high-quality scans, 10% low-quality scans
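A back-of-envelope check reproduces the table's numbers. The constants below are assumptions derived from the table (Docling averaging ~3 s per PDF, i.e. 50 min / 1000 PDFs, and ~0.15 s of pre-analysis per PDF), not independent measurements:

```python
# Back-of-envelope check of the benchmark above.
# Assumptions: ~3 s Docling time per PDF, ~0.15 s pre-analysis
# per PDF, and a 60% native-text share.
TOTAL_PDFS = 1000
DOCLING_S = 3.0
ANALYZE_S = 0.15
NATIVE_SHARE = 0.60

baseline_min = TOTAL_PDFS * DOCLING_S / 60                 # every PDF through Docling
filtered_min = (TOTAL_PDFS * ANALYZE_S                     # analyze everything
                + TOTAL_PDFS * (1 - NATIVE_SHARE) * DOCLING_S) / 60  # OCR only the scans

print(f"baseline: {baseline_min:.0f} min, pre-filtered: {filtered_min:.1f} min")
```

With these assumptions the pre-filtered total lands at ~22.5 min, matching the ~23 min / 54% figure above.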

Installation

pip install pdf-ocr-confidence

Quick Start: Docling Integration

1. Fast-path for Native Text PDFs

from pdf_ocr_confidence import should_use_docling, extract_native_text

if not should_use_docling("report.pdf"):
    # Skip Docling - extract text directly (10x faster)
    text = extract_native_text("report.pdf")
else:
    # Use Docling for scanned/low-quality PDFs
    from docling.document_converter import DocumentConverter
    doc = DocumentConverter().convert("report.pdf")
    text = doc.export_to_markdown()

2. Batch Triage & Optimal Routing

from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PipelineOptions

# Analyze each PDF
strategy = get_docling_strategy("invoice.pdf")

if not strategy["use_docling"]:
    # Native text - skip Docling
    text = extract_native_text("invoice.pdf")
elif strategy["ocr_backend"] == "tesseract":
    # High quality - use fast OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="tesseract"
        )
    ).convert("invoice.pdf")
else:
    # Low quality - use robust OCR
    doc = DocumentConverter(
        pipeline_options=PipelineOptions(
            do_ocr=True,
            ocr_backend="easyocr"
        )
    ).convert("invoice.pdf")

3. Cost Estimation Before Processing

from pdf_ocr_confidence import estimate_processing_time

estimate = estimate_processing_time("large_doc.pdf")
print(f"Expected time: {estimate['time_seconds']}s")
print(f"Recommended backend: {estimate['recommended_backend']}")
print(f"Page count: {estimate['page_count']}")

Advanced Usage

Standalone Analysis (Without Docling)

from pdf_ocr_confidence import analyze_pdf

result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")

# Per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f}")

Custom Thresholds

from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=5,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)

How It Works

  1. Native Text Detection: Checks if PDF has extractable text layer
  2. Image Quality Analysis:
    • DPI/Resolution check
    • Blur detection (Laplacian variance)
    • Contrast analysis (histogram)
    • Edge density (text clarity)
  3. Confidence Scoring: Weighted combination of metrics
  4. Routing Recommendation: Native text / Tesseract / EasyOCR
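The image-quality metrics in step 2 can be sketched with plain NumPy. The 4-neighbour Laplacian kernel, the 500.0 "sharp text" normalizer, and the weights below are illustrative assumptions, not the library's actual values:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Blur metric: variance of the 4-neighbour Laplacian (low = blurry)."""
    g = gray.astype(float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def contrast_spread(gray: np.ndarray) -> float:
    """Contrast metric: intensity standard deviation, scaled to [0, 1]."""
    return min(float(gray.std()) / 127.5, 1.0)

def page_confidence(gray: np.ndarray,
                    w_blur: float = 0.6, w_contrast: float = 0.4) -> float:
    """Weighted combination of the metrics into one [0, 1] score.

    The 500.0 normalizer ("variance of a sharp text page") and the
    weights are assumptions for illustration only.
    """
    blur = min(laplacian_variance(gray) / 500.0, 1.0)
    return w_blur * blur + w_contrast * contrast_spread(gray)

# A flat grey page scores near 0; a high-contrast pattern scores high.
flat = np.full((64, 64), 128, dtype=np.uint8)
checker = (np.indices((64, 64)).sum(axis=0) % 2 * 255).astype(np.uint8)
print(page_confidence(flat), page_confidence(checker))
```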

Real-World Example

from pdf_ocr_confidence import get_docling_strategy, extract_native_text
from docling.document_converter import DocumentConverter

def process_pdf_batch(pdf_paths):
    native_count = 0
    docling_count = 0
    
    for pdf_path in pdf_paths:
        strategy = get_docling_strategy(pdf_path)
        
        if not strategy["use_docling"]:
            # Fast path: skip Docling
            text = extract_native_text(pdf_path)
            native_count += 1
        else:
            # Use Docling (configure the OCR backend from strategy as in example 2)
            doc = DocumentConverter().convert(pdf_path)
            text = doc.export_to_markdown()
            docling_count += 1
    
    print(f"Processed {native_count} PDFs without Docling (fast)")
    print(f"Processed {docling_count} PDFs with Docling (OCR)")

Result: If 60% of your PDFs have native text, you save 60% of Docling initialization overhead.
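A batch API is still on the roadmap, but the analysis phase parallelizes easily. A minimal sketch using the standard library's ThreadPoolExecutor; the strategy function is injected (in real use you would pass get_docling_strategy) so the sketch stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def triage(pdf_paths, strategy_fn, max_workers=8):
    """Partition PDFs into routing queues, analyzing them in parallel.

    strategy_fn (e.g. pdf_ocr_confidence.get_docling_strategy) must
    return a dict with 'use_docling' and 'ocr_backend' keys, matching
    the shape used in the examples above.
    """
    queues = {"native": [], "tesseract": [], "easyocr": []}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, s in zip(pdf_paths, pool.map(strategy_fn, pdf_paths)):
            key = "native" if not s["use_docling"] else s["ocr_backend"]
            queues[key].append(path)
    return queues

# Demo with a stand-in strategy; real code would pass get_docling_strategy.
fake = lambda p: {"use_docling": p.endswith(".scan.pdf"),
                  "ocr_backend": "tesseract"}
print(triage(["a.pdf", "b.scan.pdf", "c.pdf"], fake))
```

Threads are a reasonable fit here because the per-PDF analysis is short and I/O-bound; swap in ProcessPoolExecutor if profiling shows it is CPU-bound.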

Examples

See examples/docling_integration.py for:

  • Complete pipeline integration
  • Batch processing with queues
  • Cost estimation
  • Priority-based routing

Run it:

python examples/docling_integration.py your_document.pdf

Use Cases

Good fits:

  • Large-scale document processing (thousands of PDFs)
  • Mixed-quality document pipelines (invoices, reports, scans)
  • Cost optimization for cloud OCR services
  • Pre-filtering before Docling/Tesseract/EasyOCR
  • Queue-based batch processing

Poor fits:

  • Small batches (< 100 PDFs): the analysis overhead isn't worth it
  • All scanned documents: if everything needs OCR, skip pre-analysis
  • Single OCR backend: if you only ever use Tesseract, routing adds little

Performance

Operation               Time (avg)
Native text detection   0.05 s per page
Quality analysis        0.1 s per page
Docling initialization  2-3 s per PDF

Breakeven point: If >30% of your PDFs have native text, pre-filtering is worth it.
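That rule of thumb follows from the timings above. A hedged sketch, assuming 5 sampled pages per PDF, quality analysis running only on PDFs without a native text layer, and ~2.5 s of Docling time saved per native-text PDF (all assumptions, chosen from the table's ranges):

```python
# Breakeven sketch using the per-page timings from the table above.
SAMPLE_PAGES = 5
DETECT_S = 0.05 * SAMPLE_PAGES    # native-text check, paid for every PDF
QUALITY_S = 0.10 * SAMPLE_PAGES   # quality analysis, scanned PDFs only
DOCLING_S = 2.5                   # init time saved per native-text PDF

def net_saving(p_native: float) -> float:
    """Expected seconds saved per PDF at a given native-text share."""
    return p_native * DOCLING_S - (DETECT_S + (1 - p_native) * QUALITY_S)

# Solving net_saving(p) == 0 gives the breakeven native-text share:
breakeven = (DETECT_S + QUALITY_S) / (DOCLING_S + QUALITY_S)
print(f"breakeven at ~{breakeven:.0%} native-text PDFs")
```

With these constants the breakeven lands at ~25%, consistent with the ~30% rule of thumb given the 2-3 s spread in Docling initialization time.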

Dependencies

  • PyMuPDF (fitz) - PDF text extraction and rendering
  • Pillow - Image processing
  • numpy - Numerical operations
  • opencv-python - Advanced image analysis

Roadmap

  • Batch API for parallel analysis
  • Rotation/skew detection
  • Language detection for OCR model selection
  • Table/form detection heuristics
  • Integration examples for AWS Textract, Google Vision

Contributing

PRs welcome! Focus areas:

  • Better quality heuristics
  • Docling integration patterns
  • Performance benchmarks
  • Real-world use cases

License

MIT
