Estimate OCR reading confidence for PDF files before running expensive OCR

PDF OCR Confidence

A Python library to estimate OCR reading confidence for PDF files before running expensive OCR models.

What It Does

Analyzes a PDF and returns a confidence score between 0 and 1 that estimates how well OCR is likely to perform on it. Use the score to:

  • Route low-quality PDFs to expensive/robust OCR models
  • Route high-quality PDFs to cheap/fast OCR models
  • Skip OCR entirely for native text PDFs

Features

  • ✅ Detects a native text layer (confidence 1.0 immediately, so OCR can be skipped)
  • ✅ Image quality analysis (resolution, blur, contrast, noise)
  • ✅ Text region detection and edge analysis
  • ✅ Per-page and document-level confidence scores
  • ✅ Configurable thresholds for routing decisions
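
As a rough illustration of the native text check listed above, the sketch below separates the decision from the PDF reading. It is not the library's implementation: the function names and the `min_chars_per_page` cutoff are assumptions made for this example. Only `fitz.open` and `page.get_text()` are real PyMuPDF calls.

```python
from typing import Iterable, List


def has_native_text(page_texts: Iterable[str], min_chars_per_page: int = 50) -> bool:
    """Return True when every page carries a plausible amount of extractable text.

    min_chars_per_page is an illustrative cutoff, not a value taken from the library.
    """
    texts = list(page_texts)
    if not texts:
        # No pages means there is nothing to skip OCR for.
        return False
    return all(len(text.strip()) >= min_chars_per_page for text in texts)


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the text layer of each page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so has_native_text stays dependency-free

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

With PyMuPDF installed, `has_native_text(extract_page_texts("document.pdf"))` would tell you whether OCR can be skipped under these assumptions. Requiring text on every page is a conservative choice; a mixed document (some scanned pages, some native) would still be sent to OCR.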

Installation

# Run from the root of a source checkout (editable install)
pip install -e .

Quick Start

from pdf_ocr_confidence import analyze_pdf

# Analyze a PDF
result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")  # "native", "cheap_ocr", "expensive_ocr"

# Access per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f} - {page.quality_metrics}")
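
One way to act on the recommendation is a small dispatch table keyed by the three recommendation strings shown above. This is a sketch only: the handler functions are hypothetical placeholders for whatever text extraction and OCR backends you use, and they are not part of this library.

```python
from typing import Callable, Dict


# Hypothetical handlers; replace the bodies with calls to your own backends.
def extract_text_layer(path: str) -> str:
    return f"text-layer:{path}"


def run_cheap_ocr(path: str) -> str:
    return f"cheap-ocr:{path}"


def run_expensive_ocr(path: str) -> str:
    return f"expensive-ocr:{path}"


# Keys are the recommendation values listed in the Quick Start comment.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "native": extract_text_layer,
    "cheap_ocr": run_cheap_ocr,
    "expensive_ocr": run_expensive_ocr,
}


def process(path: str, recommendation: str) -> str:
    """Send a PDF to the handler that matches its recommendation."""
    try:
        handler = HANDLERS[recommendation]
    except KeyError:
        raise ValueError(f"unknown recommendation: {recommendation!r}") from None
    return handler(path)
```

In the Quick Start above this would be called as `process("document.pdf", result.recommendation)`.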

How It Works

  1. Native Text Detection: Checks whether the PDF has an extractable text layer
  2. Image Quality Analysis:
    • DPI/Resolution check
    • Blur detection (Laplacian variance)
    • Contrast analysis (histogram)
    • Edge density (text clarity)
  3. Confidence Scoring: Weighted combination of metrics
  4. Routing Recommendation: Based on configurable thresholds
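
To make steps 2 and 3 more concrete, here is a dependency-free sketch of two of the metrics and of a weighted combination. It is an illustration under stated assumptions, not the library's code: the function names, the percentile cutoffs, and the weights are invented for this example, and the library's actual formulas are not documented here. A real implementation would more likely use OpenCV (for example `cv2.Laplacian(gray, cv2.CV_64F).var()`) on a rendered page image. DPI and edge density are left out to keep the sketch short.

```python
from statistics import pvariance
from typing import Dict, List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def laplacian_variance(img: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest a blurry image. Expects an image of at least 3x3 pixels.
    """
    height, width = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def contrast_spread(img: Gray) -> float:
    """Distance between the 5th and 95th intensity percentiles, scaled to 0-1."""
    pixels = sorted(p for row in img for p in row)
    low = pixels[int(0.05 * (len(pixels) - 1))]
    high = pixels[int(0.95 * (len(pixels) - 1))]
    return (high - low) / 255.0


def clamp01(value: float) -> float:
    return max(0.0, min(1.0, value))


def combine(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metrics that are already scaled to 0-1."""
    total_weight = sum(weights.values())
    return sum(weights[name] * clamp01(metrics[name]) for name in weights) / total_weight
```

The Laplacian variance is unbounded, so it has to be scaled before it goes into `combine`, for example `clamp01(laplacian_variance(img) / reference)` with a `reference` value calibrated on your own scans. That calibration constant is also an assumption of this sketch.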

Configuration

from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=3,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)
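
The two thresholds can be read as a simple decision rule, sketched below. The comments in the configuration only say what happens below `expensive_ocr_threshold` and above `cheap_ocr_threshold`; how scores between the two are handled is not stated, so the `middle_band` parameter (and its conservative default) is an assumption of this sketch, as is the `recommend` function itself.

```python
def recommend(
    confidence: float,
    expensive_ocr_threshold: float = 0.4,
    cheap_ocr_threshold: float = 0.7,
    has_native_text: bool = False,
    middle_band: str = "expensive_ocr",  # assumption: behaviour between thresholds is unspecified
) -> str:
    """Map a 0-1 confidence score to one of the three recommendation strings."""
    if has_native_text:
        # A usable text layer makes OCR unnecessary regardless of image quality.
        return "native"
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if expensive_ocr_threshold > cheap_ocr_threshold:
        raise ValueError("expensive_ocr_threshold must not exceed cheap_ocr_threshold")
    if confidence < expensive_ocr_threshold:
        return "expensive_ocr"
    if confidence > cheap_ocr_threshold:
        return "cheap_ocr"
    return middle_band
```

Defaulting the middle band to the robust model trades cost for accuracy; pass `middle_band="cheap_ocr"` if throughput matters more for your workload.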

Dependencies

  • PyMuPDF (fitz) - PDF text extraction and rendering
  • Pillow - Image processing
  • numpy - Numerical operations
  • opencv-python - Advanced image analysis

License

MIT
