Estimate OCR reading confidence for PDF files before running expensive OCR

These details have not been verified by PyPI

Project description

PDF OCR Confidence

A Python library to estimate OCR reading confidence for PDF files before running expensive OCR models.

What It Does

Analyzes PDF documents and returns a confidence score (0-1) indicating how well OCR will perform, helping you:

Route low-quality PDFs to expensive/robust OCR models
Route high-quality PDFs to cheap/fast OCR models
Skip OCR entirely for native text PDFs

Features

✅ Detects native text layer (immediate confidence of 1.0, skip OCR)
✅ Image quality analysis (resolution, blur, contrast, noise)
✅ Text region detection and edge analysis
✅ Per-page and document-level confidence scores
✅ Configurable thresholds for routing decisions

Installation

pip install pdf-ocr-confidence

# Or, from a local checkout of the source:
pip install -e .

Quick Start

from pdf_ocr_confidence import analyze_pdf

# Analyze a PDF
result = analyze_pdf("document.pdf")

print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}")  # "native", "cheap_ocr", "expensive_ocr"

# Access per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f} - {page.quality_metrics}")
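One way to act on the recommendation is a small dispatch table keyed by the three strings listed above. This is a minimal sketch, not part of the library: `route`, the handler lambdas, and `fake_result` are hypothetical names, and `SimpleNamespace` stands in for the object returned by `analyze_pdf` so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def route(result, handlers):
    """Call the handler registered for result.recommendation."""
    try:
        handler = handlers[result.recommendation]
    except KeyError:
        raise ValueError(
            f"unexpected recommendation: {result.recommendation!r}"
        ) from None
    return handler(result)


# Placeholder handlers (hypothetical); real ones would extract the text
# layer or call the OCR engine of your choice.
handlers = {
    "native": lambda r: "extract text layer",
    "cheap_ocr": lambda r: "run fast OCR",
    "expensive_ocr": lambda r: "run robust OCR",
}

# Stand-in for the result of analyze_pdf("document.pdf").
fake_result = SimpleNamespace(confidence=0.35, recommendation="expensive_ocr")
action = route(fake_result, handlers)
```

Raising on an unknown recommendation string keeps a future change in the returned values from silently falling through to the wrong OCR path.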

How It Works

1. Native Text Detection: Checks whether the PDF has an extractable text layer
2. Image Quality Analysis:
   - DPI/Resolution check
   - Blur detection (Laplacian variance)
   - Contrast analysis (histogram)
   - Edge density (text clarity)
3. Confidence Scoring: Weighted combination of metrics
4. Routing Recommendation: Based on configurable thresholds
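The blur check and the scoring step can be illustrated with a short, dependency-free sketch. It is not the library's implementation: `laplacian_variance`, `combine_metrics`, the metric names, and the weights are all assumptions made for illustration. With opencv-python, the same sharpness measure is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`, which handles image borders differently from the interior-only loop here.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a list of equal-length rows of grayscale intensities.
    Sharp images give large values; blurred or flat images give values near 0.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses) if responses else 0.0


def combine_metrics(metrics, weights):
    """Weighted average of metric scores, each already scaled to [0, 1]."""
    total_weight = sum(weights[name] for name in metrics)
    score = sum(metrics[name] * weights[name] for name in metrics) / total_weight
    return max(0.0, min(1.0, score))


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"resolution": 0.3, "sharpness": 0.3, "contrast": 0.2, "edges": 0.2}

# A uniform image has no edges; a checkerboard is maximally sharp.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]

page_score = combine_metrics(
    {"resolution": 1.0, "sharpness": 0.5, "contrast": 0.5, "edges": 0.0},
    WEIGHTS,
)
```

In practice the raw Laplacian variance would need to be mapped onto a 0-1 scale before it is combined with the other metrics; the mapping is left out here because the source does not specify one.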

Configuration

from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig

config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=3,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)

result = analyze_pdf("document.pdf", config=config)
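The two thresholds can be read as a simple decision rule. The sketch below is an assumption about how such a rule might look, not the library's code: `recommend` and its `has_text_layer` and `middle_band` parameters are hypothetical. In particular, the comments above do not say what happens to scores between the two thresholds, so the sketch makes that choice an explicit parameter and defaults to the more conservative path.

```python
def recommend(
    confidence,
    has_text_layer=False,
    expensive_ocr_threshold=0.4,
    cheap_ocr_threshold=0.7,
    middle_band="expensive_ocr",  # assumption: behaviour between thresholds
):
    """Map a confidence score in [0, 1] to a routing label."""
    if has_text_layer:
        # A native text layer means OCR can be skipped entirely.
        return "native"
    if confidence > cheap_ocr_threshold:
        return "cheap_ocr"
    if confidence < expensive_ocr_threshold:
        return "expensive_ocr"
    # Scores between the thresholds: not specified above, so configurable here.
    return middle_band


labels = [recommend(score) for score in (0.2, 0.5, 0.9)]
```

Raising `cheap_ocr_threshold` sends more documents to the robust model, which trades cost for accuracy.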

Dependencies

PyMuPDF (fitz) - PDF text extraction and rendering
Pillow - Image processing
numpy - Numerical operations
opencv-python - Advanced image analysis

License

MIT

Project details

Release history

0.2.0 - Apr 6, 2026
0.1.0 - Apr 6, 2026 (this version)

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

pdf_ocr_confidence-0.1.0-py3-none-any.whl (11.0 kB)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file pdf_ocr_confidence-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf_ocr_confidence-0.1.0-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 11.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.1.dev29+g03adccea5 CPython/3.11.4

File hashes

Hashes for pdf_ocr_confidence-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c9fb79610241107501c0b67cae01c940c9faa8b240ec636397b8e97dbc7ab77`
MD5	`e4da35124453d9bbfe08da00f89a588c`
BLAKE2b-256	`b3877f2a6f1b868fde8226713e93fbc95a7b77950edfea9456670d67be028322`
