Estimate OCR reading confidence for PDF files before running expensive OCR
PDF OCR Confidence
A Python library to estimate OCR reading confidence for PDF files before running expensive OCR models.
What It Does
Analyzes PDF documents and returns a confidence score (0-1) indicating how well OCR will perform, helping you:
- Route low-quality PDFs to expensive/robust OCR models
- Route high-quality PDFs to cheap/fast OCR models
- Skip OCR entirely for native text PDFs
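The routing idea above can be sketched as a small dispatch table keyed on the recommendation string. The three labels (`"native"`, `"cheap_ocr"`, `"expensive_ocr"`) are the ones shown in Quick Start; the handler functions and the `route` helper are hypothetical stand-ins, not part of this library.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for real extraction back ends (not part of the library).
def read_text_layer(path: str) -> str:
    return f"text layer of {path}"


def run_fast_ocr(path: str) -> str:
    return f"fast OCR of {path}"


def run_robust_ocr(path: str) -> str:
    return f"robust OCR of {path}"


# Map each recommendation label to the back end that should handle it.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "native": read_text_layer,
    "cheap_ocr": run_fast_ocr,
    "expensive_ocr": run_robust_ocr,
}


def route(path: str, recommendation: str) -> str:
    """Send a PDF to the back end matching its recommendation."""
    # Unknown labels fall back to the robust model (a choice made by this sketch).
    handler = HANDLERS.get(recommendation, run_robust_ocr)
    return handler(path)
```

In real use, `recommendation` would come from `analyze_pdf(path).recommendation`, and each handler would call an actual OCR or text-extraction service.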
Features
- ✅ Detects a native text layer (immediate confidence of 1.0, so OCR can be skipped)
- ✅ Image quality analysis (resolution, blur, contrast, noise)
- ✅ Text region detection and edge analysis
- ✅ Per-page and document-level confidence scores
- ✅ Configurable thresholds for routing decisions
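As an illustration of the native text layer check, here is a minimal sketch that decides whether a document already carries usable text. It works on plain strings, one per page, so it runs without any PDF library. The helper names, `min_chars`, and `min_coverage` are assumptions of this sketch and may not match how the library decides.

```python
from typing import Iterable


def text_layer_coverage(page_texts: Iterable[str], min_chars: int = 20) -> float:
    """Fraction of pages whose extracted text has at least `min_chars`
    non-whitespace characters."""
    texts = list(page_texts)
    if not texts:
        return 0.0
    # Strip all whitespace before counting so blank or near-blank pages do not pass.
    pages_with_text = sum(
        1 for text in texts if len("".join(text.split())) >= min_chars
    )
    return pages_with_text / len(texts)


def has_native_text(
    page_texts: Iterable[str], min_chars: int = 20, min_coverage: float = 1.0
) -> bool:
    """True when enough pages carry real text to skip OCR (hypothetical rule)."""
    return text_layer_coverage(page_texts, min_chars) >= min_coverage
```

With PyMuPDF, the per-page strings could be collected with `[page.get_text() for page in fitz.open(path)]` and passed straight to `has_native_text`.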
Installation
pip install -e .
Quick Start
from pdf_ocr_confidence import analyze_pdf
# Analyze a PDF
result = analyze_pdf("document.pdf")
print(f"Confidence: {result.confidence:.2f}")
print(f"Recommendation: {result.recommendation}") # "native", "cheap_ocr", "expensive_ocr"
# Access per-page details
for page in result.pages:
    print(f"Page {page.number}: {page.confidence:.2f} - {page.quality_metrics}")
How It Works
- Native Text Detection: Checks if PDF has extractable text layer
- Image Quality Analysis:
  - DPI/Resolution check
  - Blur detection (Laplacian variance)
  - Contrast analysis (histogram)
  - Edge density (text clarity)
- Confidence Scoring: Weighted combination of metrics
- Routing Recommendation: Based on configurable thresholds
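The image quality metrics and the weighted score listed above can be sketched with NumPy alone. This is not the library's implementation: the 4-neighbour Laplacian, the percentile-based contrast measure, the gradient threshold, the normalisation constants, and the weights are all assumptions chosen for illustration. The input is a grayscale page image as a 2-D `uint8` array.

```python
import numpy as np

# Hypothetical normalisation constants and weights (not taken from the library).
BLUR_FULL_SCORE = 500.0   # Laplacian variance treated as "fully sharp"
EDGE_FULL_SCORE = 0.10    # edge-pixel fraction treated as "plenty of text edges"
WEIGHTS = {"blur": 0.35, "contrast": 0.25, "edges": 0.20, "resolution": 0.20}


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian; low values suggest blur."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def contrast_spread(gray: np.ndarray) -> float:
    """Gap between the 5th and 95th intensity percentiles, scaled to 0..1."""
    low, high = np.percentile(gray, [5, 95])
    return float((high - low) / 255.0)


def edge_density(gray: np.ndarray, threshold: float = 40.0) -> float:
    """Fraction of pixels whose gradient magnitude exceeds `threshold`."""
    grad_rows, grad_cols = np.gradient(gray.astype(np.float64))
    return float((np.hypot(grad_rows, grad_cols) > threshold).mean())


def page_confidence(gray: np.ndarray, dpi: float, min_dpi: float = 150.0) -> float:
    """Weighted combination of the four metrics, clamped to 0..1."""
    metrics = {
        "blur": min(laplacian_variance(gray) / BLUR_FULL_SCORE, 1.0),
        "contrast": min(contrast_spread(gray), 1.0),
        "edges": min(edge_density(gray) / EDGE_FULL_SCORE, 1.0),
        # Assumption: twice the minimum DPI earns full marks for resolution.
        "resolution": min(dpi / (2.0 * min_dpi), 1.0),
    }
    score = sum(WEIGHTS[name] * value for name, value in metrics.items())
    return float(min(max(score, 0.0), 1.0))
```

A crisp, high-contrast scan at 300 DPI scores near the top of the range under these constants, while a flat or blurred low-resolution page scores near the bottom. The constants would need tuning against real OCR results before the score could be trusted for routing.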
Configuration
from pdf_ocr_confidence import analyze_pdf, ConfidenceConfig
config = ConfidenceConfig(
    expensive_ocr_threshold=0.4,  # Below this = expensive OCR
    cheap_ocr_threshold=0.7,      # Above this = cheap OCR
    sample_pages=3,               # Pages to analyze (None = all)
    min_dpi=150,                  # Minimum acceptable DPI
)
result = analyze_pdf("document.pdf", config=config)
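To make the two thresholds concrete, here is a rough sketch of how a score might map to a recommendation label. The README does not say what happens for scores between the two thresholds, so the `middle_band` parameter is a hypothetical addition of this sketch, defaulting to the robust model. The `recommend` function and its `has_text_layer` flag are likewise assumptions.

```python
def recommend(
    confidence: float,
    has_text_layer: bool = False,
    expensive_ocr_threshold: float = 0.4,
    cheap_ocr_threshold: float = 0.7,
    middle_band: str = "expensive_ocr",  # hypothetical: not defined by the README
) -> str:
    """Map a 0..1 confidence score to a routing label."""
    if expensive_ocr_threshold > cheap_ocr_threshold:
        raise ValueError("expensive_ocr_threshold must not exceed cheap_ocr_threshold")
    if has_text_layer:
        return "native"          # text can be extracted directly, no OCR needed
    if confidence < expensive_ocr_threshold:
        return "expensive_ocr"   # below the lower threshold
    if confidence > cheap_ocr_threshold:
        return "cheap_ocr"       # above the upper threshold
    return middle_band           # in between: behaviour assumed by this sketch
```

Defaulting the middle band to the robust model trades cost for accuracy. Passing `middle_band="cheap_ocr"` would make the opposite trade.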
Dependencies
- PyMuPDF (fitz) - PDF text extraction and rendering
- Pillow - Image processing
- numpy - Numerical operations
- opencv-python - Advanced image analysis
License
MIT