A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR
OCR Detection Library
A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.
Features
- Page Type Detection: Automatically classifies PDF pages as text, scanned, mixed, or empty
- Parallel Processing: Fast analysis of large PDFs using multi-threading
- Confidence Scoring: Reliability indicators for classifications
- Simple API: Easy-to-use interface with minimal complexity
Installation
# Clone or download the project
cd ocr-detection
# Install with uv (recommended)
uv sync
# Or install with pip
pip install -e .
Usage
Quick Start
from ocr_detection import detect_ocr
# Analyze a PDF document
result = detect_ocr("document.pdf")
print(result)
# Output: {"status": "partial", "pages": [1, 3, 7, 12]}
# Check the status
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")
Using the OCRDetection Class
from ocr_detection import OCRDetection
# Initialize detector with options
detector = OCRDetection(
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True              # Enable parallel processing
)
# Analyze a document
result = detector.detect("document.pdf")
# With custom parallel settings
result = detector.detect("large_document.pdf", max_workers=4)
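When several files need checking, the same call can be wrapped in a small helper. The sketch below is an illustration only and not part of the library: `group_by_status` is a hypothetical name, and it accepts any callable that returns the documented result dictionary, so `detector.detect` or `detect_ocr` could be passed in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_status(
    paths: Iterable[str],
    detect: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Group file paths by the "status" field of their detection result.

    `detect` is any callable returning {"status": ..., "pages": [...]}.
    This helper is hypothetical and not provided by ocr_detection.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        result = detect(path)
        groups[result["status"]].append(path)
    return dict(groups)
```

For example, `group_by_status(["a.pdf", "b.pdf"], detector.detect)` would return a dictionary keyed by `"true"`, `"false"`, or `"partial"`, each mapping to the files with that status.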
Understanding Results
The library returns a simple dictionary with two fields:
- status: Indicates the OCR requirement
  - "true": All pages need OCR processing
  - "false": No pages need OCR processing
  - "partial": Some pages need OCR processing
- pages: List of page numbers (1-indexed) that need OCR processing
  - Empty list when status is "false"
  - Contains all page numbers when status is "true"
  - Contains specific page numbers when status is "partial"
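Because the result lists only the pages that need OCR, separating a document into OCR pages and text pages takes one extra input. A minimal sketch, assuming the caller already knows the total page count (the result dictionary does not include it); `split_pages` is a hypothetical helper, not a library function:

```python
from typing import List, Tuple


def split_pages(result: dict, total_pages: int) -> Tuple[List[int], List[int]]:
    """Return (ocr_pages, text_pages) as sorted 1-indexed page numbers.

    `total_pages` must be supplied by the caller, since the result
    dictionary only lists the pages that need OCR.
    """
    ocr_pages = sorted(set(result["pages"]))
    if any(p < 1 or p > total_pages for p in ocr_pages):
        raise ValueError("page number outside 1..total_pages")
    needs_ocr = set(ocr_pages)
    text_pages = [p for p in range(1, total_pages + 1) if p not in needs_ocr]
    return ocr_pages, text_pages
```

The first list can be sent to an OCR engine, while the second can go straight to a regular text extractor.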
Examples
from ocr_detection import detect_ocr
# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}
# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}
# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}
# Example 4: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True)
Performance
The library automatically optimizes performance based on document size:
- Documents with ≤10 pages use sequential processing
- Larger documents use parallel processing with configurable worker threads
- Parallel processing provides 3-8x performance improvement for large documents
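The size-based switch described above can be sketched with the standard library's `ThreadPoolExecutor`. This is an illustration of the general pattern, not the library's actual implementation: `analyze_pages`, `classify_page`, and `SEQUENTIAL_LIMIT` are hypothetical names, and the per-page classifier is passed in by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")

# Assumed cut-off, taken from the "≤10 pages" rule stated above.
SEQUENTIAL_LIMIT = 10


def analyze_pages(
    pages: Sequence[P],
    classify_page: Callable[[P], R],
    max_workers: Optional[int] = None,
) -> List[R]:
    """Classify every page, using threads only for larger documents."""
    if len(pages) <= SEQUENTIAL_LIMIT:
        # Small documents: a plain loop avoids thread start-up overhead.
        return [classify_page(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever the completion order.
        return list(pool.map(classify_page, pages))
```

Keeping results in page order means the caller can pair each classification with its page number without extra bookkeeping.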
License
MIT License - see LICENSE file for details