OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.

NEW in v0.3.0: Smart Image Extraction delivers 5x faster processing for scanned PDFs with 33% less memory usage. This release also adds a 40x faster default processing mode and optimized parallel processing for large documents.

Features

  • Page Type Detection: Automatically classifies PDF pages as text, scanned, mixed, or empty
  • Smart Image Extraction: 5x faster image processing for scanned PDFs using embedded images
  • Base64 Image Output: Get page images as base64-encoded strings for visualization
  • Dual Processing Modes: Fast mode (40x faster) for speed, accuracy mode for precision
  • Parallel Processing: Fast analysis of large PDFs using multi-threading (up to 8x speedup)
  • Confidence Scoring: Reliability indicators for classifications
  • Memory Efficient: 33% reduction in memory usage with optimized image handling
  • Simple API: Easy-to-use interface with minimal complexity
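Page type detection of this kind can be approximated with a simple heuristic based on how much extractable text a page has and how much of it is covered by raster images. The sketch below is illustrative only; the function name and threshold values are assumptions, not the library's actual implementation:

```python
def classify_page(text_chars: int, image_area_ratio: float) -> str:
    """Toy heuristic: classify a PDF page from its extractable text length
    and the fraction of the page area covered by raster images."""
    has_text = text_chars >= 50          # enough characters to count as text
    has_image = image_area_ratio >= 0.5  # images cover most of the page
    if has_text and has_image:
        return "mixed"
    if has_text:
        return "text"
    if has_image:
        return "scanned"
    return "empty"
```

A page classified as "scanned" or "mixed" would be a candidate for OCR; a "text" page would not.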

Installation

# Install from PyPI
pip install ocr-detection

# Or work from source: clone or download the project, then
cd ocr-detection
uv sync   # install dependencies with uv (recommended for development)

Usage

Quick Start

from ocr_detection import detect_ocr

# RECOMMENDED: Serverless mode with images - optimal for most use cases
# (12-17s for 1000+ pages, includes optimized images for OCR processing)
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)

# RECOMMENDED: Serverless mode for classification only - ultra-fast
# (sub-2 seconds for 1000+ pages, no images)
result = detect_ocr("document.pdf", serverless_mode=True)

# Traditional fast mode - 40x faster than accuracy mode
result = detect_ocr("document.pdf")

# Accuracy mode - slowest but most precise
result = detect_ocr("document.pdf", accuracy_mode=True)

print(result)
# Example output: {"status": "partial", "pages": [1, 3, 7, 12]}

# Check the status
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")

Recommended Usage (Serverless Optimized)

For Google Cloud Functions/Run and other serverless environments:

from ocr_detection import detect_ocr, OCRDetection

# Option 1: Quick function call with images (RECOMMENDED)
# Perfect balance of speed and functionality
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)
# Performance: 12-17s for 1000+ pages with optimized images

# Option 2: Classification only (ultra-fast)
# When you only need to know which pages need OCR
result = detect_ocr("document.pdf", serverless_mode=True)
# Performance: sub-2 seconds for 1000+ pages

# Option 3: Class-based approach
detector = OCRDetection(serverless_mode=True, include_images=True)
result = detector.detect("document.pdf")

Using the OCRDetection Class

from ocr_detection import OCRDetection

# RECOMMENDED: Serverless mode - optimal for most use cases
# Automatically enables metadata_only=True, optimized images, and conservative parallelization
serverless_detector = OCRDetection(serverless_mode=True)

# RECOMMENDED: Serverless mode with images for OCR processing
# (12-17s for 1000+ pages with optimized ultra-fast image generation)
serverless_with_images = OCRDetection(serverless_mode=True, include_images=True)

# Traditional fast mode - 40x faster than accuracy mode
detector = OCRDetection(
    accuracy_mode=False,       # Fast mode (default)
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True,             # Enable parallel processing
    include_images=False,      # No images by default
    image_format="png",        # Image format: "png" or "jpeg"
    image_dpi=150             # Image resolution (DPI)
)

# Accuracy mode - slowest but most precise
accurate_detector = OCRDetection(accuracy_mode=True)

# Analyze a document
result = detector.detect("document.pdf")

# With custom parallel settings for large documents
result = detector.detect("large_document.pdf", parallel=True, max_workers=8)

Understanding Results

The library returns a dictionary with the following fields:

  • status: Indicates the OCR requirement

    • "true" - All pages need OCR processing
    • "false" - No pages need OCR processing
    • "partial" - Some pages need OCR processing
  • pages: List of page numbers (1-indexed) that need OCR processing

    • Empty list when status is "false"
    • Contains all page numbers when status is "true"
    • Contains specific page numbers when status is "partial"
  • page_images: Dictionary mapping page numbers to base64-encoded images (when include_images=True)

    • Only included for pages that need OCR processing
    • Page numbers are 1-indexed to match PDF page numbering
    • Images are base64-encoded PNG or JPEG strings
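The base64 strings in page_images can be decoded back to raw image bytes with the standard library. In the sketch below, the result dict is a hand-built stand-in for a real detect_ocr(..., include_images=True) return value:

```python
import base64

# Stand-in for a detect_ocr(..., include_images=True) result
result = {
    "status": "partial",
    "pages": [2],
    "page_images": {2: base64.b64encode(b"\x89PNG\r\n\x1a\n...").decode("ascii")},
}

# Decode every page image, e.g. to hand off to an external OCR engine
for page_num, b64_data in result.get("page_images", {}).items():
    image_bytes = base64.b64decode(b64_data)
    with open(f"page_{page_num}.png", "wb") as f:
        f.write(image_bytes)
```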

Examples

from ocr_detection import detect_ocr

# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}

# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}

# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}

# Example 4: With base64 images
result = detect_ocr("document.pdf", include_images=True)
# {
#   "status": "partial", 
#   "pages": [2, 5], 
#   "page_images": {
#     2: "iVBORw0KGgoAAAANSUhEUgAA...",  # base64 PNG data
#     5: "iVBORw0KGgoAAAANSUhEUgAA..."   # base64 PNG data
#   }
# }

# Example 5: Custom image settings
result = detect_ocr(
    "document.pdf", 
    include_images=True,
    image_format="jpeg",  # Use JPEG instead of PNG
    image_dpi=200        # Higher resolution
)

# Example 6: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True, max_workers=8)

# Example 7: Accuracy vs Speed modes
fast_result = detect_ocr("document.pdf")  # Fast mode (default)
accurate_result = detect_ocr("document.pdf", accuracy_mode=True)  # Accuracy mode

# Example 8: Serverless optimization (RECOMMENDED)
serverless_result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)  # Optimal balance

# Example 9: Ultra-fast classification only
classify_result = detect_ocr("document.pdf", serverless_mode=True)  # Sub-2 seconds for 1000+ pages

Image Output Options

The library can generate base64-encoded images of pages that need OCR processing:

Parameters

  • include_images: bool - Enable base64 image output (default: False)
  • image_format: str - Output format: "png" or "jpeg" (default: "png")
  • image_dpi: int - Resolution in DPI (default: 150)

Usage Notes

  • Images are only generated for pages that need OCR processing
  • Smart extraction: Scanned pages use embedded images for 5x faster processing
  • Higher DPI values produce larger but clearer images (only affects rendered pages)
  • PNG format preserves quality but has larger file sizes
  • JPEG format is more compact but may have compression artifacts
  • Page numbers in page_images match those in the pages list (1-indexed)
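The image_dpi value determines pixel dimensions for rendered pages: a page rendered at N DPI comes out at roughly page-size-in-inches × N pixels per side. A quick back-of-the-envelope helper (the function name is ours, not part of the library's API):

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

# US Letter (8.5 x 11 in) at the default 150 DPI -> (1275, 1650)
print(rendered_size(8.5, 11, 150))
```

Doubling the DPI quadruples the pixel count, so raising image_dpi increases both image size and memory use accordingly.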

Performance

Version 0.3.0 Optimization

The library now features Smart Image Extraction for dramatically improved performance:

  • 5x faster processing for scanned PDFs (2.5s → 0.54s)
  • 33% memory reduction (116MB → 79MB)
  • 8x smaller image data (15.9MB → 2.0MB)
  • 20x faster per-image processing (1.2s → 0.06s per image)

How It Works

  • Scanned PDFs: Extracts original embedded JPEG images directly (no re-rendering)
  • Text PDFs: Uses traditional rendering for vector content
  • Quality Preservation: Maintains original image compression and quality
  • Thread Safety: Works seamlessly with parallel processing
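The extract-vs-render decision described above can be sketched as pure logic. The real library operates on PDF page objects; the function below is a simplified stand-in whose name and inputs are assumptions:

```python
def image_strategy(embedded_image_count: int, text_chars: int) -> str:
    """Decide how to produce a page image.

    Scanned pages typically consist of a single full-page embedded JPEG,
    so extracting that image directly avoids an expensive re-render and
    preserves the original compression. Pages with extractable text
    (vector content) must be rasterized instead.
    """
    if embedded_image_count >= 1 and text_chars == 0:
        return "extract-embedded"   # reuse the original compressed image
    return "render"                 # rasterize the page
```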

Processing Modes

Fast Mode (Default):

  • 40x faster than accuracy mode
  • Uses optimized text extraction (PyMuPDF only)
  • Fast page classification heuristics
  • Recommended for most use cases

Accuracy Mode:

  • Maximum precision using dual text extraction
  • Comprehensive text quality analysis
  • Better for documents requiring high confidence
  • Use when precision is more important than speed

Automatic Optimization

The library automatically optimizes performance based on document size and content:

  • Documents with ≤10 pages use sequential processing
  • Larger documents automatically use parallel processing
  • Current parallel limit: 8 workers (configurable)
  • Parallel speedup: 3-8x performance improvement for large documents
  • Worker optimization: min(cpu_count, total_pages, max_workers)
  • Smart image extraction eliminates unnecessary rendering overhead
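The worker formula above can be written out directly. This is a hypothetical helper illustrating the documented behavior, not the library's internal code:

```python
import os

def effective_workers(total_pages: int, max_workers: int = 8) -> int:
    """Workers actually used: min(cpu_count, total_pages, max_workers).
    Documents with 10 pages or fewer are processed sequentially."""
    if total_pages <= 10:
        return 1
    return min(os.cpu_count() or 1, total_pages, max_workers)
```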

Performance Tuning

# For maximum speed on large documents
result = detect_ocr(
    "large_document.pdf",
    accuracy_mode=False,    # Fast mode
    parallel=True,          # Enable parallel processing
    max_workers=8          # Use up to 8 workers
)

# For maximum accuracy
result = detect_ocr(
    "document.pdf",
    accuracy_mode=True     # Accuracy mode (slower)
)

# Custom worker count for high-core systems
result = detect_ocr(
    "huge_document.pdf",
    parallel=True,
    max_workers=16         # Increase for powerful hardware
)

Benchmark Results

Large Document Test (1045 pages, 3.9MB):

  • Fast mode: ~8.0s
  • Fast mode + images: ~33.7s
  • Parallel processing: 3-8x faster than sequential
  • Memory usage: Optimized with 33% reduction

Performance Guidelines:

  • Use fast mode for general document analysis
  • Use accuracy mode when precision is critical
  • Parallel processing automatically enabled for >10 pages
  • Increase max_workers on high-core systems for better performance

License

MIT License - see LICENSE file for details
