A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR

OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.

Features

  • Page Type Detection: Automatically classifies PDF pages as text, scanned, mixed, or empty
  • Parallel Processing: Fast analysis of large PDFs using multi-threading
  • Confidence Scoring: Reliability indicators for classifications
  • Simple API: A single function call for the common case, plus a small configurable class

Installation

# Clone or download the project
cd ocr-detection

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

Usage

Quick Start

from ocr_detection import detect_ocr

# Analyze a PDF document
result = detect_ocr("document.pdf")

print(result)
# Example output: {'status': 'partial', 'pages': [1, 3, 7, 12]}

# Check the status (a string: "true", "false", or "partial"; not a bool)
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")

Using the OCRDetection Class

from ocr_detection import OCRDetection

# Initialize detector with options
detector = OCRDetection(
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True              # Enable parallel processing
)

# Analyze a document
result = detector.detect("document.pdf")

# With custom parallel settings
result = detector.detect("large_document.pdf", max_workers=4)
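When many files need checking, one detector can be reused across all of them. The sketch below groups files by the documented status values. `triage` is a hypothetical helper, not part of the library; it takes the detection callable as a parameter (in real use, `detector.detect`) and assumes only the result dictionary described in this README.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def triage(paths: Iterable[str], detect: Callable[[str], dict]) -> Dict[str, List[str]]:
    """Group PDF paths by the "status" value that detect() reports for each."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        result = detect(path)  # expected shape: {"status": ..., "pages": [...]}
        buckets[result["status"]].append(path)
    return dict(buckets)
```

For example, `triage(["a.pdf", "b.pdf"], detector.detect)` would return a dictionary keyed by "true", "false", and "partial", containing only the statuses that actually occurred.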

Understanding Results

The library returns a simple dictionary with two fields:

  • status: Indicates the OCR requirement

    • "true" - All pages need OCR processing
    • "false" - No pages need OCR processing
    • "partial" - Some pages need OCR processing
  • pages: List of page numbers (1-indexed) that need OCR processing

    • Empty list when status is "false"
    • Contains all page numbers when status is "true"
    • Contains specific page numbers when status is "partial"
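The relationship between the two fields can be sketched as a small function that builds the dictionary from one needs-OCR flag per page. This is an illustration of the documented contract, not the library's implementation; `summarize` is a hypothetical name, and treating a zero-page input as "false" is an assumption.

```python
from typing import Dict, List, Union


def summarize(needs_ocr: List[bool]) -> Dict[str, Union[str, List[int]]]:
    """Build a result dict from one needs-OCR flag per page (page 1 first)."""
    # Page numbers are 1-indexed, matching the "pages" field described above.
    pages = [number for number, flag in enumerate(needs_ocr, start=1) if flag]
    if not pages:
        status = "false"    # no page needs OCR (also covers zero pages: an assumption)
    elif len(pages) == len(needs_ocr):
        status = "true"     # every page needs OCR
    else:
        status = "partial"  # only some pages need OCR
    return {"status": status, "pages": pages}
```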

Examples

from ocr_detection import detect_ocr

# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}

# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}

# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}

# Example 4: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True)

Performance

The library chooses a processing mode based on page count:

  • Documents with ≤10 pages use sequential processing
  • Larger documents use parallel processing with configurable worker threads
  • Parallel processing provides a 3-8x speedup over sequential processing for large documents
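The size-based switch described above could look roughly like the following. This is a minimal sketch using the standard library's thread pool, not the library's actual code; `analyze_pages`, `analyze_page`, and `SEQUENTIAL_LIMIT` are hypothetical names, and only the 10-page threshold comes from this README.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")

SEQUENTIAL_LIMIT = 10  # page count at or below which threads are skipped


def analyze_pages(
    pages: Sequence[P],
    analyze_page: Callable[[P], R],
    max_workers: Optional[int] = None,
) -> List[R]:
    """Apply analyze_page to every page, returning results in page order."""
    if len(pages) <= SEQUENTIAL_LIMIT:
        # Small documents: thread start-up cost is not worth paying.
        return [analyze_page(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(analyze_page, pages))
```

Keeping results in input order means per-page results can be mapped back to 1-indexed page numbers without extra bookkeeping.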

License

MIT License - see LICENSE file for details
