A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR

OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.

Features

- Page Type Detection: Automatically classifies PDF pages as text, scanned, mixed, or empty
- Parallel Processing: Fast analysis of large PDFs using multi-threading
- Confidence Scoring: Reliability indicators for classifications
- Simple API: Easy-to-use interface with minimal complexity

Installation

# Clone or download the project
cd ocr-detection

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

Usage

Quick Start

from ocr_detection import detect_ocr

# Analyze a PDF document
result = detect_ocr("document.pdf")

print(result)
# Output: {"status": "partial", "pages": [1, 3, 7, 12]}

# Check the status
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")

Using the OCRDetection Class

from ocr_detection import OCRDetection

# Initialize detector with options
detector = OCRDetection(
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True              # Enable parallel processing
)

# Analyze a document
result = detector.detect("document.pdf")

# With custom parallel settings
result = detector.detect("large_document.pdf", max_workers=4)

Understanding Results

The library returns a simple dictionary with two fields. Note that the status values are strings, not booleans:

- status: Indicates whether the document needs OCR
  - "true" - All pages need OCR processing
  - "false" - No pages need OCR processing
  - "partial" - Some pages need OCR processing
- pages: List of page numbers (1-indexed) that need OCR processing
  - Empty list when status is "false"
  - Contains all page numbers when status is "true"
  - Contains specific page numbers when status is "partial"
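
The mapping between per-page decisions and the result dictionary can be sketched as follows. This is an illustrative helper, not the library's implementation: `summarize_pages` is a hypothetical name, and treating a zero-page document as "false" is an assumption the text above does not confirm.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, List[int]]]


def summarize_pages(needs_ocr: List[bool]) -> Result:
    """Build a result dict from per-page flags (hypothetical helper)."""
    # Page numbers are 1-indexed, matching the documented "pages" field.
    pages = [number for number, flag in enumerate(needs_ocr, start=1) if flag]

    if not pages:
        # Assumption: a document with no pages at all also maps to "false".
        status = "false"
    elif len(pages) == len(needs_ocr):
        status = "true"
    else:
        status = "partial"

    return {"status": status, "pages": pages}


print(summarize_pages([False, True, False, False, True]))
# {'status': 'partial', 'pages': [2, 5]}
```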

Examples

from ocr_detection import detect_ocr

# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}

# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}

# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}

# Example 4: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True)

Performance

The library automatically optimizes performance based on document size:

- Documents with ≤10 pages use sequential processing
- Larger documents use parallel processing with a configurable number of worker threads
- Parallel processing gives a 3-8x speedup on large documents
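
The size-based switch described above could look roughly like this. It is a sketch under stated assumptions, not the library's code: `analyze_pages`, `classify_page`, and `fake_classifier` are hypothetical names, and only the 10-page threshold and the use of worker threads come from the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

SEQUENTIAL_LIMIT = 10  # from the text: documents with <=10 pages run sequentially


def analyze_pages(
    page_count: int,
    classify_page: Callable[[int], bool],
    max_workers: Optional[int] = None,
) -> List[bool]:
    """Return one needs-OCR flag per page, in page order."""
    page_numbers = range(1, page_count + 1)

    if page_count <= SEQUENTIAL_LIMIT:
        # Small document: thread start-up cost is not worth paying.
        return [classify_page(n) for n in page_numbers]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map() yields results in input order, so flags stay aligned
        # with page numbers even though pages finish at different times.
        return list(pool.map(classify_page, page_numbers))


def fake_classifier(page_number: int) -> bool:
    """Stand-in for real page analysis: pretend even pages are scanned."""
    return page_number % 2 == 0


flags = analyze_pages(12, fake_classifier, max_workers=4)
print([n for n, flag in enumerate(flags, start=1) if flag])
# [2, 4, 6, 8, 10, 12]
```

Threads help here only if the per-page work releases the GIL or waits on I/O. Whether that holds depends on the PDF backend, which the text does not name.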

License

MIT License - see LICENSE file for details

Release history

- 0.4.1 - Aug 22, 2025
- 0.4.0 - Aug 22, 2025
- 0.3.0 - Aug 21, 2025
- 0.2.0 - Aug 21, 2025 (this version)
- 0.1.2 - Aug 13, 2025
- 0.1.0 - Aug 11, 2025

Download files

Source Distribution

ocr_detection-0.2.0.tar.gz (14.0 kB)

Uploaded Aug 21, 2025 Source

Built Distribution

ocr_detection-0.2.0-py3-none-any.whl (15.5 kB)

Uploaded Aug 21, 2025 Python 3

File details

Details for the file ocr_detection-0.2.0.tar.gz.

File metadata

Download URL: ocr_detection-0.2.0.tar.gz
Upload date: Aug 21, 2025
Size: 14.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for ocr_detection-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d3078f510fcf57beab8b8e6f24baf359ad5e17932fb346c0d27b0a1795635404`
MD5	`8c5b32dfcf63860bc970ba8d745560c9`
BLAKE2b-256	`34e7478d146f03caa0567c953610670568ff14305011c3f78741d1293342ce43`

File details

Details for the file ocr_detection-0.2.0-py3-none-any.whl.

File metadata

Download URL: ocr_detection-0.2.0-py3-none-any.whl
Upload date: Aug 21, 2025
Size: 15.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for ocr_detection-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07f10665ff54fa5d68718b7e144f454bbf63f1f75ad5f7abd63e88724e55918a`
MD5	`59daaa14545f47d1841f824abb00c14a`
BLAKE2b-256	`0314f7b627629581471d75b40d4a105db72f14b8487b865d250048ebb6b99c5a`
