A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR

OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing. Now with parallel processing support for faster analysis of large PDFs!

Features

Page Type Detection: Automatically classifies PDF pages as:
  - text: Pages with extractable text content
  - scanned: Pages that are primarily scanned images
  - mixed: Pages with both text and significant image content
  - empty: Pages with minimal content
Parallel Processing: Fast analysis of large PDFs using multi-threading
  - Automatic optimization based on PDF size
  - Configurable worker threads
  - 3-8x claimed speedup for large documents (the benchmarks below report 3.1-6.0x with 4 workers)
Content Analysis: Advanced text quality metrics and OCR artifact detection
CLI Interface: Easy-to-use command-line tool with parallel options
Multiple Output Formats: JSON, CSV, and text summary formats
Confidence Scoring: Reliability indicators for classifications

Installation

# Clone or download the project
cd ocr-detection

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

Quick Start

Python Library Usage

Simple API (Recommended)

from ocr_detection import OCRDetection, detect_ocr

# Method 1: Using the class
detector = OCRDetection()
result = detector.detect("document.pdf")

print(result)
# Output: {'status': 'partial', 'pages': [1, 3, 7, 12]}

# Method 2: Using the convenience function
result = detect_ocr("document.pdf")

if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")

# Method 3: With parallel processing for faster analysis
detector = OCRDetection(parallel=True)
result = detector.detect("large_document.pdf", max_workers=4)
# Or simply:
result = detect_ocr("large_document.pdf", parallel=True)
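
The result dictionary shown above can be expanded into an explicit list of pages to send to an OCR engine. The following is a minimal sketch, not part of the library: the helper name `pages_requiring_ocr` is hypothetical, and it assumes page numbers are 1-based, which the README does not state.

```python
from __future__ import annotations

from typing import Any


def pages_requiring_ocr(result: dict[str, Any], total_pages: int) -> list[int]:
    """Expand a {"status": ..., "pages": ...} result into a page list.

    Assumption: page numbers are 1-based. Adjust the range below if the
    library turns out to report 0-based indices.
    """
    status = result["status"]
    if status == "true":        # every page needs OCR
        return list(range(1, total_pages + 1))
    if status == "false":       # no page needs OCR
        return []
    if status == "partial":     # only the listed pages need OCR
        return sorted(result["pages"])
    raise ValueError(f"unexpected status: {status!r}")
```

With the sample output above, `pages_requiring_ocr(result, total_pages=12)` would return `[1, 3, 7, 12]`.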

Enhanced API

from ocr_detection import OCRDetector

# Initialize detector
detector = OCRDetector()

# Quick check
recommendation = detector.quick_check("document.pdf")
print(f"Recommendation: {recommendation}")

# Get pages needing OCR
pages = detector.get_pages_needing_ocr("document.pdf")
print(f"Pages needing OCR: {pages}")

# Detailed analysis
result = detector.analyze_pdf("document.pdf")
print(f"Total pages: {result.total_pages}")
print(f"Pages needing OCR: {result.pages_needing_ocr}")

Command Line Usage

# Basic analysis
uv run ocr-detect document.pdf

# Analyze specific page
uv run ocr-detect document.pdf --page 0

# Generate JSON output
uv run ocr-detect document.pdf --format json --output results.json

# Verbose analysis with text preview
uv run ocr-detect document.pdf --verbose --include-text

# CSV export with custom confidence threshold
uv run ocr-detect document.pdf --format csv --confidence-threshold 0.8

# Parallel processing for large PDFs
uv run ocr-detect large-document.pdf --parallel

# Parallel processing with custom worker count
uv run ocr-detect large-document.pdf --parallel --workers 4 --verbose

CLI Options

Option	Description
`--output, -o`	Output file path (format determined by extension)
`--format, -f`	Output format: json, csv, text, or summary (default)
`--page, -p`	Analyze specific page only (0-indexed)
`--verbose, -v`	Show detailed analysis and timing information
`--include-text`	Include extracted text preview in output
`--confidence-threshold`	Minimum confidence threshold (default: 0.5)
`--parallel`	Enable parallel processing for faster analysis
`--workers`	Number of worker threads for parallel processing
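
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags listed in the table; the function name `build_ocr_detect_command` and its keyword names are hypothetical, and it assumes the tool is launched through `uv run` as in the examples above.

```python
from __future__ import annotations

from pathlib import Path


def build_ocr_detect_command(
    pdf: str | Path,
    *,
    fmt: str | None = None,
    output: str | Path | None = None,
    page: int | None = None,
    confidence_threshold: float | None = None,
    parallel: bool = False,
    workers: int | None = None,
    verbose: bool = False,
) -> list[str]:
    """Build an argv list for ocr-detect from the documented options."""
    cmd = ["uv", "run", "ocr-detect", str(pdf)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    if page is not None:  # page 0 is valid, so compare against None
        cmd += ["--page", str(page)]
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    if parallel:
        cmd.append("--parallel")
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.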

Example Output

PDF CONTENT ANALYSIS SUMMARY
============================================================

Total Pages: 10
Average Confidence: 0.85

Page Type Distribution:
  Text    :   6 pages ( 60.0%)
  Scanned :   3 pages ( 30.0%)
  Mixed   :   1 pages ( 10.0%)

Recommendation: Consider OCR for optimal text extraction

⚠️  Pages with low confidence (< 0.5):
  Page 7: mixed (confidence: 0.45)
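
The distribution and low-confidence sections of such a summary can be derived from per-page classifications. The sketch below is illustrative only: the `(page_type, confidence)` input shape and the name `render_summary` are hypothetical stand-ins for the library's own result objects, and page numbers are assumed to be 1-based.

```python
from __future__ import annotations

from collections import Counter


def render_summary(pages: list[tuple[str, float]], threshold: float = 0.5) -> list[str]:
    """Return summary lines for a list of (page_type, confidence) pairs."""
    total = len(pages)
    counts = Counter(page_type for page_type, _ in pages)
    lines = [f"Total Pages: {total}", "", "Page Type Distribution:"]
    # most_common() lists the most frequent page type first
    for page_type, count in counts.most_common():
        pct = 100.0 * count / total
        lines.append(f"  {page_type.capitalize():<8}: {count:>3} pages ({pct:5.1f}%)")
    # Flag pages whose confidence falls below the threshold
    low = [(n, t, c) for n, (t, c) in enumerate(pages, start=1) if c < threshold]
    if low:
        lines.append("")
        lines.append(f"Pages with low confidence (< {threshold}):")
        lines.extend(f"  Page {n}: {t} (confidence: {c:.2f})" for n, t, c in low)
    return lines
```

Calling `print("\n".join(render_summary(pages)))` prints the lines in a layout close to the example above.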

Testing

# Run all unit tests
uv run pytest tests/

# Run basic functionality test
uv run python test_basic.py

# Run integration tests with real PDFs
uv run python tests/test_integration_basic.py
uv run python tests/test_integration_advanced.py

# Run specific test modules
uv run pytest tests/test_detector.py::TestParallelProcessing -v

# Run with coverage (if pytest-cov is installed)
uv run pytest tests/ --cov=ocr_detection

Use Cases

Document Processing Pipelines: Automatically determine optimal text extraction method
OCR Pre-processing: Identify which pages need OCR vs direct text extraction
Content Quality Assessment: Evaluate PDF text extraction reliability
Batch Document Analysis: Process large collections of PDF files efficiently
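
For the batch use case, a collection of PDFs can be grouped by the status each one receives. This is a sketch under assumptions: `triage_documents` is a hypothetical helper, and the detection function is injected as a parameter so that anything returning the `{"status": ...}` shape from the Simple API (for example `detect_ocr`) can be plugged in.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from typing import Any


def triage_documents(
    paths: Iterable[str],
    detect: Callable[[str], dict[str, Any]],
) -> dict[str, list[str]]:
    """Group PDF paths by the status that `detect` reports for each one."""
    groups: dict[str, list[str]] = {"true": [], "false": [], "partial": []}
    for path in paths:
        status = detect(path)["status"]
        # setdefault keeps the sketch robust to a status not listed above
        groups.setdefault(status, []).append(path)
    return groups
```

Documents in the `"false"` group can go straight to direct text extraction, while the `"true"` and `"partial"` groups are candidates for an OCR step.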

Parallel Processing

The library automatically optimizes processing based on PDF size:

Small PDFs (≤10 pages): Sequential processing for minimal overhead
Large PDFs (>10 pages): Parallel processing with multi-threading
Automatic worker management: Intelligently selects thread count based on CPU cores and document size
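
The size-based switch described above could be sketched as follows. This is not the library's implementation: the 10-page threshold comes from the text, but the worker heuristic (`min(cores, pages)`) and the names `choose_workers` and `analyze_pages` are assumptions made for illustration.

```python
from __future__ import annotations

import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")

PARALLEL_THRESHOLD = 10  # pages; PDFs at or below this size run sequentially


def choose_workers(page_count: int, cpu_count: int | None = None) -> int:
    """Hypothetical heuristic: one worker for small PDFs, else min(cores, pages)."""
    if page_count <= PARALLEL_THRESHOLD:
        return 1
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(cores, page_count))


def analyze_pages(
    page_count: int,
    analyze_page: Callable[[int], T],
    max_workers: int | None = None,
) -> list[T]:
    """Run analyze_page over every page index, in parallel when worthwhile."""
    workers = max_workers or choose_workers(page_count)
    if workers == 1:
        return [analyze_page(i) for i in range(page_count)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so page order is kept
        return list(pool.map(analyze_page, range(page_count)))
```

Threads only help here if the per-page work releases the GIL or waits on I/O; whether that holds for a given PDF backend is something to measure rather than assume.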

Performance Benchmarks

PDF Size	Sequential Time	Parallel Time (4 workers)	Speedup
10 pages	0.5s	0.5s	1x (sequential used)
50 pages	2.5s	0.8s	3.1x
100 pages	5.0s	1.3s	3.8x
500 pages	25.0s	4.2s	6.0x
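
The Speedup column is the sequential time divided by the parallel time. As a worked check of the table's arithmetic (the helper names below are illustrative, and parallel efficiency is an extra metric not reported in the table):

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Speedup = sequential time / parallel time."""
    return sequential_s / parallel_s


def efficiency(sequential_s: float, parallel_s: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean perfectly linear scaling."""
    return speedup(sequential_s, parallel_s) / workers


# (pages, sequential seconds, parallel seconds) taken from the table above
benchmarks = [(50, 2.5, 0.8), (100, 5.0, 1.3), (500, 25.0, 4.2)]

for pages, seq, par in benchmarks:
    print(f"{pages} pages: {speedup(seq, par):.1f}x, "
          f"efficiency {efficiency(seq, par, workers=4):.2f}")
```

An efficiency above 1.0 would indicate a speedup larger than the worker count, so it is worth re-measuring on your own hardware before relying on any single row.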

Advanced Parallel Usage

from ocr_detection import PDFAnalyzer

# Manual control over parallel processing
with PDFAnalyzer("large_document.pdf") as analyzer:
    # Use parallel processing with custom worker count
    results = analyzer.analyze_all_pages_parallel(max_workers=8)
    
    # Or let the system decide
    results = analyzer.analyze_all_pages_auto(parallel=True)
    
    # Get summary with timing info
    summary = analyzer.get_summary(results)

Technical Details

The library uses multiple detection methods:

Text Extraction: Uses both PyMuPDF and pdfplumber for robust text extraction
Image Analysis: Detects and measures embedded images using PyMuPDF
Content Ratios: Calculates text-to-image ratios for classification
Quality Metrics: Analyzes text characteristics and OCR artifacts
Confidence Scoring: Provides reliability indicators based on multiple factors
Parallel Processing: Thread-safe page analysis with automatic optimization
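
To illustrate how text and image measurements might combine into the four page types, here is a minimal rule-based sketch. It is not the library's algorithm: the `PageMetrics` fields, the function name, and both thresholds are hypothetical values chosen for the example.

```python
from typing import NamedTuple


class PageMetrics(NamedTuple):
    text_chars: int        # number of extractable characters on the page
    image_coverage: float  # fraction of the page area covered by images (0..1)


def classify_page(
    metrics: PageMetrics,
    min_chars: int = 50,             # hypothetical threshold
    image_threshold: float = 0.5,    # hypothetical threshold
) -> str:
    """Return 'text', 'scanned', 'mixed', or 'empty' from simple ratios."""
    has_text = metrics.text_chars >= min_chars
    has_images = metrics.image_coverage >= image_threshold
    if has_text and has_images:
        return "mixed"
    if has_text:
        return "text"
    if has_images:
        return "scanned"
    return "empty"
```

A real detector would likely also weigh text quality and OCR artifacts, as the list above describes, and attach a confidence score instead of returning a bare label.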

Dependencies

Python 3.13+
PyMuPDF (fitz) - PDF processing and image extraction
pdfplumber - Alternative text extraction
click - CLI interface

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions to the OCR Detection Library! Please see our Contributing Guide for details on:

Code of Conduct
How to submit bug reports and feature requests
Development setup and workflow
Pull request process
Code style and testing requirements

Quick start for contributors:

# Fork and clone the repository
git clone https://github.com/yourusername/ocr-detection.git
cd ocr-detection

# Set up development environment
uv sync

# Run tests before submitting PR
uv run pytest tests/

Project details

Release history

Version	Released
0.4.1	Aug 22, 2025
0.4.0	Aug 22, 2025
0.3.0	Aug 21, 2025
0.2.0	Aug 21, 2025
0.1.2	Aug 13, 2025
0.1.0 (this version)	Aug 11, 2025

Download files

Source Distribution

ocr_detection-0.1.0.tar.gz (18.5 kB), uploaded Aug 11, 2025 (Source)

Built Distribution

ocr_detection-0.1.0-py3-none-any.whl (22.2 kB), uploaded Aug 11, 2025 (Python 3)

File details

Details for the file ocr_detection-0.1.0.tar.gz.

File metadata

Download URL: ocr_detection-0.1.0.tar.gz
Upload date: Aug 11, 2025
Size: 18.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for ocr_detection-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`adb580538acbd6f2b5a7f5896046017628a34f0d05af201856c5f55b0effce07`
MD5	`5725087849d33502f46ce5a347544e72`
BLAKE2b-256	`a187af3676354cae90c9a75474bfdbd38f2e442f1c095084f36bb7240b8053ac`

File details

Details for the file ocr_detection-0.1.0-py3-none-any.whl.

File metadata

Download URL: ocr_detection-0.1.0-py3-none-any.whl
Upload date: Aug 11, 2025
Size: 22.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for ocr_detection-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8df04b78a738781af3446ae82190fc587863f098a012a72955fcfb687aab9d7`
MD5	`71daf36619858e1fe0298ad637616c74`
BLAKE2b-256	`af40ae0d8ab29e2c72c09cd0e0c6eca4fecd5e93ffd16643c4ae4215d5c9bdcd`
