PDFOrienter

Intelligent, parallel PDF page rotation correction for Python.

PDFOrienter analyses every page of one or more PDF files, detects incorrect orientations, and fixes them in a single write pass — with no unnecessary re-processing.


Features

  • Two-phase pipeline — detect all pages in parallel, then apply all corrections in a single write
  • Smart strategy selection — uses fast text-direction analysis for text-based pages; falls back to Tesseract OSD only for image/scanned pages
  • Dynamic parallelism — automatically uses 75 % of available CPU cores; scales from 4 to 64+ cores without any configuration change
  • Detailed structured logging — per-page and per-file timing, rotation details, confidence scores, RAM and CPU usage
  • Zero intermediate files — corrected PDFs are written once; originals are never modified
  • Package-ready — clean modular design, typed, fully testable

Requirements

Python

Python 3.10 or newer.

System dependency — Tesseract

Tesseract must be installed on the host system before installing PDFOrienter.

Ubuntu / Debian

sudo apt-get update && sudo apt-get install -y tesseract-ocr

macOS (Homebrew)

brew install tesseract

Windows

Download and run the installer from the Tesseract UB Mannheim releases, then add the install directory to your PATH.
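Before installing PDFOrienter, it can help to confirm that the Tesseract binary is reachable from Python. The following is a minimal sketch using only the standard library; the helper names tesseract_available and tesseract_version are hypothetical and not part of PDFOrienter.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if it is not installed."""
    if not tesseract_available(binary):
        return None
    proc = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr rather than stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```

If this prints "Tesseract not found on PATH" on Windows, the install directory has most likely not been added to PATH yet.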


Installation

pip install pdforienter

For development (includes linting + test tools):

git clone https://github.com/your-org/pdforienter.git
cd pdforienter
pip install -e ".[dev]"

Quick Start

Command line

# Fix a single PDF
pdforienter invoice.pdf --output ./fixed

# Fix every PDF in a directory
pdforienter /scans/ --output /corrected

# Mix files and directories
pdforienter report.pdf /archive/ receipts.pdf --output ./out

Python API

from pdforienter import run_pipeline
from pdforienter.logging.writer import write_log

result = run_pipeline(
    pdf_paths=["invoice.pdf", "report.pdf"],
    output_dir="./corrected",
)

# Write the structured log file
log_path = write_log(result, "./corrected")

print(f"{result.total_pages_changed} pages corrected in {result.total_duration_seconds:.1f}s")
print(f"Log: {log_path}")
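The command line accepts a mix of files and directories, while run_pipeline takes a list of paths. One way to bridge the two from Python is sketched below. The helper collect_pdfs is hypothetical (it is not part of the PDFOrienter API), and whether the real CLI recurses into sub-directories is an assumption made here for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def collect_pdfs(inputs: Iterable[str]) -> List[str]:
    """Expand files and directories into a sorted, de-duplicated list of PDF paths."""
    found = set()
    for item in inputs:
        path = Path(item)
        if path.is_dir():
            # Assumption: recurse into sub-directories; match ".pdf" case-insensitively.
            found.update(
                p for p in path.rglob("*")
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
    return sorted(str(p) for p in found)


# Possible usage with the API shown above (requires pdforienter to be installed):
#   result = run_pipeline(
#       pdf_paths=collect_pdfs(["report.pdf", "/archive/"]),
#       output_dir="./out",
#   )
```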

Log File

Every run produces a timestamped .log file in the output directory.

PDFOrienter Run Log — 2024-11-01 14:32:05
============================================================

[RUN SUMMARY]
  Total files processed : 3
  Total pages           : 247
  Pages rotated         : 18
  Text pages            : 201
  Scanned pages (OCR)   : 46
  Skipped pages         : 0
  Workers used          : 6
  Peak RAM usage        : 312.4 MB
  Total time            : 42.18s

------------------------------------------------------------
[FILE] /scans/invoice.pdf
  Output          : /corrected/invoice_corrected.pdf
  Total pages     : 12
  Pages changed   : 3
  Text pages      : 8
  Scanned pages   : 4
  Skipped pages   : 0
  Detection time  : 9.41s
  Correction time : 0.23s
  Total time      : 9.64s
  [PAGE DETAILS]
     p   1 | text    | OK      | angle=  0° | conf= 98.2 | 0.11s | No rotation needed.
     p   2 | scanned | CHANGED | angle= 90° | conf= 87.5 | 2.34s | Rotation of 90° detected (confidence 87.5).
     ...
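Because the page rows are pipe-separated, they are straightforward to read back for reporting. The sketch below parses one [PAGE DETAILS] row. It assumes the exact column layout of the sample above; PageLine and parse_page_line are illustrative names, not part of PDFOrienter.

```python
from dataclasses import dataclass


@dataclass
class PageLine:
    page: int
    kind: str          # "text" or "scanned" in the sample
    status: str        # "OK" or "CHANGED" in the sample
    angle: int         # degrees
    confidence: float
    seconds: float
    message: str


def parse_page_line(line: str) -> PageLine:
    """Parse one page row, assuming the seven pipe-separated columns of the sample log."""
    # Split into at most 7 fields so a "|" inside the message would be preserved.
    fields = [field.strip() for field in line.split("|", 6)]
    if len(fields) != 7 or not fields[0].startswith("p"):
        raise ValueError(f"unexpected page line: {line!r}")
    page, kind, status, angle, conf, secs, message = fields
    return PageLine(
        page=int(page[1:].strip()),                        # "p   2"      -> 2
        kind=kind,
        status=status,
        angle=int(angle.split("=", 1)[1].strip(" °")),     # "angle= 90°" -> 90
        confidence=float(conf.split("=", 1)[1]),           # "conf= 87.5" -> 87.5
        seconds=float(secs.rstrip("s")),                   # "2.34s"      -> 2.34
        message=message,
    )
```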

Project Structure

pdforienter/
├── pdforienter/
│   ├── __init__.py          # Public API: run_pipeline
│   ├── config.py            # Tuneable constants (worker count, thresholds)
│   ├── models.py            # Typed data classes (PageResult, FileResult, RunResult)
│   ├── cli.py               # Command-line interface
│   ├── core/
│   │   ├── pipeline.py      # Top-level orchestrator
│   │   ├── processor.py     # Per-file orchestrator (Phase 1 + Phase 2)
│   │   ├── analyzer.py      # Per-page worker (dispatched to subprocess)
│   │   ├── classifier.py    # Text vs scanned page detection
│   │   ├── detector.py      # Orientation detection (text + OSD strategies)
│   │   └── corrector.py     # Single-pass rotation applier
│   ├── logging/
│   │   ├── formatter.py     # RunResult → structured log string
│   │   └── writer.py        # Write log file to disk
│   └── utils/
│       ├── fs.py            # Filesystem helpers
│       └── resources.py     # CPU / RAM telemetry
├── tests/
│   └── test_core.py
├── pyproject.toml
└── README.md

Configuration

All tuneable constants live in pdforienter/config.py.

Constant                          Default                   Description
--------------------------------  ------------------------  ---------------------------------------------------
MAX_WORKERS                       floor(cpu_count × 0.75)   Worker processes for parallel page analysis
OSD_CONFIDENCE_THRESHOLD          10.0                      Minimum Tesseract OSD confidence to trust a result
TESSERACT_OSD_PSM                 0                         Tesseract page segmentation mode (0 = OSD only)
_RENDER_DPI (detector.py)         150                       DPI used when rasterising pages for OSD
_MIN_CHAR_COUNT (classifier.py)   20                        Minimum characters to classify a page as text-based
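As a rough illustration of the MAX_WORKERS default, the formula floor(cpu_count × 0.75) can be written as below. The function name default_max_workers and the clamp to a minimum of one worker are assumptions for this sketch; the actual code in config.py may differ.

```python
import math
import os
from typing import Optional


def default_max_workers(cpu_count: Optional[int] = None, fraction: float = 0.75) -> int:
    """Return floor(cpu_count * fraction), clamped to at least one worker."""
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    # Assumption: never go below one worker, even on a single-core host.
    return max(1, math.floor(cpu_count * fraction))


if __name__ == "__main__":
    for cores in (4, 8, 64):
        print(cores, "cores ->", default_max_workers(cores), "workers")
```

On an 8-core machine this gives 6 workers, which matches the "Workers used" value in the sample log.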

How It Works

Phase 1 — Parallel Detection

Each page is dispatched to a subprocess worker via ProcessPoolExecutor. Workers run concurrently up to MAX_WORKERS.
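The dispatch pattern described here can be sketched with the standard library alone. This is not PDFOrienter's actual analyzer: analyse_page is a placeholder worker, and the task tuple layout and the executor_cls parameter are assumptions made so the example stays self-contained.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List, Sequence, Tuple

Task = Tuple[str, int]            # (pdf_path, zero-based page index)
Result = Tuple[str, int, int]     # (pdf_path, page index, detected angle)


def analyse_page(task: Task) -> Result:
    """Placeholder worker; real classification and detection would happen here."""
    path, page_index = task
    return path, page_index, 0


def analyse_all(
    tasks: Sequence[Task],
    max_workers: int,
    worker: Callable[[Task], Result] = analyse_page,
    executor_cls=ProcessPoolExecutor,
) -> List[Result]:
    """Run one worker call per page, with at most max_workers running concurrently."""
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with page order.
        return list(pool.map(worker, tasks))


if __name__ == "__main__":
    # The worker must be a module-level function so it can be pickled for subprocesses.
    pages = [("invoice.pdf", i) for i in range(12)]
    print(analyse_all(pages, max_workers=6))
```

Passing only a path and a page index to each worker keeps the pickled payload small; each subprocess would then open the PDF itself.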

For each page:

  1. Classify — does the page have selectable text?
  2. Detect orientation
    • Text page → analyse character direction vectors (fast, no OCR)
    • Scanned page → rasterise at 150 DPI and run Tesseract OSD
  3. Return a PageResult with the detected angle, confidence, and timing
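The per-page decision can be illustrated without running Tesseract by working on text that has already been extracted. The sketch below assumes that classification counts non-whitespace characters and that the OSD result arrives as Tesseract's plain-text OSD report (the "Rotate:" and "Orientation confidence:" lines). The names is_text_page, parse_osd, angle_to_apply and OsdResult are hypothetical; the two constants mirror the defaults listed under Configuration.

```python
import re
from dataclasses import dataclass
from typing import Optional

MIN_CHAR_COUNT = 20               # mirrors _MIN_CHAR_COUNT
OSD_CONFIDENCE_THRESHOLD = 10.0   # mirrors OSD_CONFIDENCE_THRESHOLD


@dataclass
class OsdResult:
    rotate: int         # degrees reported on the "Rotate:" line
    confidence: float   # value reported on the "Orientation confidence:" line


def is_text_page(extracted_text: str) -> bool:
    """Assumption: a page is text-based if it has enough non-whitespace characters."""
    return len("".join(extracted_text.split())) >= MIN_CHAR_COUNT


def parse_osd(osd_output: str) -> Optional[OsdResult]:
    """Pull rotation and confidence out of Tesseract's OSD text report."""
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_output, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", osd_output, re.MULTILINE)
    if rotate is None or conf is None:
        return None
    return OsdResult(int(rotate.group(1)), float(conf.group(1)))


def angle_to_apply(osd: Optional[OsdResult]) -> int:
    """Return 0 when the OSD result is missing or below the confidence threshold."""
    if osd is None or osd.confidence < OSD_CONFIDENCE_THRESHOLD:
        return 0
    return osd.rotate % 360
```

With pytesseract installed, the report string could come from pytesseract.image_to_osd(image); whether PDFOrienter itself uses pytesseract is not stated here.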

Phase 2 — Single-Pass Correction

After all pages are analysed, every rotation is applied to the document in memory, and a single fitz.Document.save() call writes the corrected PDF. No intermediate files are created.
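A single-pass correction along these lines might look like the following with PyMuPDF. This is a sketch, not the code in corrector.py: the function names are hypothetical, and it assumes the detected angle is a clockwise correction to add to the page's existing rotation, which should be checked against the detector's sign convention.

```python
from typing import Dict


def corrected_rotation(current: int, detected: int) -> int:
    """Combine the page's existing rotation with the detected correction (degrees)."""
    return (current + detected) % 360


def apply_rotations(src: str, dst: str, angles: Dict[int, int]) -> int:
    """Set rotations for the given zero-based page indices, then save once.

    Returns the number of pages whose rotation was changed.
    """
    import fitz  # PyMuPDF; imported lazily so corrected_rotation works without it

    doc = fitz.open(src)
    try:
        changed = 0
        for index, angle in angles.items():
            if angle % 360 == 0:
                continue  # nothing to fix on this page
            page = doc[index]
            page.set_rotation(corrected_rotation(page.rotation, angle))
            changed += 1
        doc.save(dst)  # the only write; dst must differ from src, which stays untouched
        return changed
    finally:
        doc.close()
```

For example, apply_rotations("invoice.pdf", "invoice_corrected.pdf", {1: 90}) would rotate only the second page and write the output file once.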


Performance

Typical estimates on an 8-core server (6 workers) with mixed text/scanned PDFs:

Scenario                       Estimate
-----------------------------  --------------
2 000 pages, all text-based    ~1–2 minutes
2 000 pages, mixed 50/50       ~7–8 minutes
2 000 pages, all scanned       ~15–17 minutes

RAM usage: roughly 200–400 MB per Tesseract worker, so 6 workers need about 1.2–2.4 GB. Budgeting around 2.5 GB at peak leaves ample headroom on a 16 GB server.
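The RAM figure is simple multiplication of the per-worker range by the worker count. A small sketch of that arithmetic follows; the function names are illustrative, and the 200 and 400 MB defaults are the per-worker estimates quoted above rather than measured values.

```python
from typing import Tuple


def ram_budget_mb(workers: int, low_mb: int = 200, high_mb: int = 400) -> Tuple[int, int]:
    """Return the (low, high) RAM estimate in MB for the given number of workers."""
    return workers * low_mb, workers * high_mb


def max_workers_for_ram(available_mb: int, high_mb: int = 400) -> int:
    """Return how many workers fit in available_mb at the pessimistic per-worker figure."""
    return max(1, available_mb // high_mb)


if __name__ == "__main__":
    low, high = ram_budget_mb(6)
    print(f"6 workers: {low}-{high} MB")
```

On a memory-constrained host, the second helper suggests a cap that could be compared against the CPU-based MAX_WORKERS default.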


Running Tests

pytest tests/ -v

License

MIT
