PDFOrienter

Intelligent, parallel PDF page rotation correction for Python.

PDFOrienter analyses every page of one or more PDF files, detects incorrect orientations, and fixes them in a single write pass — with no unnecessary re-processing.

Features

Two-phase pipeline — detect all pages in parallel, then apply all corrections in a single write
Smart strategy selection — uses fast text-direction analysis for text-based pages; falls back to Tesseract OSD only for image/scanned pages
Dynamic parallelism — automatically uses 75 % of available CPU cores; scales from 4 to 64+ cores without any configuration change
Detailed structured logging — per-page and per-file timing, rotation details, confidence scores, RAM and CPU usage
Zero intermediate files — corrected PDFs are written once; originals are never modified
Package-ready — clean modular design, typed, fully testable
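
The dynamic parallelism rule above can be sketched in a few lines. This is an illustration only: `default_max_workers` is a hypothetical helper name, and the floor of one worker on very small machines is an assumption rather than documented behaviour.

```python
import math
import os
from typing import Optional


def default_max_workers(cpu_count: Optional[int] = None, fraction: float = 0.75) -> int:
    """Return floor(cpu_count * fraction), but never fewer than one worker."""
    # os.cpu_count() can return None, so fall back to a single core.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, math.floor(cores * fraction))
```

On an 8-core machine this yields 6 workers, which matches the "Workers used" figure in the sample log.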

Requirements

Python

Python 3.10 or newer.

System dependency — Tesseract

Tesseract must be installed on the host system before installing PDFOrienter.

Ubuntu / Debian

sudo apt-get update && sudo apt-get install -y tesseract-ocr

macOS (Homebrew)

brew install tesseract

Windows

Download and run the installer from the UB Mannheim Tesseract releases page, then add the install directory to your PATH.

Installation

pip install pdforienter

For development (includes linting + test tools):

git clone https://github.com/your-org/pdforienter.git
cd pdforienter
pip install -e ".[dev]"

Quick Start

Command line

# Fix a single PDF
pdforienter invoice.pdf --output ./fixed

# Fix every PDF in a directory
pdforienter /scans/ --output /corrected

# Mix files and directories
pdforienter report.pdf /archive/ receipts.pdf --output ./out

Python API

from pdforienter import run_pipeline
from pdforienter.logging.writer import write_log

result = run_pipeline(
    pdf_paths=["invoice.pdf", "report.pdf"],
    output_dir="./corrected",
)

# Write the structured log file
log_path = write_log(result, "./corrected")

print(f"{result.total_pages_changed} pages corrected in {result.total_duration_seconds:.1f}s")
print(f"Log: {log_path}")
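
If you want the API to accept the same mix of files and directories as the command line, you can expand the inputs yourself before calling run_pipeline. The sketch below is one way to do that; `collect_pdfs` is a hypothetical helper, and the choice to look only at the top level of each directory is an assumption, not a description of what the CLI does.

```python
from pathlib import Path
from typing import Iterable


def collect_pdfs(inputs: Iterable[str]) -> list[str]:
    """Expand files and directories into a sorted, de-duplicated list of PDF paths."""
    found: set[Path] = set()
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            # Assumption: top-level files only; use path.rglob("*") to recurse.
            found.update(
                p for p in path.iterdir()
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.add(path)
        # Missing paths and non-PDF files are silently skipped.
    return sorted(str(p) for p in found)
```

The returned list can be passed directly as pdf_paths.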

Log File

Every run produces a timestamped .log file in the output directory.

PDFOrienter Run Log — 2024-11-01 14:32:05
============================================================

[RUN SUMMARY]
  Total files processed : 3
  Total pages           : 247
  Pages rotated         : 18
  Text pages            : 201
  Scanned pages (OCR)   : 46
  Skipped pages         : 0
  Workers used          : 6
  Current RAM usage     : 312.4 MB
  Total time            : 42.18s

------------------------------------------------------------
[FILE] /scans/invoice.pdf
  Output          : /corrected/invoice_corrected.pdf
  Total pages     : 12
  Pages changed   : 3
  Text pages      : 8
  Scanned pages   : 4
  Skipped pages   : 0
  Detection time  : 9.41s
  Correction time : 0.23s
  Total time      : 9.64s
  [PAGE DETAILS]
     p   1 | text    | OK      | angle=  0° | conf= 98.2 | 0.11s | No rotation needed.
     p   2 | scanned | CHANGED | angle= 90° | conf= 87.5 | 2.34s | Rotation of 90° detected (confidence 87.5).
     ...
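
Because the page detail lines are pipe-delimited, they are easy to post-process. The following sketch parses one such line into a named tuple. It is written against the sample output shown above only; `PageLine` and `parse_page_line` are hypothetical names, and the real log format may differ in details not visible in the sample.

```python
import re
from typing import NamedTuple, Optional


class PageLine(NamedTuple):
    page: int
    kind: str
    status: str
    angle: int
    confidence: float
    seconds: float
    message: str


# Mirrors the layout "p N | kind | STATUS | angle= N° | conf= N | Ns | message".
_PAGE_RE = re.compile(
    r"^\s*p\s*(?P<page>\d+)\s*\|\s*(?P<kind>\w+)\s*\|\s*(?P<status>\w+)\s*\|"
    r"\s*angle=\s*(?P<angle>-?\d+)°\s*\|\s*conf=\s*(?P<conf>[\d.]+)\s*\|"
    r"\s*(?P<secs>[\d.]+)s\s*\|\s*(?P<msg>.*)$"
)


def parse_page_line(line: str) -> Optional[PageLine]:
    """Return the parsed fields, or None if the line is not a page detail line."""
    m = _PAGE_RE.match(line)
    if m is None:
        return None
    return PageLine(
        page=int(m["page"]),
        kind=m["kind"],
        status=m["status"],
        angle=int(m["angle"]),
        confidence=float(m["conf"]),
        seconds=float(m["secs"]),
        message=m["msg"].strip(),
    )
```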

Project Structure

pdforienter/
├── pdforienter/
│   ├── __init__.py          # Public API: run_pipeline
│   ├── config.py            # Tuneable constants (worker count, thresholds)
│   ├── models.py            # Typed data classes (PageResult, FileResult, RunResult)
│   ├── cli.py               # Command-line interface
│   ├── core/
│   │   ├── pipeline.py      # Top-level orchestrator
│   │   ├── processor.py     # Per-file orchestrator (Phase 1 + Phase 2)
│   │   ├── analyzer.py      # Per-page worker (dispatched to subprocess)
│   │   ├── classifier.py    # Text vs scanned page detection
│   │   ├── detector.py      # Orientation detection (text + OSD strategies)
│   │   └── corrector.py     # Single-pass rotation applier
│   ├── logging/
│   │   ├── formatter.py     # RunResult → structured log string
│   │   └── writer.py        # Write log file to disk
│   └── utils/
│       ├── fs.py            # Filesystem helpers
│       └── resources.py     # CPU / RAM telemetry
├── tests/
│   └── test_core.py
├── pyproject.toml
└── README.md

Configuration

All tuneable constants live in pdforienter/config.py.

Constant	Default	Description
`MAX_WORKERS`	`floor(cpu_count × 0.75)`	Worker processes for parallel page analysis
`OSD_CONFIDENCE_THRESHOLD`	`10.0`	Minimum Tesseract OSD confidence to trust a result
`TESSERACT_OSD_PSM`	`0`	Tesseract page segmentation mode (0 = OSD only)
`_RENDER_DPI` (detector.py)	`150`	DPI used when rasterising pages for OSD
`_MIN_CHAR_COUNT` (classifier.py)	`20`	Minimum characters to classify a page as text-based
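
To make the role of OSD_CONFIDENCE_THRESHOLD concrete, here is a minimal sketch that reads the plain-text report Tesseract prints in OSD mode (`--psm 0`) and only trusts the rotation when the orientation confidence reaches the threshold. `trusted_rotation` is a hypothetical helper, and treating a low-confidence result as "no rotation" is an assumption about how the threshold is applied, not a statement of PDFOrienter's internals.

```python
import re

OSD_CONFIDENCE_THRESHOLD = 10.0  # mirrors the documented default


def trusted_rotation(osd_text: str, threshold: float = OSD_CONFIDENCE_THRESHOLD) -> int:
    """Return the rotation from a Tesseract OSD report, or 0 if it is not trustworthy."""
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    if rotate is None or conf is None:
        return 0  # report is incomplete, so leave the page alone
    if float(conf.group(1)) < threshold:
        return 0  # assumption: below-threshold results are ignored
    return int(rotate.group(1)) % 360
```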

How It Works

Phase 1 — Parallel Detection

Each page is dispatched to a subprocess worker via ProcessPoolExecutor. Workers run concurrently up to MAX_WORKERS.
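
The dispatch pattern can be sketched with the standard library alone. In this illustration `analyse_page` is a stand-in worker that always reports "no rotation", and `detect_all` is a hypothetical name; the real per-page analysis lives in the package's own worker module and is not reproduced here.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Iterable, Type


def analyse_page(page_number: int) -> tuple[int, int]:
    """Stand-in worker: returns (page_number, detected_angle)."""
    # The function is defined at module level so it can be pickled and
    # sent to a worker process.
    return page_number, 0


def detect_all(
    page_numbers: Iterable[int],
    max_workers: int,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> list[tuple[int, int]]:
    """Analyse pages concurrently; results come back in input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(analyse_page, page_numbers))
```

`pool.map` preserves input order, so results line up with page numbers without extra bookkeeping. On platforms that spawn rather than fork worker processes, call `detect_all` from under an `if __name__ == "__main__":` guard.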

For each page:

1. Classify: does the page have selectable text?
2. Detect orientation:
   - Text page → analyse character direction vectors (fast, no OCR)
   - Scanned page → rasterise at the render DPI (`_RENDER_DPI`; the Configuration table lists 150 as the default) and run multi-pass Tesseract OSD
3. Return a PageResult with the detected angle, confidence, and timing
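
The classify and text-direction steps above can be illustrated as follows. This is a sketch under stated assumptions, not the package's implementation: `is_text_page` and `dominant_direction` are hypothetical names, whitespace is assumed not to count towards the character threshold, and each character is assumed to come with a (dx, dy) writing-direction vector from whatever PDF library extracts the text.

```python
import math
from collections import Counter
from typing import Iterable

MIN_CHAR_COUNT = 20  # mirrors the documented _MIN_CHAR_COUNT default


def is_text_page(extracted_text: str, min_chars: int = MIN_CHAR_COUNT) -> bool:
    """Treat a page as text-based if it has enough non-whitespace characters."""
    return len("".join(extracted_text.split())) >= min_chars


def dominant_direction(directions: Iterable[tuple[float, float]]) -> int:
    """Majority vote over direction vectors, snapped to 0, 90, 180 or 270 degrees."""
    votes: Counter = Counter()
    for dx, dy in directions:
        angle = math.degrees(math.atan2(dy, dx))
        # Snap to the nearest quarter turn and normalise into 0..270.
        votes[round(angle / 90) % 4 * 90] += 1
    if not votes:
        return 0
    return votes.most_common(1)[0][0]
```

How the dominant direction maps to a corrective rotation depends on the coordinate system of the extraction library, so that step is deliberately left out.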

Phase 2 — Single-Pass Correction

After all pages are analysed, a single write pass produces the corrected PDF. By default PDFOrienter bakes the rotation into each page's content so the output is genuinely upright with /Rotate=0 — it displays correctly in every viewer and tool, including those that ignore the /Rotate page attribute (image converters, some print drivers, OCR front-ends).

Vector text stays selectable after baking, but page-level annotations, links, and form fields are not preserved. If you need those, pass --no-bake (CLI) or bake=False (API) for lossless metadata-only rotation that sets /Rotate instead.

# Default: physically upright output, works everywhere
pdforienter scan.pdf --output ./fixed

# Lossless: keep annotations/forms, rely on viewer honouring /Rotate
pdforienter scan.pdf --output ./fixed --no-bake
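
For the metadata-only mode, the only thing that changes is the page's /Rotate value, which the PDF specification defines as a clockwise angle that must be a multiple of 90. A minimal sketch of that arithmetic, assuming the detected correction is expressed in clockwise degrees and is added to whatever /Rotate the page already carries; `combined_rotate` is a hypothetical helper and not part of PDFOrienter's API.

```python
def combined_rotate(existing: int, correction: int) -> int:
    """Return the new /Rotate value, normalised to 0, 90, 180 or 270."""
    if existing % 90 != 0 or correction % 90 != 0:
        raise ValueError("/Rotate values must be multiples of 90 degrees")
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative (counter-clockwise) corrections normalise correctly.
    return (existing + correction) % 360
```

A page stored with /Rotate=90 that needs a further 270° correction ends up at /Rotate=0.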

Performance

Rough estimates on an 8-core server (6 workers):

Scenario	Estimate
2 000 pages, all text-based	~1–2 minutes
2 000 pages, mixed 50/50	~7–8 minutes
2 000 pages, all scanned	~15–17 minutes

RAM usage is roughly 200–400 MB per Tesseract worker, so 6 workers peak at about 2.5 GB, well within a 16 GB server.

Running Tests

pytest tests/ -v

License

MIT

Release history

0.1.1 (this version): Jun 1, 2026
0.1.0: May 22, 2026
