Page-wise PDF to Markdown extraction with text extraction, OCR, LLM fallback, and progress metadata.

pagewise-pdf-extractor

pagewise-pdf-extractor is a Python package and CLI for converting PDFs into deterministic page-level Markdown files. It routes each page through embedded-text extraction, scanned-page OCR, and optional vision-model fallback, then returns structured results for RAG and document-processing pipelines.

What It Does

Extracts text-native PDF pages with PyMuPDF.
Extracts scanned/image pages with Marker.
Rejects corrupt or structurally unreliable embedded text before provider routing.
Detects and splits two-up scanned pages before OCR.
Falls back to Ollama vision OCR when the configured OCR provider fails or returns unusable output, if fallback is enabled.
Writes one UTF-8 Markdown file per page.
Writes atomic progress.json with provider attempts, status, config hash, source hash, and page metadata.
Exposes a library API for applications and a CLI for operators.
Keeps local processing as the default; remote services are only used if explicitly configured.
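
The atomic progress.json write mentioned above is commonly done by writing to a temporary file in the same directory and renaming it over the target. The following is a minimal sketch of that general technique, not the package's actual implementation; the helper name write_json_atomic and the payload keys in the usage note are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload to path so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A call such as write_json_atomic(Path("output/progress.json"), {"status": "done"}) either leaves the previous file untouched or replaces it completely. The real progress.json schema is not shown here, so the "status" key is only illustrative.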

Status

v0.3.0 is the current documented release. The public API is intended for early downstream use by applications that need page-wise PDF extraction, but the project is still pre-1.0.

Install

From PyPI:

python -m pip install pagewise-pdf-extractor

Pinned Git dependency (as a requirement line, for example in requirements.txt or pyproject.toml):

pagewise-pdf-extractor @ git+https://github.com/ebmurha/pagewise-pdf-extractor.git@v0.3.0

Local development (editable install; replace the path with your own checkout):

python -m pip install -e D:\Developer\Projects\pagewise-pdf-extractor

Runtime dependencies are declared in pyproject.toml. OCR providers also require local binaries:

marker_single for Marker OCR
ollama for Ollama fallback
pdftoppm for rendering pages passed to Ollama

Check the local environment:

pagewise-pdf-extractor --validate-environment
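
For a rough idea of what such a check involves, the sketch below looks up the three binaries listed above on PATH using the standard library. It is an illustration only and makes no claim about how --validate-environment is implemented; REQUIRED_BINARIES and find_missing_binaries are hypothetical names.

```python
import shutil

# Binary names come from the list above; the descriptions are paraphrased.
REQUIRED_BINARIES = {
    "marker_single": "Marker OCR",
    "ollama": "Ollama fallback",
    "pdftoppm": "rendering pages passed to Ollama",
}


def find_missing_binaries(binaries=REQUIRED_BINARIES, which=shutil.which):
    """Return {binary: purpose} for every binary that is not found on PATH."""
    return {
        name: purpose
        for name, purpose in binaries.items()
        if which(name) is None
    }


if __name__ == "__main__":
    for name, purpose in find_missing_binaries().items():
        print(f"missing: {name} (needed for {purpose})")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is sufficient.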

Quickstart

CLI:

pagewise-pdf-extractor document.pdf --output-root output

Python:

from pathlib import Path

from pagewise_pdf_extractor import ExtractionConfig, process_pdf, validate_environment

config = ExtractionConfig(
    text_provider="pymupdf",
    ocr_provider="marker",
    fallback_provider="ollama",
    fallback_enabled=True,
    ollama_model="deepseek-ocr",
)

report = validate_environment(config)
if report.has_fatal_errors:
    raise RuntimeError(report.summary)

result = process_pdf(
    input_pdf=Path("document.pdf"),
    output_root=Path("output"),
    config=config,
)

Public import contract:

from pagewise_pdf_extractor import (
    ExtractionConfig,
    ExtractionResult,
    process_pdf,
    validate_environment,
)

Output

Default layout:

output/
  <input_sha256>/
    page_0001.md
    page_0002.md
    progress.json
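
Downstream code that needs to locate a page file can rebuild these paths itself. The sketch below assumes that <input_sha256> is the hexadecimal SHA-256 digest of the raw PDF bytes, which the layout suggests but does not state; sha256_of_file and page_path are hypothetical helpers, not part of the package API.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def page_path(output_root: Path, input_sha256: str, page_number: int) -> Path:
    """Build <output_root>/<input_sha256>/page_NNNN.md for a 1-based page number."""
    return output_root / input_sha256 / f"page_{page_number:04d}.md"
```

For example, page_path(Path("output"), sha256_of_file(Path("document.pdf")), 2) points at the second page's Markdown file under the assumption stated above.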

Successful page:

# Page N

<provider markdown content>

Failed page:

# Page N

OCR FAILED

Error: <error_message>
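
The two page templates above can be expressed as one small function. This is a sketch based only on the templates shown; render_page_markdown is a hypothetical name, and the exact whitespace the package writes may differ.

```python
from typing import Optional


def render_page_markdown(
    page_number: int,
    content: Optional[str] = None,
    error: Optional[str] = None,
) -> str:
    """Render a page file body: provider content on success, a failure note otherwise."""
    if content is not None:
        body = content.strip()
    else:
        # Mirrors the "Failed page" template; the default message is an assumption.
        body = f"OCR FAILED\n\nError: {error or 'unknown error'}"
    return f"# Page {page_number}\n\n{body}\n"
```

Keeping a file for failed pages means the page numbering on disk stays contiguous, so consumers can tell a failed page from a missing one.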

Provider Routing

Default page-level routing:

1. Try embedded text extraction with PyMuPDF.
2. Validate embedded text for corrupt glyphs, abnormal spacing, and table-heavy visual layout.
3. Accept embedded text when it meets configured length and structural quality thresholds.
4. Render Marker input at 350 DPI and split confidently detected two-up pages into logical pages.
5. Use Marker OCR when embedded text is absent, low quality, or force_ocr=True.
6. Use Ollama fallback when Marker fails or returns unusable output and fallback is enabled.
7. Write failure Markdown if all configured providers fail.
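
The routing order above can be sketched as a provider chain with a quality gate. The code below is a simplified illustration under stated assumptions, not the package's routing code: providers are plain callables, the quality check only looks at length and Unicode replacement characters, and the names PageOutcome, looks_usable, and route_page, along with the threshold values, are all hypothetical. Rendering and two-up splitting are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A provider takes a page number and returns Markdown; it raises on failure.
Provider = Callable[[int], str]


@dataclass
class PageOutcome:
    provider: Optional[str]  # name of the provider that succeeded, or None
    text: str
    attempts: List[str] = field(default_factory=list)


def looks_usable(text: str, min_chars: int = 50, max_bad_ratio: float = 0.05) -> bool:
    """Crude quality gate: enough text and few U+FFFD replacement characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    return stripped.count("\ufffd") / len(stripped) <= max_bad_ratio


def route_page(
    page: int,
    extract_text: Provider,
    ocr: Provider,
    fallback: Optional[Provider] = None,
    force_ocr: bool = False,
) -> PageOutcome:
    """Try embedded text, then OCR, then the optional fallback, in that order."""
    chain = []
    if not force_ocr:
        chain.append(("text", extract_text))
    chain.append(("ocr", ocr))
    if fallback is not None:
        chain.append(("fallback", fallback))

    attempts: List[str] = []
    for name, provider in chain:
        attempts.append(name)
        try:
            text = provider(page)
        except Exception:
            continue  # provider failed; move on to the next one
        if looks_usable(text):
            return PageOutcome(name, text, attempts)
    # Every configured provider failed or returned unusable output.
    return PageOutcome(None, "", attempts)
```

A caller would write failure Markdown when the returned provider is None, and could record the attempts list as the per-page provider history.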

Documentation

Tests

python -m unittest discover -s tests -p "test_*.py"
python -c "from pagewise_pdf_extractor import ExtractionConfig, process_pdf, validate_environment; print('ok')"
pagewise-pdf-extractor --help

Future Goals

Future work aims to preserve the public API, keep provider behavior explicit, and add new providers or extraction-quality improvements behind documented configuration.

Release history

0.3.0 (Jun 8, 2026), this version
0.2.0 (May 10, 2026)
0.1.1 (May 6, 2026)

Download files

Source Distribution

pagewise_pdf_extractor-0.3.0.tar.gz (30.6 kB)

Uploaded Jun 8, 2026 Source

Built Distribution

pagewise_pdf_extractor-0.3.0-py3-none-any.whl (31.5 kB)

Uploaded Jun 8, 2026 Python 3

File details

Details for the file pagewise_pdf_extractor-0.3.0.tar.gz.

File metadata

Download URL: pagewise_pdf_extractor-0.3.0.tar.gz
Upload date: Jun 8, 2026
Size: 30.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pagewise_pdf_extractor-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`1e2b26e4213e53c5b9ed2bd6565fafa250f396c3e5cc5c1411d5739c146b1e5c`
MD5	`90b54e4f6a03ac03120d8949116a1576`
BLAKE2b-256	`74a95eaafc708875e5c29a4dfbb019ac2133d2f88fcf399f0c90332fa62c1633`

File details

Details for the file pagewise_pdf_extractor-0.3.0-py3-none-any.whl.

File metadata

Download URL: pagewise_pdf_extractor-0.3.0-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 31.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pagewise_pdf_extractor-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b972d4144372a12b12c110dce2e84c21fe88cf8f629c0bc8654788955629bdf9`
MD5	`e79ab8263b8125368dbdd3d6a888ac88`
BLAKE2b-256	`b474ecdf8b731736d45e8c5c0bb17daa584ebfc9ea4c7c44ba7da1ffd80baf76`
