Page-wise PDF to Markdown extraction with text extraction, OCR, LLM fallback, and progress metadata.
pagewise-pdf-extractor
pagewise-pdf-extractor is a Python package and CLI for converting PDFs into deterministic page-level Markdown files. It routes each page through embedded-text extraction, scanned-page OCR, and optional vision-model fallback, then returns structured results for RAG and document-processing pipelines.
What It Does
- Extracts text-native PDF pages with PyMuPDF.
- Extracts scanned/image pages with Marker.
- Falls back to Ollama vision OCR when the configured OCR provider fails.
- Writes one UTF-8 Markdown file per page.
- Writes atomic progress.json with provider attempts, status, config hash, source hash, and page metadata.
- Exposes a library API for applications and a CLI for operators.
- Keeps local processing as the default; remote services are only used if explicitly configured.
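The atomic write mentioned in the list above can be illustrated with a temp-file-and-rename pattern. This is a minimal sketch, not the package's implementation: `write_json_atomically` is a hypothetical helper, and the payload keys used in the usage note are illustrative only.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload to path so readers never observe a half-written file.

    Hypothetical helper for illustration; not part of pagewise_pdf_extractor.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the final rename stays
    # on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the new file into place, overwriting any old one.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Calling `write_json_atomically(Path("output/progress.json"), {"status": "running"})` after each page would leave either the previous or the new file on disk, never a truncated one.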
Status
v0.1.1 is the current public release. The public API is intended for early downstream use by applications that need page-wise PDF extraction, but the project is still pre-1.0.
Install
From PyPI:
python -m pip install pagewise-pdf-extractor
Pinned Git dependency:
pagewise-pdf-extractor @ git+https://github.com/ebmurha/pagewise-pdf-extractor.git@v0.1.1
Local development:
python -m pip install -e D:\Developer\Projects\pagewise-pdf-extractor
Runtime dependencies are declared in pyproject.toml. OCR providers also require local binaries:
- marker_single for Marker OCR
- ollama for Ollama fallback
- pdftoppm for rendering pages passed to Ollama
Check the local environment:
pagewise-pdf-extractor --validate-environment
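For a quick pre-check outside the CLI, the three binaries named above can be looked up on PATH with the standard library. This sketch is an assumption about what a useful check looks like, not a description of what `--validate-environment` does internally; `find_missing_binaries` is a hypothetical name.

```python
import shutil

# Binary names come from the README; the descriptions are informational only.
REQUIRED_BINARIES = {
    "marker_single": "Marker OCR",
    "ollama": "Ollama fallback",
    "pdftoppm": "rendering pages passed to Ollama",
}


def find_missing_binaries(binaries=REQUIRED_BINARIES, which=shutil.which):
    """Return the names of binaries that cannot be found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment.
    """
    return [name for name in binaries if which(name) is None]


if __name__ == "__main__":
    for name in find_missing_binaries():
        print(f"missing: {name} ({REQUIRED_BINARIES[name]})")
```

Finding a binary on PATH does not confirm that it runs or that the Ollama model is pulled, so the package's own validation remains the authoritative check.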
Quickstart
CLI:
pagewise-pdf-extractor document.pdf --output-root output
Python:
from pathlib import Path
from pagewise_pdf_extractor import ExtractionConfig, process_pdf, validate_environment
config = ExtractionConfig(
    text_provider="pymupdf",
    ocr_provider="marker",
    fallback_provider="ollama",
    fallback_enabled=True,
    ollama_model="deepseek-ocr",
)
report = validate_environment(config)
if report.has_fatal_errors:
    raise RuntimeError(report.summary)
result = process_pdf(
    input_pdf=Path("document.pdf"),
    output_root=Path("output"),
    config=config,
)
Public import contract:
from pagewise_pdf_extractor import (
    ExtractionConfig,
    ExtractionResult,
    process_pdf,
    validate_environment,
)
Output
Default layout:
output/
  <input_sha256>/
    page_0001.md
    page_0002.md
    progress.json
Successful page:
# Page N
<provider markdown content>
Failed page:
# Page N
OCR FAILED
Error: <error_message>
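Because failed pages follow a fixed shape, a consumer can screen them out before indexing. This is a hedged sketch based only on the format shown above: `parse_failure` is a hypothetical helper, and progress.json is likely the more reliable source of page status.

```python
from typing import Optional


def parse_failure(markdown: str) -> Optional[str]:
    """Return the error message if the page is a failure page, else None.

    Blank lines are ignored, so the check works whether or not the file
    separates the heading, marker, and error line with empty lines.
    """
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    # A failure page has the "OCR FAILED" marker right after the heading.
    if len(lines) < 2 or lines[1] != "OCR FAILED":
        return None
    for line in lines[2:]:
        if line.startswith("Error:"):
            return line[len("Error:"):].strip()
    return ""  # marker present but no error line


if __name__ == "__main__":
    print(parse_failure("# Page 3\n\nOCR FAILED\n\nError: provider timed out\n"))
```

A successful page whose extracted content happens to begin with the line "OCR FAILED" would be misclassified, which is why this is a screening aid and not a status source.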
Provider Routing
Default page-level routing:
- Try embedded text extraction with PyMuPDF.
- Accept embedded text when it meets configured quality thresholds.
- Use Marker OCR when embedded text is absent, low quality, or force_ocr=True.
- Use Ollama fallback when Marker fails or returns unusable output and fallback is enabled.
- Write failure Markdown if all configured providers fail.
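The routing order above can be sketched as a small function over pluggable providers. Everything here is illustrative: `route_page`, the provider callables, the 20-character quality threshold, and the choice to skip text extraction entirely when `force_ocr` is set are assumptions, not the package's actual logic or defaults.

```python
from typing import Callable, Optional

# A provider maps a page number to Markdown text, or None on failure.
Provider = Callable[[int], Optional[str]]


def route_page(
    page_number: int,
    extract_text: Provider,
    run_ocr: Provider,
    run_fallback: Optional[Provider] = None,
    is_good_enough: Callable[[str], bool] = lambda text: len(text.strip()) >= 20,
    force_ocr: bool = False,
) -> tuple[str, str]:
    """Return (route, markdown) for one page, trying providers in order."""
    heading = f"# Page {page_number}\n\n"
    if not force_ocr:
        text = extract_text(page_number)
        # Accept embedded text only when it passes the quality check.
        if text is not None and is_good_enough(text):
            return "text", heading + text
    ocr = run_ocr(page_number)
    if ocr and ocr.strip():
        return "ocr", heading + ocr
    # The fallback runs only when one is configured.
    if run_fallback is not None:
        fallback = run_fallback(page_number)
        if fallback and fallback.strip():
            return "fallback", heading + fallback
    return "failed", heading + "OCR FAILED\n\nError: all configured providers failed"
```

Passing providers as callables keeps the order of attempts explicit and makes each branch easy to exercise with stubs.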
Documentation
- API reference
- Configuration reference
- Environment and provider setup
- Integration guide
- Packaging and naming guide
- Releases and versioning
- Changelog
- Release notes
- Contributing
- Security
Tests
python -m unittest discover -s tests -p "test_*.py"
python -c "from pagewise_pdf_extractor import ExtractionConfig, process_pdf, validate_environment; print('ok')"
pagewise-pdf-extractor --help
Future Goals
Future work should preserve the public API, keep provider behavior explicit, and add new providers or extraction quality improvements behind documented configuration.
File details
Details for the file pagewise_pdf_extractor-0.1.1.tar.gz.
File metadata
- Download URL: pagewise_pdf_extractor-0.1.1.tar.gz
- Upload date:
- Size: 22.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b64f4579e75d2b6e99dbff4de65865f20ccf11f588369dcb9c2d4c2d12abe4dd |
| MD5 | 868bb9eae702782e32771a2c06716f7e |
| BLAKE2b-256 | 4afc919fa9265cad597d41002df9a02776de6c5201f1aa6bfbfaf24388ac2906 |
File details
Details for the file pagewise_pdf_extractor-0.1.1-py3-none-any.whl.
File metadata
- Download URL: pagewise_pdf_extractor-0.1.1-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94f7e9b2c1c112783002d42b4ec2fc493b689e697a72914b9db3d93f85681859 |
| MD5 | 26c3a55c0c1942782c1e6b74bf1498b4 |
| BLAKE2b-256 | c908ff99457eccd4b41a0e7d6119eb310a56cd9216dba25287ecb3927b61c6ec |