paradox-pdf
Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.
Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.
Install
Two profiles. Pick one:
CPU (default — works on any machine)
```
pip install paradox-pdf
```
Vision pipeline: YOLO + Table Transformer + RapidOCR (ONNX) + TexTAR. Runs on CPU; uses GPU when PyTorch detects CUDA. Wheel ~60 MB; fully self-contained — bundled TexTAR weights, no extra setup.
GPU (PaddleOCR-VL 0.9B for higher accuracy)
```
pip install 'paradox-pdf[gpu]'

# paddlepaddle-gpu>=3.0 is not on PyPI (CUDA-version-specific wheels);
# install it from Paddle's index, picking the URL that matches your CUDA:
pip install paddlepaddle-gpu==3.2.1 \
    -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```
Adds PaddleOCR-VL — a 0.9B VLM that does layout + OCR + table structure in one pass. Higher accuracy on photographed/distorted documents and complex tables; ~12 GB VRAM at inference.
Which one when
| Scenario | Profile |
|---|---|
| Digital PDFs (text already in PDF) | either; CPU is enough |
| Cleanly-scanned documents | CPU |
| Photographed / off-center / curved pages | GPU |
| Complex multi-level-header tables | GPU |
| Production with no GPU available | CPU |
You can also mix: in the GPU install, pass backend="cpu" to force the classic pipeline for a particular call.
Python 3.9+. First run downloads a few HuggingFace models (~500 MB CPU profile, ~2 GB GPU profile) into the local cache; subsequent runs use the cache.
60-second quick start
```python
import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")
print(doc["total_pages"], "pages")
print(doc["type_summary"])   # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])
```
That's it. doc is a plain dict — no custom classes, no streaming generators. JSON-serializable as-is.
Public API
The package exposes 5 functions and 1 dataclass:
| Symbol | Purpose |
|---|---|
| `extract(pdf, **opts) -> dict` | Run the full pipeline, return JSON in memory. |
| `extract_to_file(pdf, output, images_dir, **opts) -> dict` | Same as `extract`, but also writes JSON + images to disk. |
| `extract_pages(pdf, pages, **opts) -> dict` | Subset by page number. |
| `extract_text(pdf, **opts) -> str` | Plain-text concatenation only. |
| `extract_tables(pdf, **opts) -> list[dict]` | Flat list of every TABLE element. |
| `PipelineConfig` | Dataclass to override 30+ thresholds. |
All functions accept the same keyword options:
| Argument | Type | Default | Description |
|---|---|---|---|
| `pages` | `Sequence[int] \| None` | `None` | 1-based page numbers; `None` = all pages. |
| `no_images` | `bool` | `False` | Skip image extraction (faster, no PNGs). |
| `force_mode` | `"heuristic" \| "vision" \| None` | `None` | Force a pipeline; `None` auto-routes per page. |
| `backend` | `"auto" \| "cpu" \| "gpu"` | `"auto"` | Vision backend: `cpu` = classic, `gpu` = PaddleOCR-VL (`[gpu]` extra). `auto` picks GPU when available. |
| `output` | `str \| Path \| None` | `None` | If set, also writes JSON here. |
| `images_dir` | `str \| Path \| None` | tempdir | Where extracted images go. |
| `config` | `PipelineConfig \| None` | `None` | Override pipeline thresholds. |
Examples
1. Get the document tree
```python
import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  ' * depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])
```
2. Process only certain pages
```python
doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))
```
3. Plain text in one call
```python
text = pdx.extract_text("contract.pdf")
```
4. Extract every table
```python
tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")
    for c in cells:
        p = c["p"]
        if len(p) == 2:          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")
```
Cell schema:
```python
{"p": [row, col], "t": "Some cell text"}                 # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}  # merged
```
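If you prefer a dense 2-D grid over the sparse cell list, the two shapes above are enough to expand it yourself. A minimal sketch — `cells_to_grid` is our helper, not part of the package; merged cells repeat their text into every position they span:

```python
def cells_to_grid(cells, shape):
    """Expand the sparse cell list into a dense rows x cols grid of strings."""
    rows, cols = shape
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in cells:
        p = cell["p"]
        r, c = p[0], p[1]
        # Simple cells span 1x1; merged cells carry explicit spans.
        rowspan, colspan = (p[2], p[3]) if len(p) == 4 else (1, 1)
        for dr in range(rowspan):
            for dc in range(colspan):
                grid[r + dr][c + dc] = cell["t"]
    return grid

cells = [
    {"p": [0, 0], "t": "Category"},
    {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"},
]
grid = cells_to_grid(cells, shape=[1, 4])
# grid[0] == ["Category", "Studio Minimum Rates",
#             "Studio Minimum Rates", "Studio Minimum Rates"]
```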
5. Persist to disk
```python
doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)
```
The function still returns the dict.
6. Convert a folder
```python
from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s} {doc['total_pages']}p {doc['total_elements']} elements")
```
7. Custom configuration
```python
from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for the vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)
doc = extract("noisy_scan.pdf", config=cfg)
```
Full reference of the 30+ tunables is in docs/configuration.md.
You can also override any parameter with environment variables prefixed PDF_:
```
PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
```
8. Force a specific pipeline
```python
# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")
```
9. Just count things
```python
doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}
```
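The same numbers can be recomputed from the element tree itself, e.g. after filtering it. A sketch — `count_types` is our helper, not part of the package, and it assumes the summary counts nested children as well as top-level nodes:

```python
from collections import Counter

def count_types(nodes, counter=None):
    """Recursively tally element types over a (sub)tree of elements."""
    counter = counter if counter is not None else Counter()
    for n in nodes:
        counter[n["type"]] += 1
        count_types(n.get("children", []), counter)
    return counter

# A tiny hand-built tree in the documented element shape
elements = [
    {"type": "TITLE", "children": [
        {"type": "PARAGRAPH"},
        {"type": "H1", "children": [{"type": "TABLE"}]},
    ]},
]
summary = dict(count_types(elements))
# {'TITLE': 1, 'PARAGRAPH': 1, 'H1': 1, 'TABLE': 1}
```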
10. Build markdown from the tree
```python
import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        if t in LEVEL and text:
            out.append("#" * LEVEL[t] + " " + text)
        elif t == "PARAGRAPH":
            out.append(text)
        elif t == "TABLE":
            out.append(f"_<table {n['shape'][0]}x{n['shape'][1]}>_")
        out.append("")
        to_markdown(n.get("children", []), out)
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))
```
Output schema
```json
{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}
```
ref field
Every element gets a ref of the form "(pX,lY):(pX,lY)" where:
- pX = page number (1-based); lY = element index within that page (1-based).
- The first tuple is the start; the second is the end of the element's last descendant.
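Splitting a ref back into coordinates takes one regex. A minimal sketch (`parse_ref` is our helper, not part of the package):

```python
import re

# "(p1,l3):(p2,l4)" -> start tuple, end tuple
_REF = re.compile(r"\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)")

def parse_ref(ref: str):
    """Return ((start_page, start_elem), (end_page, end_elem)), all 1-based."""
    m = _REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"malformed ref: {ref!r}")
    p1, l1, p2, l2 = map(int, m.groups())
    return (p1, l1), (p2, l2)

start, end = parse_ref("(p1,l3):(p2,l4)")
# start == (1, 3), end == (2, 4)
```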
Element types (excerpt)
TITLE, SUBTITLE, H1–H4, PARAGRAPH, TABLE, LIST (with items[]), TOC (with entries[]), IMAGE, SIGNATURE, AMENDMENT_DEL, EXHIBIT, APPENDIX, FOOTER, HEADER, PAGE_NUMBER, plus 50+ more. Full list: pdf_tagger/catalog.py.
Inline marks
Marks are preserved both in marks: [...] (per-element) and inline in the text:
| Mark | Inline syntax |
|---|---|
| BOLD | **bold text** |
| ITALIC | *italic* |
| UNDERLINE | ++underlined++ |
| STRIKETHROUGH | ~~deleted~~ |
| SUPERSCRIPT | ^superscript^ |
| MONOSPACE | `code` |
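When you want plain text from already-extracted elements, the inline syntax above can be removed with a few substitutions. A sketch — `strip_marks` is a hypothetical helper, not part of the package; note bold must be stripped before italic so `**` is not consumed as two `*`:

```python
import re

# Order matters: ** before *, so bold is not mistaken for two italics.
_MARKS = [
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),  # BOLD
    (re.compile(r"\*(.+?)\*"), r"\1"),      # ITALIC
    (re.compile(r"\+\+(.+?)\+\+"), r"\1"),  # UNDERLINE
    (re.compile(r"~~(.+?)~~"), r"\1"),      # STRIKETHROUGH
    (re.compile(r"\^(.+?)\^"), r"\1"),      # SUPERSCRIPT
    (re.compile(r"`(.+?)`"), r"\1"),        # MONOSPACE
]

def strip_marks(text: str) -> str:
    """Remove the inline mark syntax, keeping only the marked text."""
    for pattern, repl in _MARKS:
        text = pattern.sub(repl, text)
    return text

plain = strip_marks("**Annual Report** for *Q4* ++2025++")
# "Annual Report for Q4 2025"
```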
CLI
The same package installs a paradox-pdf command:
```
paradox-pdf contract.pdf               # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8   # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf
```
Run paradox-pdf --help for the full set of flags.
How it works
```
                    ┌─────────────────┐
PDF ──────────────► │  scan_detector  │  per page (<50 chars → vision)
                    └────────┬────────┘
               ┌─────────────┴─────────────┐
               ▼                           ▼
     ┌──────────────────┐      ┌──────────────────────┐
     │    Heuristic     │      │        Vision        │
     │ (PyMuPDF fonts)  │      │   YOLO + RapidOCR    │
     │                  │      │  + Table Transformer │
     │                  │      │ + HDBSCAN borderless │
     │                  │      │   + TexTAR (marks)   │
     └────────┬─────────┘      └──────────┬───────────┘
              └─────────────┬─────────────┘
                            ▼
               ┌────────────────────────┐
               │  Section tree builder  │
               │ Post-processing passes │
               └───────────┬────────────┘
                           ▼
                       JSON dict
```
For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates are deduplicated by NMS at IoU 0.5, each scored as fill_rate + source_bonus − merge_penalty, and the highest-scoring candidate wins. Merged cells are detected by missing inner borders (≥35% pixel-coverage threshold) for bordered tables, and by cell-width ratio (>1.6× column pitch) for borderless ones.
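That selection step can be pictured as greedy NMS over scored candidates. A minimal sketch, under stated assumptions: the helper names and candidate fields below are illustrative, not the library's internals.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pick_tables(candidates, iou_thr=0.5):
    """Greedy NMS: keep the best-scoring candidate in each overlap group."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    kept = []
    for c in ranked:
        if all(iou(c["box"], k["box"]) < iou_thr for k in kept):
            kept.append(c)
    return kept

# Two detectors found (almost) the same table region
candidates = [
    {"source": "vector", "box": (0, 0, 100, 50),
     "fill_rate": 0.92, "source_bonus": 0.10, "merge_penalty": 0.0},
    {"source": "tatr", "box": (2, 1, 101, 52),
     "fill_rate": 0.75, "source_bonus": 0.00, "merge_penalty": 0.1},
]
for c in candidates:
    c["score"] = c["fill_rate"] + c["source_bonus"] - c["merge_penalty"]

winners = pick_tables(candidates)
# the overlapping pair collapses to the higher-scoring vector-line detection
```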
Performance notes
- Digital page: ~0.05 s on CPU.
- Scanned page: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
- First run: HuggingFace models are downloaded once (~500 MB total).
If you see multi-minute startup per document with the vision pipeline, set HF_HUB_OFFLINE=1 after the first download — HuggingFace's online metadata revalidation on slow networks is the bottleneck, not the actual inference:
```
HF_HUB_OFFLINE=1 python my_script.py
```
Or in code:
```python
import os
os.environ["HF_HUB_OFFLINE"] = "1"

import paradox_pdf as pdx
```
Repository layout
```
paradox_pdf/   Public Python API (extract, extract_text, …)
pdf_tagger/    Core extraction (font classifier, vision layout, marks)
pdf_grid/      Vector-line table detection
scripts/       CLI implementation
docs/          Configuration reference, API reference, research notes
examples/      Sample PDFs + expected outputs
_dev/          Test suites, fixtures, benchmarks (not shipped in wheel)
```
License
Proprietary — © CreAI. Contact feliperodriguez@creai.mx for commercial use.