Structured text extraction framework for digital and scanned PDFs with inline formatting preservation

paradox-pdf · [cpu] or [gpu]

Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.

Two install profiles, one wheel. Pick CPU for portability, GPU for higher accuracy on photographed/distorted documents.


pip install paradox-pdf            # CPU profile (default — works everywhere)
pip install 'paradox-pdf[gpu]'     # GPU profile (PaddleOCR-VL 0.9B, needs CUDA)

Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.


Install

Two profiles. Pick one:

CPU (default — works on any machine)

pip install paradox-pdf

Vision pipeline: YOLO + Table Transformer + RapidOCR (ONNX) + TexTAR. Runs on CPU; uses GPU when PyTorch detects CUDA. Wheel ~60 MB; fully self-contained — bundled TexTAR weights, no extra setup.

GPU (PaddleOCR-VL 0.9B for higher accuracy)

pip install 'paradox-pdf[gpu]'

# paddlepaddle-gpu>=3.0 is not on PyPI (CUDA-version-specific wheels);
# install it from Paddle's index, picking the URL that matches your CUDA:
pip install paddlepaddle-gpu==3.2.1 \
  -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

Adds PaddleOCR-VL — a 0.9B VLM that does layout + OCR + table structure in one pass. Higher accuracy on photographed/distorted documents and complex tables; ~12 GB VRAM at inference.

Which one when

Scenario                                    Profile
Digital PDFs (text already in the PDF)      either; CPU is enough
Cleanly scanned documents                   CPU
Photographed / off-center / curved pages    GPU
Complex multi-level-header tables           GPU
Production with no GPU available            CPU

You can also mix: in the GPU install, pass backend="cpu" to force the classic pipeline for a particular call.

Python 3.9+. First run downloads a few HuggingFace models (~500 MB CPU profile, ~2 GB GPU profile) into the local cache; subsequent runs use the cache.


60-second quick start

import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")

print(doc["total_pages"], "pages")
print(doc["type_summary"])     # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])

That's it. doc is a plain dict — no custom classes, no streaming generators. JSON-serializable as-is.


Public API

The package exposes 5 functions and 1 dataclass:

Symbol                                                     Purpose
extract(pdf, **opts) -> dict                               Run the full pipeline, return JSON in memory.
extract_to_file(pdf, output, images_dir, **opts) -> dict   Same as extract, but also writes JSON + images to disk.
extract_pages(pdf, pages, **opts) -> dict                  Extract a subset of pages by page number.
extract_text(pdf, **opts) -> str                           Plain-text concatenation only.
extract_tables(pdf, **opts) -> list[dict]                  Flat list of every TABLE element.
PipelineConfig                                             Dataclass to override 30+ thresholds.

All functions accept the same keyword options:

Argument     Type                            Default    Description
pages        Sequence[int] | None            None       1-based page numbers; None = all pages.
no_images    bool                            False      Skip image extraction (faster, no PNGs).
force_mode   "heuristic" | "vision" | None   None       Force a pipeline; None auto-routes per page.
backend      "auto" | "cpu" | "gpu"          "auto"     Vision backend: cpu = classic, gpu = PaddleOCR-VL ([gpu] extra); auto picks GPU when available.
output       str | Path | None               None       If set, also writes the JSON here.
images_dir   str | Path | None               tempdir    Where extracted images go.
config       PipelineConfig | None           None       Override pipeline thresholds.

Examples

1. Get the document tree

import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  '*depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])

2. Process only certain pages

doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))

3. Plain text in one call

text = pdx.extract_text("contract.pdf")

4. Extract every table

tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")

    for c in cells:
        p = c["p"]
        if len(p) == 2:                          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")

Cell schema:

{"p": [row, col], "t": "Some cell text"}                       # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}        # merged
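Under that cell schema, the sparse cells list can be materialized into a dense 2-D grid, repeating a merged cell's text across its whole span. This is a sketch against the documented schema; `to_grid` is a hypothetical helper, not part of the package:

```python
def to_grid(shape, cells):
    """Expand the sparse cell list into a dense rows x cols text grid.

    Merged cells ({"p": [row, col, rowspan, colspan]}) have their text
    copied into every position they cover; untouched positions stay "".
    """
    rows, cols = shape
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in cells:
        p = cell["p"]
        r, c = p[0], p[1]
        rowspan, colspan = (p[2], p[3]) if len(p) == 4 else (1, 1)
        for dr in range(rowspan):
            for dc in range(colspan):
                grid[r + dr][c + dc] = cell["t"]
    return grid

grid = to_grid([2, 4], [
    {"p": [0, 0], "t": "Category"},
    {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"},  # merged across 3 columns
    {"p": [1, 0], "t": "Actors"},
])
```

From here, feeding the grid into e.g. a pandas DataFrame or a CSV writer is straightforward.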

5. Persist to disk

doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)

The function still returns the dict.

6. Convert a folder

from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s}  {doc['total_pages']}p  {doc['total_elements']} elements")

7. Custom configuration

from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)

doc = extract("noisy_scan.pdf", config=cfg)

Full reference of the 30+ tunables is in docs/configuration.md.

You can also override any parameter with environment variables prefixed PDF_:

PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
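To illustrate the convention only (this is not the library's actual loader), PDF_-prefixed variables map to lowercase parameter names, with numeric-looking values coerced; `env_overrides` is a hypothetical helper:

```python
def env_overrides(environ, prefix="PDF_"):
    """Collect PREFIX_* entries as lowercase config keys, coercing
    numeric-looking values to int or float and leaving the rest as str."""
    overrides = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw
        overrides[name] = value
    return overrides

cfg = env_overrides({"PDF_RENDER_DPI": "300", "PDF_YOLO_CONFIDENCE": "0.3", "PATH": "/usr/bin"})
```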

8. Force a specific pipeline

# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")

9. Just count things

doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}
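Assuming type_summary counts every element in the tree, nested children included (consistent with total_elements in the schema below, though the exact counting rule is an assumption), you can recompute it yourself as a sanity check:

```python
from collections import Counter

def type_summary(elements):
    """Count element types across the whole tree, children included."""
    counts = Counter()
    stack = list(elements)
    while stack:
        node = stack.pop()
        counts[node["type"]] += 1
        stack.extend(node.get("children", []))
    return dict(counts)

summary = type_summary([
    {"type": "TITLE", "children": [
        {"type": "PARAGRAPH"},
        {"type": "H1", "children": [{"type": "TABLE"}]},
    ]},
])
```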

10. Build markdown from the tree

import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        if t in LEVEL and text:
            out.append("#" * LEVEL[t] + " " + text)
        elif t == "PARAGRAPH":
            out.append(text)
        elif t == "TABLE":
            out.append(f"_<table {n['shape'][0]}x{n['shape'][1]}>_")
        out.append("")
        to_markdown(n.get("children", []), out)
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))

Output schema

{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}

ref field

Every element gets a ref of the form "(pX,lY):(pX,lY)" where:

  • pX = page number (1-based)
  • lY = element index within that page (1-based)
  • The first tuple is the start; the second is the end of the element's last descendant.
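Given that format, a small parser (a hypothetical helper, not in the package) splits a ref into start/end (page, index) pairs:

```python
import re

# Matches "(pX,lY):(pX,lY)" with 1-based page and element indices.
_REF = re.compile(r"\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)")

def parse_ref(ref):
    """Return ((start_page, start_index), (end_page, end_index))."""
    m = _REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"malformed ref: {ref!r}")
    p1, l1, p2, l2 = map(int, m.groups())
    return (p1, l1), (p2, l2)

start, end = parse_ref("(p1,l3):(p2,l4)")
```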

Element types (excerpt)

TITLE, SUBTITLE, H1-H4, PARAGRAPH, TABLE, LIST (with items[]), TOC (with entries[]), IMAGE, SIGNATURE, AMENDMENT_DEL, EXHIBIT, APPENDIX, FOOTER, HEADER, PAGE_NUMBER, plus 50+ more. Full list: pdf_tagger/catalog.py.

Inline marks

Marks are preserved both in marks: [...] (per-element) and inline in the text:

Mark           Inline syntax
BOLD           **bold text**
ITALIC         *italic*
UNDERLINE      ++underlined++
STRIKETHROUGH  ~~deleted~~
SUPERSCRIPT    ^superscript^
MONOSPACE      `code`
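If you need plain text without the inline syntax, a regex-based stripper over the marks above might look like this. This is a sketch; how the library escapes literal delimiter characters inside text is an assumption not covered here:

```python
import re

# Order matters: strip two-character delimiters before single ones,
# so "**bold**" is not half-consumed by the ITALIC pattern.
_MARK_PATTERNS = [
    (re.compile(r"\*\*(.+?)\*\*"), r"\1"),  # BOLD
    (re.compile(r"\+\+(.+?)\+\+"), r"\1"),  # UNDERLINE
    (re.compile(r"~~(.+?)~~"), r"\1"),      # STRIKETHROUGH
    (re.compile(r"\*(.+?)\*"), r"\1"),      # ITALIC
    (re.compile(r"\^(.+?)\^"), r"\1"),      # SUPERSCRIPT
    (re.compile(r"`(.+?)`"), r"\1"),        # MONOSPACE
]

def strip_marks(text):
    """Remove inline mark delimiters, keeping the text inside them."""
    for pattern, repl in _MARK_PATTERNS:
        text = pattern.sub(repl, text)
    return text

plain = strip_marks("**Annual Report**, see *notes*^1^ and ~~old~~ `code`")
```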

CLI

The same package installs a paradox-pdf command:

paradox-pdf contract.pdf                       # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8           # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf

Run paradox-pdf --help for the full set of flags.


How it works

                 ┌─────────────────┐
PDF ─────────────► scan_detector   │  per page (<50 chars → vision)
                 └────────┬────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌──────────────────┐        ┌──────────────────────┐
   │ Heuristic        │        │ Vision               │
   │ (PyMuPDF fonts)  │        │ YOLO + RapidOCR      │
   │                  │        │ + Table Transformer  │
   │                  │        │ + HDBSCAN borderless │
   │                  │        │ + TexTAR (marks)     │
   └────────┬─────────┘        └──────────┬───────────┘
            └─────────────┬───────────────┘
                          ▼
              ┌────────────────────────┐
              │ Section tree builder   │
              │ Post-processing passes │
              └───────────┬────────────┘
                          ▼
                       JSON dict

For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates (IoU ≥ 0.5) are resolved by non-maximum suppression on a quality score of fill_rate + source_bonus − merge_penalty, so the highest-scoring detection wins. Merged cells are detected via missing inner borders (≥35% pixel-coverage threshold) for bordered tables, and via cell-width ratio (>1.6× the column pitch) for borderless ones.
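The selection step reduces to standard greedy NMS over scored boxes. The helper names, box values, and scores below are illustrative, not the library's internals:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(candidates, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring detection among overlapping ones.

    Each candidate: {"box": (x0, y0, x1, y1), "score": float}, where the
    score plays the role of fill_rate + source_bonus - merge_penalty.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(iou(cand["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(cand)
    return kept

winners = nms([
    {"box": (0, 0, 100, 50), "score": 0.9},    # e.g. vector-line detector
    {"box": (2, 1, 101, 52), "score": 0.7},    # overlapping Table Transformer hit
    {"box": (0, 60, 100, 120), "score": 0.8},  # a second, separate table
])
```

The two overlapping boxes collapse to the higher-scoring one; the non-overlapping table survives on its own.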


Performance notes

  • Digital page: ~0.05 s on CPU.
  • Scanned page: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
  • First run: HuggingFace models are downloaded once (~500 MB total).

If you see multi-minute startup per document with the vision pipeline, set HF_HUB_OFFLINE=1 after the first download — HuggingFace's online metadata revalidation on slow networks is the bottleneck, not the actual inference:

HF_HUB_OFFLINE=1 python my_script.py

Or in code:

import os
os.environ["HF_HUB_OFFLINE"] = "1"
import paradox_pdf as pdx

Repository layout

paradox_pdf/         Public Python API (extract, extract_text, …)
pdf_tagger/          Core extraction (font classifier, vision layout, marks)
pdf_grid/            Vector-line table detection
scripts/             CLI implementation
docs/                Configuration reference, API reference, research notes
examples/            Sample PDFs + expected outputs
_dev/                Test suites, fixtures, benchmarks (not shipped in wheel)

License

Proprietary — © CreAI. Contact feliperodriguez@creai.mx for commercial use.

