Structured text extraction framework for digital and scanned PDFs with inline formatting preservation

paradox-pdf

Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.

Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.


Install

pip install paradox-pdf

Requires Python 3.9+. The first run downloads a few HuggingFace models (~500 MB) for the vision pipeline; subsequent runs use the cache.


60-second quick start

import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")

print(doc["total_pages"], "pages")
print(doc["type_summary"])     # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])

That's it. doc is a plain dict — no custom classes, no streaming generators. JSON-serializable as-is.
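As a quick illustration of the "JSON-serializable as-is" claim, the sketch below round-trips a document through the standard json module. The doc literal is a hand-written stand-in shaped like the output schema in this README, not real extractor output.

```python
import json

# Hand-written stand-in for the dict returned by extract(); values are illustrative.
doc = {
    "source": "contract.pdf",
    "total_pages": 2,
    "total_elements": 2,
    "total_images": 0,
    "type_summary": {"TITLE": 1, "PARAGRAPH": 1},
    "elements": [
        {
            "type": "TITLE",
            "text": "**Agreement**",
            "ref": "(p1,l1):(p1,l2)",
            "children": [
                {"type": "PARAGRAPH", "text": "Body text.", "ref": "(p1,l2):(p1,l2)"},
            ],
        },
    ],
}

# No custom encoder is needed because the tree only holds dicts, lists, str and int.
payload = json.dumps(doc, ensure_ascii=False, indent=2)
restored = json.loads(payload)
```

Passing ensure_ascii=False keeps accented or non-Latin text readable in the written file.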


Public API

The package exposes 5 functions and 1 dataclass:

Symbol                                                    Purpose
extract(pdf, **opts) -> dict                              Run the full pipeline, return JSON in memory.
extract_to_file(pdf, output, images_dir, **opts) -> dict  Same as extract but also writes JSON + images to disk.
extract_pages(pdf, pages, **opts) -> dict                 Subset by page number.
extract_text(pdf, **opts) -> str                          Plain-text concatenation only.
extract_tables(pdf, **opts) -> list[dict]                 Flat list of every TABLE element.
PipelineConfig                                            Dataclass to override 30+ thresholds.

All functions accept the same keyword options:

Argument    Type                           Default  Description
pages       Sequence[int] | None           None     1-based page numbers; None = all pages.
no_images   bool                           False    Skip image extraction (faster, no PNGs).
force_mode  "heuristic" | "vision" | None  None     Force a pipeline; None auto-routes per page.
output      str | Path | None              None     If set, also writes JSON here.
images_dir  str | Path | None              tempdir  Where extracted images go.
config      PipelineConfig | None          None     Override pipeline thresholds.

Examples

1. Get the document tree

import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  '*depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])

2. Process only certain pages

doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))

3. Plain text in one call

text = pdx.extract_text("contract.pdf")

4. Extract every table

tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")

    for c in cells:
        p = c["p"]
        if len(p) == 2:                          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")

Cell schema:

{"p": [row, col], "t": "Some cell text"}                       # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}        # merged

5. Persist to disk

doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)

The function still returns the dict.

6. Convert a folder

from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s}  {doc['total_pages']}p  {doc['total_elements']} elements")

7. Custom configuration

from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)

doc = extract("noisy_scan.pdf", config=cfg)

Full reference of the 30+ tunables is in docs/configuration.md.

You can also override any parameter with environment variables prefixed PDF_:

PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
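This README does not describe how the PDF_ overrides are applied internally. The sketch below only illustrates the general idea of mapping prefixed environment variables onto dataclass fields. DemoConfig is a stand-in, not the real PipelineConfig, and its default values are placeholders, except scan_text_threshold=50, which follows the diagram in "How it works".

```python
import os
from dataclasses import dataclass, fields, replace


@dataclass
class DemoConfig:
    # Stand-in for PipelineConfig; field names come from this README,
    # defaults are placeholders chosen for the example.
    render_dpi: int = 200
    scan_text_threshold: int = 50
    yolo_confidence: float = 0.25


def apply_env_overrides(cfg, environ=None, prefix="PDF_"):
    """Return a copy of cfg with fields overridden by PREFIX + FIELD_NAME variables."""
    environ = os.environ if environ is None else environ
    updates = {}
    for f in fields(cfg):
        raw = environ.get(prefix + f.name.upper())
        if raw is not None:
            # Coerce the string using the type of the field's current value.
            updates[f.name] = type(getattr(cfg, f.name))(raw)
    return replace(cfg, **updates)


# A plain dict stands in for os.environ so the example has no side effects.
cfg = apply_env_overrides(
    DemoConfig(),
    {"PDF_RENDER_DPI": "300", "PDF_YOLO_CONFIDENCE": "0.3"},
)
```

Boolean fields would need explicit parsing, because bool("0") is True in Python.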

8. Force a specific pipeline

# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")

9. Just count things

doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}

10. Build markdown from the tree

import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        block = None
        if t in LEVEL and text:
            block = "#" * LEVEL[t] + " " + text                    # heading
        elif t == "PARAGRAPH" and text:
            block = text
        elif t == "TABLE":
            block = f"_<table {n['shape'][0]}x{n['shape'][1]}>_"   # placeholder only
        if block is not None:
            out.extend([block, ""])      # blank line only after blocks actually emitted
        to_markdown(n.get("children", []), out)
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))

Output schema

{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}
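Because elements nest through children, most queries start with a depth-first traversal. The following is a minimal sketch of one; iter_elements and find_by_type are hypothetical helper names, not part of the package, and the doc literal is a trimmed copy of the schema above.

```python
def iter_elements(nodes):
    """Yield every element depth-first, parents before their children."""
    for node in nodes:
        yield node
        yield from iter_elements(node.get("children", []))


def find_by_type(doc, wanted):
    """Collect all elements of one type, however deeply they are nested."""
    return [el for el in iter_elements(doc["elements"]) if el.get("type") == wanted]


# Trimmed copy of the schema example, kept small for the demonstration.
doc = {
    "elements": [
        {"type": "TITLE", "text": "**Annual Report**", "children": [
            {"type": "PARAGRAPH", "text": "..."},
            {"type": "H1", "text": "**1. Financial Summary**", "children": [
                {"type": "TABLE", "shape": [5, 4], "cells": []},
            ]},
        ]},
    ]
}

types = [el["type"] for el in iter_elements(doc["elements"])]
tables = find_by_type(doc, "TABLE")
```

The generator keeps memory use flat on large documents, since no intermediate list of all nodes is built unless the caller asks for one.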

ref field

Every element gets a ref of the form "(pX,lY):(pX,lY)" where:

  • pX = page number (1-based).
  • lY = element index within that page (1-based).
  • The first tuple is the start; the second is the end of the element's last descendant.
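A ref string is easy to take apart with a regular expression. The sketch below assumes the exact "(pX,lY):(pX,lY)" shape shown here; Ref, parse_ref and pages_spanned are hypothetical names introduced for the example.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    start_page: int
    start_index: int
    end_page: int
    end_index: int


_REF_RE = re.compile(r"\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)")


def parse_ref(ref):
    """Split a ref such as "(p1,l3):(p2,l4)" into its four integers."""
    m = _REF_RE.fullmatch(ref.strip())
    if m is None:
        raise ValueError(f"unrecognized ref: {ref!r}")
    return Ref(*map(int, m.groups()))


def pages_spanned(ref):
    """Return the 1-based pages an element and its descendants cover."""
    r = parse_ref(ref)
    return range(r.start_page, r.end_page + 1)


section = parse_ref("(p1,l3):(p2,l4)")
pages = list(pages_spanned("(p1,l3):(p2,l4)"))
```

Since the page numbers are 1-based, the result of pages_spanned can be passed straight to the pages option described under Public API.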

Element types (excerpt)

TITLE, SUBTITLE, H1–H4, PARAGRAPH, TABLE, LIST (with items[]), TOC (with entries[]), IMAGE, SIGNATURE, AMENDMENT_DEL, EXHIBIT, APPENDIX, FOOTER, HEADER, PAGE_NUMBER, plus 50+ more. Full list: pdf_tagger/catalog.py.

Inline marks

Marks are preserved both in marks: [...] (per-element) and inline in the text:

Mark           Inline syntax
BOLD           **bold text**
ITALIC         *italic*
UNDERLINE      ++underlined++
STRIKETHROUGH  ~~deleted~~
SUPERSCRIPT    ^superscript^
MONOSPACE      `code`
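When only the bare words are needed (for search indexing, say), the inline syntax in the table above can be removed with a few regular expressions. This is a rough sketch, not the library's own logic: strip_marks is a hypothetical helper, and it will also strip literal characters that happen to look like marks, such as the asterisks in "2 * 3 * 4".

```python
import re

# Order matters: "**" must be handled before "*", or bold would be
# misread as italic.
_MARK_PATTERNS = [
    re.compile(r"\*\*(.+?)\*\*"),  # BOLD
    re.compile(r"\*(.+?)\*"),      # ITALIC
    re.compile(r"\+\+(.+?)\+\+"),  # UNDERLINE
    re.compile(r"~~(.+?)~~"),      # STRIKETHROUGH
    re.compile(r"\^(.+?)\^"),      # SUPERSCRIPT
    re.compile(r"`(.+?)`"),        # MONOSPACE
]


def strip_marks(text):
    """Remove inline mark delimiters, keeping the text they wrap."""
    for pattern in _MARK_PATTERNS:
        text = pattern.sub(r"\1", text)
    return text


title = strip_marks("**Annual Report**")
mixed = strip_marks("See ++clause 4++ and ~~old~~ *new* text")
```

The per-element marks list is unaffected by this, so formatting information is still available after the text has been flattened.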

CLI

The same package installs a paradox-pdf command:

paradox-pdf contract.pdf                       # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8           # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf

Run paradox-pdf --help for the full set of flags.


How it works

                 ┌─────────────────┐
PDF ─────────────► scan_detector   │  per page (<50 chars → vision)
                 └────────┬────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌──────────────────┐        ┌──────────────────────┐
   │ Heuristic        │        │ Vision               │
   │ (PyMuPDF fonts)  │        │ YOLO + RapidOCR      │
   │                  │        │ + Table Transformer  │
   │                  │        │ + HDBSCAN borderless │
   │                  │        │ + TexTAR (marks)     │
   └────────┬─────────┘        └──────────┬───────────┘
            └─────────────┬───────────────┘
                          ▼
              ┌────────────────────────┐
              │ Section tree builder   │
              │ Post-processing passes │
              └───────────┬────────────┘
                          ▼
                       JSON dict
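The routing step at the top of the diagram can be pictured as a per-page threshold test. This is a simplified sketch under that reading: route_pages is a hypothetical function, the 50-character cutoff comes from the diagram, and the real scan_detector may weigh other signals.

```python
def route_pages(char_counts, threshold=50, force_mode=None):
    """Pick a pipeline for each page from its count of extractable characters.

    char_counts: one integer per page, in page order.
    force_mode:  "heuristic", "vision", or None to decide per page.
    """
    if force_mode is not None:
        if force_mode not in ("heuristic", "vision"):
            raise ValueError(f"unknown force_mode: {force_mode!r}")
        return [force_mode] * len(char_counts)
    # Pages with almost no embedded text are treated as scans.
    return ["vision" if n < threshold else "heuristic" for n in char_counts]


modes = route_pages([1200, 0, 49, 50])
forced = route_pages([1200, 0], force_mode="vision")
```

Deciding per page is what lets a mostly digital PDF with a few scanned inserts avoid OCR on every page.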

For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates are resolved by non-maximum suppression at IoU 0.5, where each candidate is scored as fill_rate + source_bonus − merge_penalty and the highest-scoring one wins. For bordered tables, merged cells are detected by missing inner borders (≥35% pixel coverage threshold); for borderless tables, they are detected by cell-width ratio (>1.6× column pitch).
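The selection among overlapping table candidates can be sketched as a standard greedy non-maximum suppression. The score formula and the 0.5 IoU cutoff come from the paragraph above. The candidate dict layout, the field names, the sample numbers, and the choice to suppress at IoU ≥ 0.5 rather than > 0.5 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score(candidate):
    return candidate["fill_rate"] + candidate["source_bonus"] - candidate["merge_penalty"]


def select_tables(candidates, iou_threshold=0.5):
    """Greedy NMS: visit candidates best-first, drop any that overlap a kept one."""
    kept = []
    for cand in sorted(candidates, key=score, reverse=True):
        if all(iou(cand["bbox"], k["bbox"]) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


# Illustrative candidates: two detectors found the same table, a third found another.
candidates = [
    {"source": "table_transformer", "bbox": (55, 105, 505, 395),
     "fill_rate": 0.8, "source_bonus": 0.1, "merge_penalty": 0.1},
    {"source": "vector", "bbox": (50, 100, 500, 400),
     "fill_rate": 0.9, "source_bonus": 0.2, "merge_penalty": 0.0},
    {"source": "opencv", "bbox": (50, 500, 500, 700),
     "fill_rate": 0.7, "source_bonus": 0.0, "merge_penalty": 0.0},
]
winners = [c["source"] for c in select_tables(candidates)]
```

Here the Table Transformer candidate is dropped because it overlaps the higher-scoring vector-line candidate almost entirely, while the OpenCV candidate survives because it covers a different region.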


Performance notes

  • Digital page: ~0.05 s on CPU.
  • Scanned page: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
  • First run: HuggingFace models are downloaded once (~500 MB total).

If you see multi-minute startup per document with the vision pipeline, set HF_HUB_OFFLINE=1 after the first download — HuggingFace's online metadata revalidation on slow networks is the bottleneck, not the actual inference:

HF_HUB_OFFLINE=1 python my_script.py

Or in code:

import os
os.environ["HF_HUB_OFFLINE"] = "1"   # set before the import below so the HuggingFace libraries see it
import paradox_pdf as pdx

Repository layout

paradox_pdf/         Public Python API (extract, extract_text, …)
pdf_tagger/          Core extraction (font classifier, vision layout, marks)
pdf_grid/            Vector-line table detection
scripts/             CLI implementation
docs/                Configuration reference, API reference, research notes
examples/            Sample PDFs + expected outputs
_dev/                Test suites, fixtures, benchmarks (not shipped in wheel)

License

Proprietary — © CreAI. Contact feliperodriguez@creai.mx for commercial use.

