
Structured text extraction framework for digital and scanned PDFs with inline formatting preservation


paradox-pdf

Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.


Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.


Install

pip install paradox-pdf

Requires Python 3.9+. The first run downloads a few HuggingFace models (~500 MB) for the vision pipeline; subsequent runs use the cache.


60-second quick start

import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")

print(doc["total_pages"], "pages")
print(doc["type_summary"])     # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])

That's it. doc is a plain dict — no custom classes, no streaming generators. JSON-serializable as-is.


Public API

The package exposes 5 functions and 1 dataclass:

Symbol                                                    Purpose
extract(pdf, **opts) -> dict                              Run the full pipeline, return JSON in memory.
extract_to_file(pdf, output, images_dir, **opts) -> dict  Same as extract but also writes JSON + images to disk.
extract_pages(pdf, pages, **opts) -> dict                 Subset by page number.
extract_text(pdf, **opts) -> str                          Plain-text concatenation only.
extract_tables(pdf, **opts) -> list[dict]                 Flat list of every TABLE element.
PipelineConfig                                            Dataclass to override 30+ thresholds.

All functions accept the same keyword options:

Argument    Type                           Default  Description
pages       Sequence[int] | None           None     1-based page numbers; None = all pages.
no_images   bool                           False    Skip image extraction (faster, no PNGs).
force_mode  "heuristic" | "vision" | None  None     Force a pipeline; None auto-routes per page.
output      str | Path | None              None     If set, also writes JSON here.
images_dir  str | Path | None              tempdir  Where extracted images go.
config      PipelineConfig | None          None     Override pipeline thresholds.

Examples

1. Get the document tree

import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  '*depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])

2. Process only certain pages

doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))

3. Plain text in one call

text = pdx.extract_text("contract.pdf")

4. Extract every table

tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")

    for c in cells:
        p = c["p"]
        if len(p) == 2:                          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")

Cell schema:

{"p": [row, col], "t": "Some cell text"}                       # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}        # merged

5. Persist to disk

doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)

The function still returns the dict.

6. Convert a folder

from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s}  {doc['total_pages']}p  {doc['total_elements']} elements")

7. Custom configuration

from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)

doc = extract("noisy_scan.pdf", config=cfg)

Full reference of the 30+ tunables is in docs/configuration.md.

You can also override any parameter with environment variables prefixed PDF_:

PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
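
How the PDF_ prefix maps onto field names is not spelled out here. The sketch below assumes the variable name is PDF_ plus the upper-cased field name, which matches the two variables shown above. DemoConfig and config_from_env are hypothetical stand-ins written for illustration, and the default values are placeholders rather than the library's real defaults.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    # Hypothetical stand-in for PipelineConfig; defaults are placeholders,
    # not the library's real values.
    render_dpi: int = 200
    yolo_confidence: float = 0.25


def config_from_env(cls, environ=None, prefix="PDF_"):
    """Build a config, overriding fields from PREFIX + FIELD_NAME variables."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(cls):
        raw = environ.get(prefix + field.name.upper())
        if raw is not None:
            # Coerce the string to the type of the field's default value.
            overrides[field.name] = type(field.default)(raw)
    return cls(**overrides)


# A plain dict stands in for os.environ so the example is deterministic.
cfg = config_from_env(
    DemoConfig,
    {"PDF_RENDER_DPI": "300", "PDF_YOLO_CONFIDENCE": "0.3"},
)
print(cfg)
```

Coercing through the default's type works for int and float fields; boolean fields would need explicit parsing, since bool("0") is True.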

8. Force a specific pipeline

# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")

9. Just count things

doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}
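
The same kind of count can be recomputed for any subtree by walking children. This is a minimal sketch on a hand-written sample; count_types is a hypothetical helper, and whether the library's own type_summary counts nested elements in exactly this way is an assumption.

```python
from collections import Counter


def count_types(nodes):
    """Count element types in a tree of dicts with optional 'children'."""
    counts = Counter()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        counts[node["type"]] += 1
        stack.extend(node.get("children", []))
    return counts


# Hand-written sample shaped like doc["elements"].
sample_elements = [
    {"type": "TITLE", "text": "**Annual Report**", "children": [
        {"type": "PARAGRAPH", "text": "Intro"},
        {"type": "H1", "text": "**1. Financial Summary**", "children": [
            {"type": "TABLE", "shape": [5, 4], "cells": []},
            {"type": "PARAGRAPH", "text": "Notes"},
        ]},
    ]},
]

summary = count_types(sample_elements)
print(dict(summary))
```

Passing a single heading's children instead of the whole tree gives a per-section summary.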

10. Build markdown from the tree

import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        if t in LEVEL and text:
            out.append("#" * LEVEL[t] + " " + text)
        elif t == "PARAGRAPH":
            out.append(text)
        elif t == "TABLE":
            out.append(f"_<table {n['shape'][0]}x{n['shape'][1]}>_")
        out.append("")
        to_markdown(n.get("children", []), out)
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))

Output schema

{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}

ref field

Every element gets a ref of the form "(pX,lY):(pX,lY)", where:

  • pX = page number (1-based).
  • lY = element index within that page (1-based).
  • The first tuple is the element's start; the second is the end of its last descendant, so a ref can span several pages (for example, "(p1,l1):(p12,l8)").
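
A ref string in this form can be split into integer tuples with a regular expression. The sketch below is illustrative only: parse_ref and pages_spanned are hypothetical helpers, not functions exported by paradox_pdf, and the pattern assumes the format has no internal whitespace.

```python
import re

# Matches "(p<page>,l<index>):(p<page>,l<index>)".
REF_RE = re.compile(r"^\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)$")


def parse_ref(ref):
    """Return ((start_page, start_index), (end_page, end_index))."""
    m = REF_RE.match(ref)
    if m is None:
        raise ValueError(f"unrecognized ref: {ref!r}")
    p1, l1, p2, l2 = map(int, m.groups())
    return (p1, l1), (p2, l2)


def pages_spanned(ref):
    """Return the inclusive range of 1-based pages an element covers."""
    (start_page, _), (end_page, _) = parse_ref(ref)
    return range(start_page, end_page + 1)


start, end = parse_ref("(p1,l3):(p2,l4)")
print(start, end)
print(list(pages_spanned("(p1,l3):(p2,l4)")))
```

This makes it straightforward to filter elements by page, for example keeping only nodes whose span includes a given page number.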

Element types (excerpt)

TITLE, SUBTITLE, H1–H4, PARAGRAPH, TABLE, LIST (with items[]), TOC (with entries[]), IMAGE, SIGNATURE, AMENDMENT_DEL, EXHIBIT, APPENDIX, FOOTER, HEADER, PAGE_NUMBER, plus 50+ more. Full list: pdf_tagger/catalog.py.

Inline marks

Marks are preserved both in marks: [...] (per-element) and inline in the text:

Mark           Inline syntax
BOLD           **bold text**
ITALIC         *italic*
UNDERLINE      ++underlined++
STRIKETHROUGH  ~~deleted~~
SUPERSCRIPT    ^superscript^
MONOSPACE      `code`
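
When only the bare text is needed (for search indexing, say), the inline syntax can be stripped with a few regular expressions. This is a naive sketch based solely on the six marks listed above; strip_marks is a hypothetical helper, and it will also mangle text that contains these delimiters literally (for example "2 * 3 * 4").

```python
import re

# One pattern per inline mark. BOLD runs before ITALIC so that "**" pairs
# are consumed before the single-asterisk pattern sees them.
_MARK_PATTERNS = [
    re.compile(r"\*\*(.+?)\*\*"),  # BOLD
    re.compile(r"\+\+(.+?)\+\+"),  # UNDERLINE
    re.compile(r"~~(.+?)~~"),      # STRIKETHROUGH
    re.compile(r"\*(.+?)\*"),      # ITALIC
    re.compile(r"\^(.+?)\^"),      # SUPERSCRIPT
    re.compile(r"`(.+?)`"),        # MONOSPACE
]


def strip_marks(text):
    """Remove inline mark delimiters, keeping the text they wrap."""
    for pattern in _MARK_PATTERNS:
        text = pattern.sub(r"\1", text)
    return text


print(strip_marks("**1. Financial Summary**"))
print(strip_marks("A ~~old~~ ++new++ *term* with x^2^ and `id`"))
```

The per-element marks list remains the more reliable signal for whole-element formatting; the regex route is only a convenience for mixed inline runs.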

CLI

The same package installs a paradox-pdf command:

paradox-pdf contract.pdf                       # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8           # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf

Run paradox-pdf --help for the full set of flags.


How it works

                 ┌─────────────────┐
PDF ─────────────► scan_detector   │  per page (<50 chars → vision)
                 └────────┬────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌──────────────────┐        ┌──────────────────────┐
   │ Heuristic        │        │ Vision               │
   │ (PyMuPDF fonts)  │        │ YOLO + RapidOCR      │
   │                  │        │ + Table Transformer  │
   │                  │        │ + HDBSCAN borderless │
   │                  │        │ + TexTAR (marks)     │
   └────────┬─────────┘        └──────────┬───────────┘
            └─────────────┬───────────────┘
                          ▼
              ┌────────────────────────┐
              │ Section tree builder   │
              │ Post-processing passes │
              └───────────┬────────────┘
                          ▼
                       JSON dict
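
The per-page routing step in the diagram can be sketched as a pure function over the text already present on each page. This is an illustration under stated assumptions, not the library's scan_detector: route_pages is a hypothetical name, the 50-character cutoff is taken from the diagram, and stripping surrounding whitespace before counting is a guess.

```python
def route_pages(page_texts, scan_text_threshold=50, force_mode=None):
    """Map 1-based page numbers to 'heuristic' or 'vision'."""
    routes = {}
    for page_number, text in enumerate(page_texts, start=1):
        if force_mode is not None:
            # A forced mode bypasses detection for every page.
            routes[page_number] = force_mode
        elif len(text.strip()) < scan_text_threshold:
            # Little or no embedded text: treat the page as scanned.
            routes[page_number] = "vision"
        else:
            routes[page_number] = "heuristic"
    return routes


# Embedded text per page, e.g. as read from the PDF's text layer.
pages = ["Lorem ipsum " * 20, "", "Page 3"]
print(route_pages(pages))
print(route_pages(pages, force_mode="heuristic"))
```

Raising the threshold (as scan_text_threshold=80 does in the configuration example) sends more sparse pages to the vision pipeline.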

For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates are resolved by non-maximum suppression at an IoU threshold of 0.5, with each candidate scored as fill_rate + source_bonus − merge_penalty, so the highest-scoring detection of each table wins. Merged cells are detected by missing inner borders (≥35% pixel coverage threshold) for bordered tables, and by cell-width ratio (>1.6× column pitch) for borderless ones.
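
The candidate selection described above can be sketched as greedy non-maximum suppression. This is a simplified illustration, not the package's code: the candidate dict layout, the select_tables name, the sample numbers, and the choice to suppress at IoU >= 0.5 (rather than > 0.5) are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score(candidate):
    """Quality score: fill_rate + source_bonus - merge_penalty."""
    return (candidate["fill_rate"] + candidate["source_bonus"]
            - candidate["merge_penalty"])


def select_tables(candidates, iou_threshold=0.5):
    """Keep the best-scoring candidate among each group of overlapping boxes."""
    kept = []
    for cand in sorted(candidates, key=score, reverse=True):
        if all(iou(cand["bbox"], k["bbox"]) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


# Made-up detections: two detectors found the same table, a third found another.
candidates = [
    {"source": "vector", "bbox": (50, 100, 550, 400),
     "fill_rate": 0.90, "source_bonus": 0.10, "merge_penalty": 0.00},
    {"source": "table_transformer", "bbox": (48, 98, 552, 405),
     "fill_rate": 0.80, "source_bonus": 0.05, "merge_penalty": 0.10},
    {"source": "opencv", "bbox": (60, 500, 540, 700),
     "fill_rate": 0.70, "source_bonus": 0.00, "merge_penalty": 0.00},
]

winners = select_tables(candidates)
print([w["source"] for w in winners])
```

Here the two detections of the upper table overlap almost completely, so only the higher-scoring one survives, while the non-overlapping lower table is kept regardless of its score.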


Performance notes

  • Digital page: ~0.05 s on CPU.
  • Scanned page: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
  • First run: HuggingFace models are downloaded once (~500 MB total).

If you see multi-minute startup per document with the vision pipeline, set HF_HUB_OFFLINE=1 after the first download — HuggingFace's online metadata revalidation on slow networks is the bottleneck, not the actual inference:

HF_HUB_OFFLINE=1 python my_script.py

Or in code:

import os
os.environ["HF_HUB_OFFLINE"] = "1"
import paradox_pdf as pdx

Repository layout

paradox_pdf/         Public Python API (extract, extract_text, …)
pdf_tagger/          Core extraction (font classifier, vision layout, marks)
pdf_grid/            Vector-line table detection
scripts/             CLI implementation
docs/                Configuration reference, API reference, research notes
examples/            Sample PDFs + expected outputs
_dev/                Test suites, fixtures, benchmarks (not shipped in wheel)

License

Proprietary — © CreAI. Contact feliperodriguez@creai.mx for commercial use.



