Structured text extraction framework for digital and scanned PDFs with inline formatting preservation

paradox-pdf

Structured text extraction for digital and scanned PDFs — usable as a Python library or a CLI.

Paradox parses any PDF — digital, scanned, or photographed — into a single hierarchical JSON tree of typed elements (titles, paragraphs, tables, lists, headers, signatures, …) with inline marks (bold, italic, underline, strikethrough, …) preserved. It auto-routes each page to the right pipeline (PyMuPDF font analysis for digital, YOLO + OCR + Table Transformer for scanned) and merges the results into a unified document.

Install

pip install paradox-pdf

Requires Python 3.9+. The first run downloads a few HuggingFace models (~500 MB) for the vision pipeline; subsequent runs use the cache.

60-second quick start

import paradox_pdf as pdx

doc = pdx.extract("contract.pdf")

print(doc["total_pages"], "pages")
print(doc["type_summary"])     # {'TITLE': 1, 'PARAGRAPH': 14, 'TABLE': 3, ...}

for el in doc["elements"]:
    print(el["type"], "-", (el.get("text") or "")[:60])

That's it. doc is a plain dict — no custom classes, no streaming generators. JSON-serializable as-is.
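Because the result is a plain dict, the standard json module should be enough to persist or reload it. A minimal sketch, using a hand-written stand-in for the returned dict (the sample values are illustrative, not real output):

```python
import json

# Hand-written stand-in for the dict returned by pdx.extract(); values are illustrative.
doc = {
    "source": "contract.pdf",
    "total_pages": 1,
    "total_elements": 1,
    "total_images": 0,
    "type_summary": {"PARAGRAPH": 1},
    "elements": [
        {"type": "PARAGRAPH", "text": "Hello", "ref": "(p1,l1):(p1,l1)"},
    ],
}

# ensure_ascii=False keeps non-ASCII text (accents, dashes) readable in the file.
payload = json.dumps(doc, ensure_ascii=False, indent=2)

# Reloading yields an equal dict, so nothing is lost in the round trip.
restored = json.loads(payload)
print(restored == doc)
```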

Public API

The package exposes 5 functions and 1 dataclass:

Symbol	Purpose
`extract(pdf, **opts) -> dict`	Run the full pipeline, return JSON in memory.
`extract_to_file(pdf, output, images_dir, **opts) -> dict`	Same as `extract` but also writes JSON + images to disk.
`extract_pages(pdf, pages, **opts) -> dict`	Subset by page number.
`extract_text(pdf, **opts) -> str`	Plain-text concatenation only.
`extract_tables(pdf, **opts) -> list[dict]`	Flat list of every TABLE element.
`PipelineConfig`	Dataclass to override 30+ thresholds.

All functions accept the same keyword options:

Argument	Type	Default	Description
`pages`	`Sequence[int] | None`	`None`	1-based page numbers; `None` = all pages.
`no_images`	`bool`	`False`	Skip image extraction (faster, no PNGs).
`force_mode`	`"heuristic" | "vision" | None`	`None`	Force a pipeline; `None` auto-routes per page.
`output`	`str | Path | None`	`None`	If set, also writes JSON here.
`images_dir`	`str | Path | None`	tempdir	Where extracted images go.
`config`	`PipelineConfig | None`	`None`	Override pipeline thresholds.

Examples

1. Get the document tree

import paradox_pdf as pdx

doc = pdx.extract("annual_report.pdf")

# Top-level structure
print(doc.keys())
# dict_keys(['source', 'total_pages', 'total_elements', 'total_images',
#            'type_summary', 'elements'])

# Walk the heading tree
def walk(nodes, depth=0):
    for n in nodes:
        text = (n.get("text") or "").strip()[:80]
        print(f"{'  '*depth}{n['type']:14s} {text}")
        walk(n.get("children", []), depth + 1)

walk(doc["elements"])

2. Process only certain pages

doc = pdx.extract("contract.pdf", pages=[1, 2, 5])
# or
doc = pdx.extract_pages("contract.pdf", pages=range(10, 20))
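
The pages option takes explicit 1-based numbers. If page selections arrive as strings such as "1-5,8" (the CLI accepts --pages 1-5), a small helper can expand them before calling extract. parse_page_spec is a hypothetical name, not part of paradox_pdf, and the comma-separated syntax is an assumption for this sketch:

```python
def parse_page_spec(spec):
    """Expand a string like "1-3,7" into sorted, de-duplicated 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "a-b" is an inclusive range
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_page_spec("1-3, 7, 2"))  # [1, 2, 3, 7]
```

The resulting list can then be passed as pages=... to extract or extract_pages.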

3. Plain text in one call

text = pdx.extract_text("contract.pdf")

4. Extract every table

tables = pdx.extract_tables("contract.pdf")

for t in tables:
    rows, cols = t["shape"]
    cells = t["cells"]
    print(f"Table {rows}×{cols}, {len(cells)} cells")

    for c in cells:
        p = c["p"]
        if len(p) == 2:                          # simple cell
            r, col = p
            print(f"  ({r},{col}): {c['t']!r}")
        else:                                    # merged cell
            r, col, rowspan, colspan = p
            print(f"  ({r},{col}) span {rowspan}×{colspan}: {c['t']!r}")

Cell schema:

{"p": [row, col], "t": "Some cell text"}                       # simple
{"p": [row, col, rowspan, colspan], "t": "Header cell"}        # merged
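
For downstream tools that expect a rectangular grid, the sparse cell list can be expanded into a dense list of rows. A minimal sketch, assuming row and column indices are 0-based (as the [0, 0] cell in the output schema suggests) and that a merged cell's text should be repeated across its span. cells_to_grid is a hypothetical helper, and the "Writer" row is made up for the demo:

```python
def cells_to_grid(table, fill=None):
    """Expand {"shape", "cells"} into a dense rows x cols list of lists."""
    rows, cols = table["shape"]
    grid = [[fill] * cols for _ in range(rows)]
    for cell in table["cells"]:
        p = cell["p"]
        r, c = p[0], p[1]
        # Simple cells have 2 entries; merged cells add rowspan and colspan.
        rowspan, colspan = (p[2], p[3]) if len(p) == 4 else (1, 1)
        for rr in range(r, min(r + rowspan, rows)):
            for cc in range(c, min(c + colspan, cols)):
                grid[rr][cc] = cell["t"]
    return grid


# Illustrative table, not real output.
table = {
    "shape": [2, 4],
    "cells": [
        {"p": [0, 0], "t": "Category"},
        {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"},
        {"p": [1, 0], "t": "Writer"},
    ],
}

grid = cells_to_grid(table)
for row in grid:
    print(row)
```

Positions with no cell keep the fill value, which makes gaps in the extraction easy to spot.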

5. Persist to disk

doc = pdx.extract_to_file(
    "contract.pdf",
    output="out/contract.json",
    images_dir="out/images/",
)

The function still returns the dict.

6. Convert a folder

from pathlib import Path
import paradox_pdf as pdx

for pdf in Path("inbox/").glob("*.pdf"):
    doc = pdx.extract_to_file(pdf, output=f"out/{pdf.stem}.json", no_images=True)
    print(f"{pdf.name:40s}  {doc['total_pages']}p  {doc['total_elements']} elements")

7. Custom configuration

from paradox_pdf import extract, PipelineConfig

cfg = PipelineConfig(
    render_dpi=300,                    # higher DPI for vision pipeline
    scan_text_threshold=80,            # treat pages with <80 chars as scanned
    cv_border_missing_threshold=0.40,  # be stricter about declaring borders absent
    yolo_confidence=0.30,              # stricter YOLO detections
)

doc = extract("noisy_scan.pdf", config=cfg)

Full reference of the 30+ tunables is in docs/configuration.md.

You can also override any parameter with environment variables prefixed PDF_:

PDF_RENDER_DPI=300 PDF_YOLO_CONFIDENCE=0.3 python my_script.py
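
The mapping from variable names to fields is only shown by example here (PDF_RENDER_DPI for render_dpi). The sketch below illustrates one way such a prefix-based override could work. DemoConfig stands in for PipelineConfig, its defaults are invented, and config_from_env is a hypothetical helper rather than the library's actual loader:

```python
import os
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    # Stand-in for PipelineConfig; these defaults are invented for the demo.
    render_dpi: int = 200
    yolo_confidence: float = 0.25


def config_from_env(environ=os.environ, prefix="PDF_"):
    """Build a DemoConfig, overriding any field that has a PDF_<FIELD> variable."""
    overrides = {}
    for f in fields(DemoConfig):
        raw = environ.get(prefix + f.name.upper())
        if raw is not None:
            # Cast the string to the type of the field's default (int or float here).
            overrides[f.name] = type(f.default)(raw)
    return DemoConfig(**overrides)


cfg = config_from_env({"PDF_RENDER_DPI": "300", "PDF_YOLO_CONFIDENCE": "0.3"})
print(cfg)
```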

8. Force a specific pipeline

# Force the digital pipeline even if a page looks scanned (faster, no OCR)
doc = pdx.extract("digital_only.pdf", force_mode="heuristic")

# Force the vision pipeline (OCR every page, even digital ones)
doc = pdx.extract("scanned.pdf", force_mode="vision")

9. Just count things

doc = pdx.extract("contract.pdf", no_images=True)
print(doc["type_summary"])
# {'TITLE': 1, 'H1': 4, 'H2': 11, 'PARAGRAPH': 67, 'TABLE': 3, 'SIGNATURE': 2}

10. Build markdown from the tree

import paradox_pdf as pdx

LEVEL = {"TITLE": 1, "SUBTITLE": 2, "H1": 3, "H2": 4, "H3": 5, "H4": 6}

def to_markdown(nodes, out=None):
    """Render headings, paragraphs, and table stubs from the element tree."""
    out = out if out is not None else []
    for n in nodes:
        t = n.get("type")
        text = (n.get("text") or "").strip()
        line = None
        if t in LEVEL and text:
            line = "#" * LEVEL[t] + " " + text
        elif t == "PARAGRAPH" and text:
            line = text
        elif t == "TABLE":
            rows, cols = n["shape"]
            line = f"_<table {rows}x{cols}>_"
        if line is not None:
            out.extend([line, ""])    # blank line only after emitted blocks
        to_markdown(n.get("children", []), out)
    return "\n".join(out)

doc = pdx.extract("contract.pdf", no_images=True)
print(to_markdown(doc["elements"]))

Output schema

{
  "source": "contract.pdf",
  "total_pages": 12,
  "total_elements": 145,
  "total_images": 4,
  "type_summary": {"TITLE": 1, "PARAGRAPH": 67, "TABLE": 3, "...": "..."},
  "elements": [
    {
      "type": "TITLE",
      "marks": ["BOLD"],
      "text": "**Annual Report — Q4 2025**",
      "ref": "(p1,l1):(p12,l8)",
      "children": [
        {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
        {"type": "H1",
         "text": "**1. Financial Summary**",
         "ref": "(p1,l3):(p2,l4)",
         "children": [
           {"type": "TABLE",
            "shape": [5, 4],
            "cells": [
              {"p": [0, 0], "t": "Category"},
              {"p": [0, 1, 1, 3], "t": "Studio Minimum Rates"}
            ],
            "ref": "(p1,l4):(p1,l4)"}
         ]}
      ]
    }
  ]
}
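
Since elements nest through children, most consumers will want a flat iterator over the tree. A minimal sketch on a trimmed copy of the schema above; iter_elements is a hypothetical helper, and whether type_summary counts nested elements the same way is an assumption:

```python
from collections import Counter


def iter_elements(nodes):
    """Yield every element in document order, parents before children."""
    for node in nodes:
        yield node
        yield from iter_elements(node.get("children", []))


# Trimmed copy of the schema example.
doc = {
    "elements": [
        {
            "type": "TITLE",
            "ref": "(p1,l1):(p12,l8)",
            "children": [
                {"type": "PARAGRAPH", "text": "...", "ref": "(p1,l2):(p1,l2)"},
                {
                    "type": "H1",
                    "ref": "(p1,l3):(p2,l4)",
                    "children": [
                        {"type": "TABLE", "shape": [5, 4], "cells": [],
                         "ref": "(p1,l4):(p1,l4)"},
                    ],
                },
            ],
        }
    ]
}

counts = Counter(el["type"] for el in iter_elements(doc["elements"]))
print(dict(counts))
```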

`ref` field

Every element gets a ref of the form "(pX,lY):(pX,lY)", where:

pX = page number (1-based),
lY = element index within that page (1-based).
The first tuple is where the element starts; the second is where its last descendant ends, so the two tuples can differ (the TITLE in the schema above spans "(p1,l1):(p12,l8)").
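
A ref string can be split into its start and end positions with a regular expression. A minimal sketch based only on the format described here; parse_ref is a hypothetical helper, not a library function:

```python
import re

# Matches "(p<page>,l<index>):(p<page>,l<index>)"
REF_RE = re.compile(r"^\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)$")


def parse_ref(ref):
    """Return ((start_page, start_index), (end_page, end_index)) as ints."""
    m = REF_RE.match(ref)
    if m is None:
        raise ValueError(f"unrecognized ref: {ref!r}")
    p1, l1, p2, l2 = map(int, m.groups())
    return (p1, l1), (p2, l2)


start, end = parse_ref("(p1,l1):(p12,l8)")
print(start, end)                # (1, 1) (12, 8)
print(end[0] - start[0] + 1)     # pages spanned: 12
```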

Element types (excerpt)

TITLE, SUBTITLE, H1–H4, PARAGRAPH, TABLE, LIST (with items[]), TOC (with entries[]), IMAGE, SIGNATURE, AMENDMENT_DEL, EXHIBIT, APPENDIX, FOOTER, HEADER, PAGE_NUMBER, plus 50+ more. Full list: pdf_tagger/catalog.py.

Inline marks

Marks are preserved both in marks: [...] (per-element) and inline in the text:

Mark	Inline syntax
BOLD	`**bold text**`
ITALIC	`italic`
UNDERLINE	`++underlined++`
STRIKETHROUGH	`~~deleted~~`
SUPERSCRIPT	`^superscript^`
MONOSPACE	`code`
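
When only plain text is needed, the inline delimiters can be removed with a few regular expressions. A minimal sketch that handles only the delimiters visible in this README: `++`, `~~`, and `^` from the table, plus `**` for bold as used in the output schema example. The italic and monospace delimiters are left out because their exact syntax is not shown here. strip_marks is a hypothetical helper:

```python
import re

# One pattern per delimiter pair; non-greedy so adjacent marks stay separate.
_MARK_PATTERNS = [
    re.compile(r"\*\*(.+?)\*\*"),   # BOLD
    re.compile(r"\+\+(.+?)\+\+"),   # UNDERLINE
    re.compile(r"~~(.+?)~~"),       # STRIKETHROUGH
    re.compile(r"\^(.+?)\^"),       # SUPERSCRIPT
]


def strip_marks(text):
    """Drop inline mark delimiters and keep the enclosed text."""
    for pattern in _MARK_PATTERNS:
        text = pattern.sub(r"\1", text)
    return text


print(strip_marks("**Annual Report** with ++underline++ and ~~old~~ text, x^2^"))
# Annual Report with underline and old text, x2
```

Note that this keeps struck-through text; a consumer that treats STRIKETHROUGH as deleted content would drop those spans instead.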

CLI

The same package installs a paradox-pdf command:

paradox-pdf contract.pdf                       # → output/contract.json
paradox-pdf contract.pdf -o result.json
paradox-pdf docs/ -o extracted/ -w 8           # parallel folder
paradox-pdf --pages 1-5 contract.pdf
paradox-pdf --no-images contract.pdf

Run paradox-pdf --help for the full set of flags.

How it works

                 ┌─────────────────┐
PDF ─────────────► scan_detector   │  per page (<50 chars → vision)
                 └────────┬────────┘
            ┌─────────────┴─────────────┐
            ▼                           ▼
   ┌──────────────────┐        ┌──────────────────────┐
   │ Heuristic        │        │ Vision               │
   │ (PyMuPDF fonts)  │        │ YOLO + RapidOCR      │
   │                  │        │ + Table Transformer  │
   │                  │        │ + HDBSCAN borderless │
   │                  │        │ + TexTAR (marks)     │
   └────────┬─────────┘        └──────────┬───────────┘
            └─────────────┬───────────────┘
                          ▼
              ┌────────────────────────┐
              │ Section tree builder   │
              │ Post-processing passes │
              └───────────┬────────────┘
                          ▼
                       JSON dict
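
The per-page routing step in the diagram can be sketched as a small function. This is an illustration of the rule as stated (fewer than 50 extractable characters sends a page to the vision pipeline, unless force_mode overrides it), not the actual scan_detector code. route_page is a hypothetical name, and stripping whitespace before counting is an assumption:

```python
def route_page(text, scan_text_threshold=50, force_mode=None):
    """Pick "heuristic" or "vision" for one page from its extractable text."""
    if force_mode is not None:
        if force_mode not in ("heuristic", "vision"):
            raise ValueError(f"unknown force_mode: {force_mode!r}")
        return force_mode
    # Pages with almost no embedded text are treated as scanned.
    if len(text.strip()) < scan_text_threshold:
        return "vision"
    return "heuristic"


pages = ["", "A digital page with plenty of embedded, selectable text in it."]
print([route_page(t) for t in pages])  # ['vision', 'heuristic']
```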

For tables, three detectors run in parallel: vector lines (PyMuPDF), Table Transformer, and OpenCV border morphology. Overlapping candidates are resolved by non-maximum suppression at IoU 0.5, with each candidate scored as fill_rate + source_bonus − merge_penalty, so the highest-scoring detection of each table wins. Merged cells are detected by missing inner borders for bordered tables (≥35% pixel coverage threshold) and by cell-width ratio for borderless ones (>1.6× column pitch).
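
The candidate selection can be illustrated with a generic greedy non-maximum suppression. This is a sketch under stated assumptions, not the library's implementation: the candidate dict layout, the bbox format (x0, y0, x1, y1), the sample scores, and the choice to suppress at IoU ≥ 0.5 are all invented for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score(cand):
    # Scoring formula as described in the text above.
    return cand["fill_rate"] + cand["source_bonus"] - cand["merge_penalty"]


def nms(candidates, iou_threshold=0.5):
    """Keep the best-scoring candidate in each group of overlapping boxes."""
    kept = []
    for cand in sorted(candidates, key=score, reverse=True):
        if all(iou(cand["bbox"], k["bbox"]) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


# Two detectors found the same table; a third found a different one.
candidates = [
    {"source": "vector", "bbox": (0, 0, 100, 50),
     "fill_rate": 0.9, "source_bonus": 0.2, "merge_penalty": 0.0},
    {"source": "transformer", "bbox": (2, 1, 98, 52),
     "fill_rate": 0.8, "source_bonus": 0.1, "merge_penalty": 0.1},
    {"source": "opencv", "bbox": (0, 200, 100, 260),
     "fill_rate": 0.7, "source_bonus": 0.0, "merge_penalty": 0.0},
]

winners = nms(candidates)
print([w["source"] for w in winners])  # ['vector', 'opencv']
```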

Performance notes

Digital page: ~0.05 s on CPU.
Scanned page: ~10 s on CPU, much faster on GPU (PyTorch detects and uses CUDA automatically).
First run: HuggingFace models are downloaded once (~500 MB total).

If you see multi-minute startup per document with the vision pipeline, set HF_HUB_OFFLINE=1 after the first download — HuggingFace's online metadata revalidation on slow networks is the bottleneck, not the actual inference:

HF_HUB_OFFLINE=1 python my_script.py

Or in code:

import os
os.environ["HF_HUB_OFFLINE"] = "1"
import paradox_pdf as pdx

Repository layout

paradox_pdf/         Public Python API (extract, extract_text, …)
pdf_tagger/          Core extraction (font classifier, vision layout, marks)
pdf_grid/            Vector-line table detection
scripts/             CLI implementation
docs/                Configuration reference, API reference, research notes
examples/            Sample PDFs + expected outputs
_dev/                Test suites, fixtures, benchmarks (not shipped in wheel)

License

Proprietary — © CreAI. Contact feliperodriguez@creai.mx for commercial use.

Release history

0.4.0: Apr 28, 2026
0.3.1: Apr 27, 2026
0.3.0: Apr 27, 2026
0.2.2: Apr 27, 2026
0.2.1: Apr 27, 2026 (this version)
0.2.0: Apr 27, 2026
0.1.3: Apr 21, 2026
0.1.1: Apr 21, 2026
0.1.0: Apr 21, 2026
