
Paradox

Structured text extraction framework for digital and scanned PDFs with inline formatting preservation.

Paradox is a dual-pipeline framework that extracts semantically typed, hierarchically structured content from PDF documents. It detects whether each page is digital or scanned and routes it to the optimal extraction strategy, producing a single unified JSON output regardless of source quality.

Key Features

  • Dual pipeline: heuristic (digital PDFs) + vision (scanned/photographed), routed per page
  • 68+ element types: headings, paragraphs, tables, lists, amendments, signatures, metadata, and more
  • 6 inline marks: **bold**, *italic*, ++underline++, ~~strikethrough~~, ^superscript^, `monospace`
  • Table extraction with merged cell detection (colspan/rowspan) via vector analysis and OpenCV
  • Hierarchical JSON output with traceable page refs (pX,lY):(pX,lY)
  • Ensemble table detection: HDBSCAN clustering + Table Transformer + OpenCV, scored and arbitrated
  • Configurable via 30+ parameters (dataclass + environment variable overrides)

Install

pip install -r requirements.txt

Usage

python scripts/convert.py document.pdf

That's it. Output goes to output/document.json.

Options

| Flag | Description |
|------|-------------|
| `-o PATH` | Custom output path (file or directory) |
| `-w N` | Parallel workers (default: 20) |
| `--pages 1-5` | Extract specific pages only |
| `--force-vision` | Force vision pipeline on all pages |
| `--force-heuristic` | Force heuristic pipeline on all pages |
| `--compare PAGE` | Generate visual QA PNGs for tables on a page |
| `--no-images` | Skip embedded image extraction |
| `--compact` | Compact JSON (no indentation) |

python scripts/convert.py contract.pdf -o result.json     # Custom output
python scripts/convert.py docs/ -o extracted/ -w 8        # Batch folder, 8 workers
python scripts/convert.py scan.pdf --force-vision         # Force OCR pipeline
python scripts/convert.py contract.pdf --compare all      # Visual QA for all tables

How It Works

Pipeline Overview

Digital Pipeline — Real Example


Extracted output (abbreviated):

{
  "elements": [
    {"type": "TITLE", "marks": ["BOLD"], "text": "**Annual Report — Q4 2025**"},
    {"type": "PARAGRAPH", "text": "Revenue increased by **12.3%**...driven by *international expansion*..."},
    {"type": "H1", "marks": ["BOLD"], "text": "**1. Financial Summary**",
      "children": [
        {"type": "TABLE", "shape": [5, 4], "cells": [
          {"p": [0,0], "t": "Category"}, {"p": [0,1], "t": "Q3 2025"}, ...
        ]}
      ]},
    {"type": "H1", "marks": ["BOLD"], "text": "**2. Notes**",
      "children": [
        {"type": "PARAGRAPH", "text": "All figures reported in USD. See *Appendix A*."}
      ]}
  ]
}

Vision Pipeline — Real Example


The key insight: digital PDFs contain rich font metadata (bold flags, font names, vector drawings) that enables near-perfect extraction. Scanned PDFs have none of this; they are just images. Rather than forcing one approach on both, Paradox routes each page independently to the optimal pipeline, then produces an identical output format. A single document can have digital contract pages interleaved with scanned signed exhibits, and every page gets the best available extraction.

Digital Path (Heuristic)

The heuristic pipeline leverages PyMuPDF to extract text spans with full font metadata. Each span carries flags for bold, italic, superscript, and monospace. Underline and strikethrough are detected geometrically by finding horizontal vector lines drawn across text baselines; this is necessary because most PDF producers draw these as separate line objects rather than setting a font flag.

Block classification uses a priority ladder with 10 levels, progressing from document metadata through titles and headings down to body content. Font size, weight, and position on the page all contribute to the classification decision.
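A priority ladder of this kind can be sketched as an ordered rule list where the first match wins. The rules below are a simplified stand-in for the real 10-level ladder (whose exact predicates live in `font_classifier.py`); the function name and thresholds are illustrative.

```python
def classify_block(text: str, size: float, bold: bool, body_size: float = 10.0) -> str:
    # Rules are checked top-down: metadata first, then titles/headings, then body.
    ladder = [
        (lambda: text.lower().startswith("page ") and text.strip()[-1].isdigit(),
         "PAGE_NUMBER"),
        (lambda: size >= body_size * 1.8 and bold, "TITLE"),
        (lambda: size >= body_size * 1.4 and bold, "H1"),
        (lambda: size >= body_size * 1.2 and bold, "H2"),
        (lambda: bold, "H3"),
    ]
    for predicate, element_type in ladder:
        if predicate():
            return element_type
    return "PARAGRAPH"
```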

Speed: ~250 pages/second. No GPU required.

Vision Path (YOLO + OCR + TexTAR)

The vision pipeline renders each page to an image and processes it through three stages:

  1. DocLayout-YOLO detects page regions (title, text, table, figure, caption, etc.)
  2. RapidOCR (ONNX runtime) extracts word-level text from each detected region
  3. TexTAR (Vision Transformer, ICDAR 2025) classifies inline marks per word: bold, italic, underline, and strikethrough

Strikethrough detection uses an OpenCV fallback because TexTAR's accuracy on strikethrough alone is poor. Superscript is detected via a bounding-box height heuristic (word height below 60% of the line median). Monospace cannot be detected from images and is unavailable in the vision path.
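The superscript heuristic described above (word height below 60% of the line median) is simple enough to sketch directly; the function name and the `(text, height)` word representation are illustrative.

```python
from statistics import median

def find_superscripts(words: list[tuple[str, float]], ratio: float = 0.60) -> list[str]:
    """Flag words whose bbox height falls below `ratio` times the line median."""
    med = median(h for _, h in words)
    return [text for text, h in words if h < ratio * med]
```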

Speed: ~12 pages/second (GPU), ~0.5 pages/second (CPU-only).

Table Extraction Strategy


Scanned table regions detected by YOLO are processed through two parallel structure-extraction strategies:

  • HDBSCAN clustering (cluster_cells.py): a purely geometric approach that clusters OCR bounding boxes into rows and columns. Warp-invariant and effective on borderless or loosely formatted tables.
  • Table Transformer + OpenCV (table_vision.py): a semantic approach that detects row and column separators. Better for tables with clear borders.

A scoring function (_score_struct()) evaluates both results and picks the winner based on cell fill rate, a merge penalty (linear: merge_ratio * 1.5), and word-count compatibility. Source priority provides tie-breaking: vectorial (+0.3) > vision table (+0.1) > HDBSCAN (0.0).
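A scorer consistent with that description might look like the sketch below: fill rate, minus the linear merge penalty, plus a source-priority bonus. The signature is an assumption; the real `_score_struct()` likely takes a richer structure object.

```python
SOURCE_BONUS = {"vectorial": 0.3, "vision_table": 0.1, "hdbscan": 0.0}

def score_struct(filled_cells: int, total_cells: int,
                 merged_cells: int, source: str) -> float:
    """Higher is better: reward fill rate, penalize merges, break ties by source."""
    if total_cells == 0:
        return 0.0
    fill_rate = filled_cells / total_cells
    merge_ratio = merged_cells / total_cells
    return fill_rate - merge_ratio * 1.5 + SOURCE_BONUS.get(source, 0.0)
```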

When both the vectorial pipeline (for digital content) and the vision pipeline detect the same table, a deduplication step (_dedupe_tables()) groups candidates by bounding-box IoU >= 0.5 and keeps only the highest-scoring result per group.
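The IoU-based deduplication step can be sketched as greedy non-maximum suppression: sort candidates by score, then keep each one only if it does not overlap an already-kept box at IoU >= 0.5. The candidate representation `(bbox, score)` is illustrative, not the real `_dedupe_tables()` input.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def dedupe_tables(candidates: list[tuple[tuple, float]], thresh: float = 0.5):
    """Keep only the highest-scoring candidate per overlapping group."""
    kept = []
    for bbox, score in sorted(candidates, key=lambda c: -c[1]):
        if all(iou(bbox, k[0]) < thresh for k in kept):
            kept.append((bbox, score))
    return kept
```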

Output Format

Both pipelines produce identical JSON. The output is a hierarchical tree where headings contain their children (paragraphs, tables, lists), enabling semantic navigation of the document.

{
  "source": "contract.pdf",           //  Source filename
  "total_pages": 228,                 //  Document stats
  "total_elements": 544,
  "type_summary": {                   //  Element count by type
    "PARAGRAPH": 200,
    "H1": 50,
    "TABLE": 12
  },

  // HIERARCHICAL TREE — headings nest their children by depth
  //    TITLE > SUBTITLE > H1 > H2 > H3 > H4
  //    Each heading "owns" everything until the next heading of equal or higher level
  "elements": [
    {
      "type": "H1",                   //  Element type (68+ types available)
      "marks": ["BOLD"],              //  Detected formatting: BOLD, ITALIC, UNDERLINE,
                                      //    STRIKETHROUGH, SUPERSCRIPT, MONOSPACE
      "text": "**ARTICLE 13.  MINIMUM COMPENSATION**",
                                      //  Text with inline markers:
                                      //    **bold**  *italic*  ++underline++
                                      //    ~~strike~~  ^super^  `mono`
      "ref": "(p91,l1):(p95,l12)",    // Traceable location in source PDF
                                      //    (page 91, element 1) to (page 95, element 12)

      // CHILDREN — everything under this H1 until the next H1
      "children": [
        {
          "type": "TABLE",
          "marks": [],
          "shape": [14, 7],           // Table dimensions: 14 rows × 7 columns
          "cells": [
            //  Merged cells: "p": [row, col, rowspan, colspan]
            {"p": [0, 0, 2, 2], "t": "HIGH BUDGET"},     // spans 2 rows × 2 cols
            {"p": [0, 2, 1, 5], "t": "EFFECTIVE"},        // spans 1 row × 5 cols

            // Normal cells: "p": [row, col]
            {"p": [2, 1], "t": "Screenplay, including treatment"},
            {"p": [2, 2], "t": "$126,089"}
          ]
        },
        {
          //  Strikethrough — deleted text preserved with ~~markers~~
          "type": "PARAGRAPH",
          "marks": ["STRIKETHROUGH"],
          "text": "~~The term of this Agreement shall be for a period commencing July 1, 2017~~",
          "ref": "(p91,l5):(p91,l5)"
        }
      ]
    }
  ]
}

Ref format

"ref": "(p1,l3):(p6,l2)"
         |  |     |  |
         |  |     |  +-- element 2 on that page
         |  |     +----- page 6
         |  +----------- element 3 on that page
         +-------------- page 1

Both p (page) and l (element index) are 1-based.
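Parsing a ref back into numbers is mechanical; a minimal sketch (the function name is illustrative):

```python
import re

REF_RE = re.compile(r"\(p(\d+),l(\d+)\):\(p(\d+),l(\d+)\)")

def parse_ref(ref: str) -> tuple[tuple[int, int], tuple[int, int]]:
    """Parse '(pX,lY):(pX,lY)' into ((page, elem), (page, elem)), 1-based."""
    m = REF_RE.fullmatch(ref)
    if not m:
        raise ValueError(f"bad ref: {ref!r}")
    p1, l1, p2, l2 = map(int, m.groups())
    return (p1, l1), (p2, l2)
```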

Table cell format

  • "p": [r, c] -- normal cell at row r, column c
  • "p": [r, c, rs, cs] -- merged cell spanning rs rows and cs columns
  • "t" -- cell text, with \n for internal line breaks
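The compact cell list can be expanded back into a dense 2-D grid by replicating merged-cell text across the spanned positions. A minimal sketch following the `"p"`/`"t"` format above:

```python
def cells_to_grid(shape: tuple[int, int], cells: list[dict]) -> list[list]:
    """Expand a Paradox cell list into a rows x cols grid of cell texts."""
    rows, cols = shape
    grid = [[None] * cols for _ in range(rows)]
    for cell in cells:
        p = cell["p"]
        r, c = p[0], p[1]
        rs = p[2] if len(p) == 4 else 1  # rowspan (default 1)
        cs = p[3] if len(p) == 4 else 1  # colspan (default 1)
        for dr in range(rs):
            for dc in range(cs):
                grid[r + dr][c + dc] = cell["t"]
    return grid
```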

Inline markers

| Mark | Syntax | Example |
|------|--------|---------|
| BOLD | `**text**` | `**ARTICLE 1**` |
| ITALIC | `*text*` | `*See exhibit*` |
| BOLD+ITALIC | `***text***` | `***IMPORTANT***` |
| STRIKETHROUGH | `~~text~~` | `~~Deleted~~` |
| UNDERLINE | `++text++` | `++Underlined++` |
| SUPERSCRIPT | `^text^` | `^1^` |
| MONOSPACE | `` `text` `` | `` `DocID-123` `` |

Configuration

All pipeline thresholds are centralized in pdf_tagger/config.py as a PipelineConfig dataclass. Override any value via environment variables with the PDF_ prefix:

PDF_RENDER_DPI=300 python scripts/convert.py input.pdf
PDF_CV_BORDER_MISSING_THRESHOLD=0.25 python scripts/convert.py input.pdf

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `PDF_SCAN_TEXT_THRESHOLD` | 50 | Characters below which a page is routed to the vision pipeline |
| `PDF_RENDER_DPI` | 200 | DPI for rendering pages to images (higher = better OCR, slower) |
| `PDF_CV_BORDER_MISSING_THRESHOLD` | 0.35 | OpenCV border coverage ratio below which a cell boundary is considered missing (triggers merge) |
| `PDF_YOLO_CONFIDENCE` | 0.2 | Minimum confidence for YOLO region detections |
| `PDF_OCR_MIN_CONFIDENCE` | 0.3 | Minimum confidence for OCR word results |
| `PDF_HDBSCAN_MIN_FILL_RATE` | 0.40 | Minimum cell fill rate to accept an HDBSCAN table structure |
| `PDF_DEDUP_IOU_THRESHOLD` | 0.5 | IoU threshold for suppressing duplicate table detections |
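The dataclass-plus-environment-override pattern can be sketched as below. This is a stub with two of the parameters, not the real `PipelineConfig` in `pdf_tagger/config.py`, and `from_env` is an assumed constructor name.

```python
import os
from dataclasses import dataclass, fields

@dataclass
class PipelineConfig:
    scan_text_threshold: int = 50
    render_dpi: int = 200

    @classmethod
    def from_env(cls, prefix: str = "PDF_") -> "PipelineConfig":
        """Build a config, letting PDF_* environment variables override defaults."""
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get(prefix + f.name.upper())
            if raw is not None:
                kwargs[f.name] = f.type(raw)  # cast string to the field's type
        return cls(**kwargs)
```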

Performance

Benchmark results (April 2026):

| Suite | Shape Match | Cell Accuracy | Text Similarity |
|-------|-------------|---------------|-----------------|
| Digital via Vision | 100% | 88.9% | 100% |
| Real Scanned | 80% | 67.8% | 86.6% |
| Stress Tests | 100% | 82.0% | 100% |

Shape match measures whether the extracted grid dimensions are correct. Cell accuracy measures exact content match per cell. Text similarity is a fuzzy comparison of all extracted text against ground truth.
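The three metrics as described can be sketched as follows; the real benchmark harness in `_dev/` may compute them differently (in particular, the fuzzy text comparison here uses `difflib` as a stand-in).

```python
from difflib import SequenceMatcher

def shape_match(pred_shape, true_shape) -> bool:
    """True when extracted grid dimensions equal the ground truth."""
    return tuple(pred_shape) == tuple(true_shape)

def cell_accuracy(pred: dict, true: dict) -> float:
    """Fraction of ground-truth cells {(row, col): text} matched exactly."""
    return sum(pred.get(k) == true[k] for k in true) / len(true)

def text_similarity(pred_text: str, true_text: str) -> float:
    """Fuzzy ratio of all extracted text against ground truth."""
    return SequenceMatcher(None, pred_text, true_text).ratio()
```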

Architecture

paradox/
│
├── scripts/
│   └── convert.py                  CLI entry point — single PDF or batch folder
│
├── models/
│   └── textar-trained.pt           TexTAR Vision Transformer weights (64 MB, ICDAR 2025)
│
├── pdf_tagger/                     Core extraction pipeline
│   │
│   ├── tagger_json.py              Main orchestrator
│   │                                   Routes pages (digital/scanned), runs post-processing,
│   │                                   deduplicates tables, builds section tree, assigns refs
│   │
│   ├── config.py                   PipelineConfig dataclass — 30+ tunable thresholds
│   ├── scan_detector.py            Per-page routing: < 50 chars text → vision path
│   ├── catalog.py                  68+ element types (TITLE, H1, TABLE, PARAGRAPH, ...)
│   │
│   │   ── Digital Path ──
│   ├── font_classifier.py          PyMuPDF font extraction → block classification
│   │                                   Priority ladder: 10 levels, metadata → titles → body
│   ├── tagger.py                   Geometric detection of strikethrough + underline
│   │                                   Finds real vector lines drawn across text baselines
│   │
│   │   ── Vision Path ──
│   ├── vision_layout.py            DocLayout-YOLO region detection + RapidOCR
│   │                                   Ensemble scoring: HDBSCAN vs TATR for tables
│   ├── marks_vision.py             TexTAR per-word mark classification
│   │                                   T1: BOLD/ITALIC  ·  T2: UNDERLINE/STRIKETHROUGH
│   │                                   + OpenCV fallback + superscript heuristic
│   ├── table_vision.py             Table Transformer (DETR) + OpenCV grid detection
│   │                                   Merged cell detection via border coverage analysis
│   ├── cluster_cells.py            HDBSCAN density clustering for borderless tables
│   │                                   Row/column inference from OCR word positions
│   │
│   │   ── Shared ──
│   ├── camscanner.py               Deskew + perspective correction for photos
│   ├── table_compare.py            Visual QA: side-by-side table PNG comparison
│   └── textar_model/               TexTAR architecture (DeiT backbone + dual heads)
│
├── pdf_grid/                       Low-level table geometry
│   ├── line_tables.py              ── Vector lines → grid → merged cells (digital PDFs)
│   ├── cv_tables.py                ── OpenCV morphology → border-missing detection
│   ├── borderless_tables.py        ── Column alignment → borderless table detection
│   ├── geometry.py                 ── Clustering, coverage, geometric primitives
│   ├── extractor.py                ── Grid extraction coordinator
│   ├── text_layout.py              ── Text-to-cell assignment
│   └── types.py                    ── Shared type definitions (Word, BBox)
│
├── docs/                           Documentation
│   ├── architecture.md             Deep dive: why dual pipeline, design decisions
│   ├── why-this-approach.md        Tradeoffs: vs LLMs, vs Docling/MinerU/Marker
│   ├── configuration.md            All 30+ parameters with types and defaults
│   ├── api-reference.md            Python API, JSON schema, CLI reference
│   └── research/                   12 research documents (YAML status headers)
│
└── examples/                       Sample PDFs + expected JSON outputs

For a detailed architectural walkthrough, see docs/architecture.md. For design decisions and tradeoffs, see docs/why-this-approach.md.

Marks Coverage

| Mark | Syntax | Digital | Vision | Method (Digital) | Method (Vision) |
|------|--------|:-------:|:------:|------------------|-----------------|
| Bold | `**text**` | ✓ | ✓ | Font flags + name | TexTAR T1 head |
| Italic | `*text*` | ✓ | ✓ | Font flags | TexTAR T1 head |
| Underline | `++text++` | ✓ | ✓ | Geometric line detection | TexTAR T2 head |
| Strikethrough | `~~text~~` | ✓ | ✓ | Geometric line detection | TexTAR T2 + OpenCV |
| Superscript | `^text^` | ✓ | ✓ | Font flags | BBox height heuristic |
| Monospace | `` `text` `` | ✓ | ✗ | Font name heuristic | Not available |

Inline marks within table cells are not yet extracted in either pipeline.

Limitations

  • MONOSPACE: only detected in digital PDFs via font name heuristic. No visual equivalent exists for scanned pages because monospace fonts are not reliably distinguishable from proportional fonts at typical scan resolutions.
  • Multi-column layouts: reading order may interleave columns on scanned pages. The heuristic pipeline benefits from PyMuPDF's text flow analysis; the vision pipeline relies on YOLO region order, which can fail on complex layouts.
  • Complex scanned tables: tables with more than ~8 rows and complex multi-level headers may lose 1-2 rows due to HDBSCAN clustering sensitivity or YOLO region boundary errors.
  • Table cell marks: inline formatting (bold, italic, etc.) within individual table cells is not extracted in either pipeline.
  • GPU recommendation: the vision pipeline runs on CPU but is approximately 24x slower. A CUDA-capable GPU is strongly recommended for batch processing of scanned documents.

Development

Testing suites and benchmarks live in the _dev/ directory:

python scripts/run_tests.py                # All suites combined
python scripts/test_vision_vs_heuristic.py # Vision vs heuristic on digital PDFs
python scripts/test_vision_full.py         # 3 suites: digital, scanned, stress
python scripts/test_complex_tables.py      # 10 complex tables with merged cells
python scripts/bench_distortion.py         # Robustness under geometric distortions

See _dev/README.md for detailed testing instructions and ground-truth format.

License

This project is proprietary. All rights reserved.
