High-performance 2-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, math/table support, and header/footer stripping.

LightningDoc ⚡

High-performance 2-stage PDF → Markdown extraction engine.

LightningDoc extracts clean, structured Markdown from any PDF — native text, scanned documents, handwritten forms, or mixed. It combines layout-aware parsing with multi-strategy OCR into a single pip install.

✨ Features

2-Stage Pipeline — Layout Detection → Text Extraction
~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
Font-based math detection — inline $...$ and display $$...$$ LaTeX equations
Table extraction — PyMuPDF line detection + heuristic borderless table fallback
Header/footer stripping — pattern + font-size aware margin detection
Multi-strategy OCR — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
Column-aware reading order — correct left→right ordering for multi-column layouts
Zero API keys — everything runs 100% offline
Built-in web UI — interactive viewer with bounding-box overlay, extraction dashboard
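
To make the column-aware reading order feature concrete, here is a minimal sketch for a two-column page. It is not LightningDoc's implementation: the `Block` type, the `reading_order` function, and the rule of splitting columns at the page midpoint are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: a bounding box plus its text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, page_width):
    """Order blocks left column first, top to bottom within each column.

    Assumption: exactly two columns split at the page midpoint. Full-width
    blocks (titles, wide tables) are not handled by this sketch.
    """
    mid = page_width / 2

    def key(block):
        center = (block.x0 + block.x1) / 2
        column = 0 if center < mid else 1
        # Sort by column, then vertical position, then horizontal position.
        return (column, block.y0, block.x0)

    return sorted(blocks, key=key)
```

Sorting purely by vertical position would interleave the two columns line by line; grouping by column first is what keeps each column's text contiguous.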

🚀 Quick Start

Installation

pip install lightningdoc

With TrOCR handwriting support:

pip install "lightningdoc[llm]"

System requirement: Tesseract OCR must be installed separately:

macOS: brew install tesseract

Ubuntu: sudo apt install tesseract-ocr

Windows: install Tesseract with the Windows installer
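
Because Tesseract is installed outside of pip, it can help to confirm that the executable is reachable before running OCR. The sketch below uses only the standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and not part of LightningDoc.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path to the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line of `tesseract --version` output."""
    out = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    banner = out.stdout or out.stderr
    return banner.splitlines()[0] if banner else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it before running OCR")
    else:
        print(f"{path}: {tesseract_version(path)}")
```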

Python API

from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
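
For a folder of PDFs, the same API can be wrapped in a small loop. This is a sketch under two assumptions: that `extract_text_from_pdf` returns the Markdown as a string (as the `print(text)` call above suggests), and that one `.md` file per input is the desired layout. The `extract_directory` helper and its `extract` parameter are hypothetical.

```python
from pathlib import Path


def extract_directory(pdf_dir, out_dir, extract=None):
    """Write one .md file per PDF in pdf_dir and return the output paths.

    `extract` is any callable mapping a PDF Path to a Markdown string. It
    defaults to lightningdoc.extract_text_from_pdf, imported lazily so the
    helper can be exercised without the package installed.
    """
    if extract is None:
        from lightningdoc import extract_text_from_pdf as extract

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / f"{pdf.stem}.md"
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Passing a custom `extract` callable also makes it easy to swap in `extract_with_timing` or a stub during testing.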

CLI

# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# With EasyOCR fusion
lightningdoc scanned.pdf --easyocr
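
When the CLI is driven from a Python script rather than a shell, the `*.pdf` glob is no longer expanded automatically. The sketch below builds the argument list explicitly, using only the flags shown above. The helper names are hypothetical, and combining `-o` with `--trocr` or `--easyocr` in one call is an assumption rather than documented behavior.

```python
import subprocess
from pathlib import Path


def build_command(pdfs, output_dir=None, trocr=False, easyocr=False):
    """Build an argv list for the lightningdoc CLI from the flags shown above."""
    cmd = ["lightningdoc", *(str(p) for p in pdfs)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if trocr:
        cmd.append("--trocr")
    if easyocr:
        cmd.append("--easyocr")
    return cmd


def run_batch(pdf_dir, output_dir):
    """Run the CLI on every PDF in pdf_dir; raises CalledProcessError on failure."""
    # A shell would expand *.pdf; without a shell we glob explicitly.
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    return subprocess.run(build_command(pdfs, output_dir), check=True)
```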

Web Viewer

lightningdoc --serve
# Open http://127.0.0.1:5050

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.

🏗 Architecture

PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown (math, tables, headings)
         ├─ Ligature & encoding repair
         ├─ Header/footer stripping
         ├─ Multi-strategy Tesseract OCR (scanned pages)
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Clean Markdown output
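
The flow in the diagram can be sketched with the standard library alone: one pass to classify pages, then a parallel pass to extract them. Everything here is a stand-in. Pages are modeled as plain dicts, the function names are hypothetical, and the OCR branch only emits a placeholder; the real stages rely on PyMuPDF and the OCR engines listed above.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_layout(pages):
    """Stage 1 (stand-in): tag each page as native text or scanned."""
    return [
        {"index": i, "kind": "native" if page.get("text") else "scanned", "page": page}
        for i, page in enumerate(pages)
    ]


def extract_page(layout):
    """Stage 2 (stand-in): pick an extraction route based on the layout."""
    page = layout["page"]
    if layout["kind"] == "native":
        return page["text"].strip()
    # A real pipeline would run OCR here; this placeholder only marks the page.
    return f"[OCR needed for page {layout['index'] + 1}]"


def run_pipeline(pages, max_workers=4):
    """Run stage 1 once, then stage 2 on all pages in parallel."""
    layouts = detect_layout(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so page order is preserved.
        parts = list(pool.map(extract_page, layouts))
    return "\n\n".join(parts)
```

Separating the cheap layout pass from the expensive extraction pass means slow OCR pages do not block native-text pages from being processed by other workers.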

📦 Package Structure

lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection (pure PyMuPDF)
│   ├── stage2_extract.py # Extraction orchestrator
│   ├── math.py           # Math font detection & LaTeX conversion
│   ├── tables.py         # Table extraction (PyMuPDF + heuristic)
│   ├── headers.py        # Header/footer detection & stripping
│   ├── ocr.py            # Multi-strategy OCR
│   └── markdown.py       # Block-to-Markdown conversion
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # OCR text cleanup & fusion
└── models/
    └── trocr.py           # TrOCR handwriting model (lazy-loaded)
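
As an example of what a module like `preprocessing/ligatures.py` might do, the sketch below replaces the Latin ligature characters that PDFs often embed with their plain-letter spellings. The mapping and the `repair_ligatures` name are assumptions for illustration, not the package's actual code.

```python
# Latin ligature code points from the Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}

# Build the translation table once so repeated calls stay cheap.
_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text):
    """Replace Latin ligature characters with their plain-letter spellings."""
    return text.translate(_TABLE)
```

An explicit table is used instead of blanket Unicode normalization so that other characters, such as superscripts in math, are left untouched.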

⚡ Performance

Document Type	Pages/sec	Method
Native-text PDF	80+ pages/sec	Layout parsing
Scanned PDF	~1 page/sec	Tesseract OCR (parallel workers)
Handwritten form	~0.15 pages/sec	TrOCR + Tesseract hybrid

CPU-first — no CUDA required
Apple Silicon MPS acceleration for TrOCR
Parallel OCR workers for scanned pages
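
The throughput figures above translate into rough time estimates for a mixed workload. This is a back-of-the-envelope sketch: it treats "80+" as exactly 80 pages/sec, ignores startup and model-loading time, and the `estimated_seconds` helper is hypothetical.

```python
# Throughput figures copied from the table above (pages per second).
PAGES_PER_SEC = {
    "native": 80.0,
    "scanned": 1.0,
    "handwritten": 0.15,
}


def estimated_seconds(page_counts):
    """Estimate total processing time for a mix of page types."""
    return sum(count / PAGES_PER_SEC[kind] for kind, count in page_counts.items())
```

For instance, `estimated_seconds({"native": 200})` gives 2.5 seconds, and adding 10 scanned pages raises the estimate to 12.5 seconds, which shows how quickly OCR pages dominate the total.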

🔧 Optional Dependencies

Extra	What it adds
`lightningdoc[llm]`	TrOCR handwriting recognition
`lightningdoc[easyocr]`	EasyOCR fusion for scanned pages
`lightningdoc[all]`	Everything
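
To check whether the optional backends are importable in the current environment, something like the following can be used. The module names per extra are assumptions: TrOCR models are commonly loaded through `transformers` with `torch`, and EasyOCR is imported as `easyocr`, but the extras may pull in different packages.

```python
from importlib.util import find_spec

# Assumed import names for each extra; adjust if the package uses others.
EXTRA_MODULES = {
    "llm": ["transformers", "torch"],
    "easyocr": ["easyocr"],
}


def missing_modules(extra, modules=EXTRA_MODULES):
    """Return the modules for `extra` that cannot be imported in this environment."""
    return [name for name in modules[extra] if find_spec(name) is None]
```

An empty list from `missing_modules("easyocr")` suggests the EasyOCR backend should be importable.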

License

Apache 2.0 — see LICENSE for details.

Release history

2.0.0 (this version): Feb 17, 2026
1.0.0: Feb 12, 2026

Download files

Source Distribution

lightningdoc-2.0.0.tar.gz (54.1 kB)

Uploaded Feb 17, 2026 Source

Built Distribution

lightningdoc-2.0.0-py3-none-any.whl (58.6 kB)

Uploaded Feb 17, 2026 Python 3

File details

Details for the file lightningdoc-2.0.0.tar.gz.

File metadata

Download URL: lightningdoc-2.0.0.tar.gz
Upload date: Feb 17, 2026
Size: 54.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for lightningdoc-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`90082fbee234f07f72d758c354db588c0aeaa045aa96d5383c829986fda9175c`
MD5	`5b2c720f25998310b8ca0f1753eb199a`
BLAKE2b-256	`19c5f7ae5ec8a8fb3fe507a397bf6dbe55fd93f207c568ba5fb5b6029c325917`
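
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; the function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """True if the file's SHA-256 matches the published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `verify_download("lightningdoc-2.0.0.tar.gz", "90082fbee234f07f72d758c354db588c0aeaa045aa96d5383c829986fda9175c")`.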

File details

Details for the file lightningdoc-2.0.0-py3-none-any.whl.

File metadata

Download URL: lightningdoc-2.0.0-py3-none-any.whl
Upload date: Feb 17, 2026
Size: 58.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for lightningdoc-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e7a93633a9be8d037fbcdba55bd463ad5c55d0680baf89a6b862f493ae55313`
MD5	`31e9691fa98ef8bc05a2fbd0f693a660`
BLAKE2b-256	`60443be84c77041f23a2527b021710e35580d5e7ffc58f8b8f5b0aed0be619eb`
