
High-performance 2-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, math/table support, and header/footer stripping.


LightningDoc ⚡

High-performance 2-stage PDF → Markdown extraction engine.

LightningDoc extracts clean, structured Markdown from any PDF — native text, scanned documents, handwritten forms, or mixed. It combines layout-aware parsing with multi-strategy OCR into a single pip install.

Requires Python 3.10+. License: Apache 2.0.


✨ Features

  • 2-Stage Pipeline — Layout Detection → Text Extraction
  • ~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
  • Font-based math detection — inline $...$ and display $$...$$ LaTeX equations
  • Table extraction — PyMuPDF line detection + heuristic borderless table fallback
  • Header/footer stripping — pattern + font-size aware margin detection
  • Multi-strategy OCR — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
  • Column-aware reading order — correct left→right ordering for multi-column layouts
  • Zero API keys — everything runs 100% offline
  • Built-in web UI — interactive viewer with bounding-box overlay, extraction dashboard
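To make the column-aware reading order feature concrete, here is a minimal sketch of one common approach: assign each block to a column by its horizontal centre, then read each column top to bottom. The `Block` class, the `reading_order` function, and the fixed `n_columns` parameter are all hypothetical names chosen for illustration; LightningDoc's actual column detection is not documented here and may work differently (for example, it may infer the number of columns and treat full-width blocks separately).

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: bounding box in page coordinates plus text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks: list[Block], page_width: float, n_columns: int = 2) -> list[Block]:
    """Order blocks column by column (left to right), top to bottom inside each column."""
    col_width = page_width / n_columns

    def sort_key(block: Block) -> tuple[int, float]:
        # Assign the block to a column by the horizontal centre of its bbox.
        center_x = (block.x0 + block.x1) / 2
        column = min(int(center_x // col_width), n_columns - 1)
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


blocks = [
    Block(320, 100, 550, 200, "right-top"),
    Block(50, 100, 280, 200, "left-top"),
    Block(320, 300, 550, 400, "right-bottom"),
    Block(50, 300, 280, 400, "left-bottom"),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
print(ordered)
```

A plain top-to-bottom sort would interleave the two columns line by line; sorting on `(column, y0)` keeps each column contiguous. This sketch does not handle titles or tables that span the full page width.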

🚀 Quick Start

Installation

pip install lightningdoc

With TrOCR handwriting support:

pip install "lightningdoc[llm]"

System requirement: Tesseract OCR must be installed separately:

  • macOS: brew install tesseract
  • Ubuntu: sudo apt install tesseract-ocr
  • Windows: installer

Python API

from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
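For converting a whole folder from Python rather than the CLI, a small wrapper like the one below may be useful. It is a sketch, not part of LightningDoc: `convert_folder` is a hypothetical helper, and it takes the extractor as a parameter so it only assumes that `extract_text_from_pdf` accepts a `Path` and returns a string, as the example above suggests.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src: Path, dst: Path, extract: Callable[[Path], str]) -> list[Path]:
    """Write one .md file per PDF found in src and return the written paths."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / f"{pdf.stem}.md"
        out.write_text(extract(pdf), encoding="utf-8")
        written.append(out)
    return written


# Usage, assuming lightningdoc is installed:
#   from lightningdoc import extract_text_from_pdf
#   convert_folder(Path("pdfs"), Path("output"), extract_text_from_pdf)
```

Passing the extractor in also makes it easy to swap in a stub for testing or to wrap the call with retries or logging.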

CLI

# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# With EasyOCR fusion
lightningdoc scanned.pdf --easyocr

Web Viewer

lightningdoc --serve
# Open http://127.0.0.1:5050

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.


🏗 Architecture

PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown (math, tables, headings)
         ├─ Ligature & encoding repair
         ├─ Header/footer stripping
         ├─ Multi-strategy Tesseract OCR (scanned pages)
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Clean Markdown output
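The two-stage flow above can be sketched in a few lines. Everything in this block is a stand-in written for illustration: `detect_layout`, `extract_page`, `run_pipeline`, and `TimedResult` are hypothetical names, pages are plain strings rather than PDF pages, and running Stage 1 sequentially is an assumption. Only the overall shape (layout first, then parallel per-page extraction, with per-stage timing) comes from the diagram.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class TimedResult:
    """Hypothetical result container, not LightningDoc's ExtractionResult."""
    markdown: str
    stage1_ms: float
    stage2_ms: float


def detect_layout(page: str) -> dict:
    """Stage 1 stand-in: a real version would return bboxes and font metadata."""
    return {"text": page, "is_scanned": False}


def extract_page(layout: dict) -> str:
    """Stage 2 stand-in: a real version would choose native text or OCR per page."""
    return layout["text"].strip()


def run_pipeline(pages: list[str], workers: int = 4) -> TimedResult:
    t0 = time.perf_counter()
    layouts = [detect_layout(p) for p in pages]  # Stage 1: cheap, run in order
    t1 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so pages stay in document order.
        parts = list(pool.map(extract_page, layouts))
    t2 = time.perf_counter()
    return TimedResult("\n\n".join(parts), (t1 - t0) * 1000, (t2 - t1) * 1000)


result = run_pipeline(["  page one ", "page two  ", " page three"])
print(result.markdown)
```

Threads suit this sketch because OCR backends typically spend their time in native code or subprocesses; a CPU-bound pure-Python Stage 2 would need a process pool instead.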

📦 Package Structure

lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection (pure PyMuPDF)
│   ├── stage2_extract.py # Extraction orchestrator
│   ├── math.py           # Math font detection & LaTeX conversion
│   ├── tables.py         # Table extraction (PyMuPDF + heuristic)
│   ├── headers.py        # Header/footer detection & stripping
│   ├── ocr.py            # Multi-strategy OCR
│   └── markdown.py       # Block-to-Markdown conversion
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # OCR text cleanup & fusion
└── models/
    └── trocr.py           # TrOCR handwriting model (lazy-loaded)
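As an illustration of what a ligature repair step such as `preprocessing/ligatures.py` might do, the sketch below replaces the Latin ligature code points that PDF text layers often contain with their plain-letter equivalents. The `repair_ligatures` name and the mapping are assumptions for this example, not the module's actual contents.

```python
# Latin ligatures from the Unicode Alphabetic Presentation Forms block.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Expand single-code-point ligatures into ordinary letters."""
    return text.translate(_TABLE)


repaired = repair_ligatures("e\ufb03cient \ufb01le work\ufb02ow")
print(repaired)
```

`unicodedata.normalize("NFKC", text)` would also expand these ligatures, but it rewrites other compatibility characters too (superscript digits, for instance), which could interfere with math extraction. An explicit table keeps the change narrow.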

⚡ Performance

Document Type      Pages/sec   Method
Native-text PDF    80+         Layout parsing
Scanned PDF        ~1          Tesseract OCR (parallel workers)
Handwritten form   ~0.15       TrOCR + Tesseract hybrid

  • CPU-first — no CUDA required
  • Apple Silicon MPS acceleration for TrOCR
  • Parallel OCR workers for scanned pages
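The throughput figures listed above can be turned into a rough time estimate for a mixed workload. This is back-of-the-envelope arithmetic only: `estimate_seconds` is a hypothetical helper, the rates are the approximate numbers from this section, and real timings will vary with hardware and page content.

```python
# Approximate pages per second, taken from the figures in this section.
PAGES_PER_SEC = {"native": 80.0, "scanned": 1.0, "handwritten": 0.15}


def estimate_seconds(page_counts: dict[str, int]) -> float:
    """Sum pages / throughput over each document type."""
    return sum(n / PAGES_PER_SEC[kind] for kind, n in page_counts.items())


# 200 native pages plus 10 scanned pages: 200/80 + 10/1 seconds.
estimate = estimate_seconds({"native": 200, "scanned": 10})
print(f"{estimate:.1f} s")
```

At 80 pages/sec a native page costs about 12.5 ms, in line with the per-stage figures in the architecture diagram; the ~25 ms/page quoted in the feature list is a more conservative number.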

🔧 Optional Dependencies

Extra                   What it adds
lightningdoc[llm]       TrOCR handwriting recognition
lightningdoc[easyocr]   EasyOCR fusion for scanned pages
lightningdoc[all]       Everything
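Before enabling an optional backend, a script can check whether the corresponding package is importable. The sketch below uses the standard library's `importlib.util.find_spec`; the mapping from backend to module name (`transformers` for TrOCR, `easyocr` for EasyOCR) is an assumption about what the extras install, and `available_backends` is a hypothetical helper rather than a LightningDoc function.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return find_spec(name) is not None


# Assumed module names; not confirmed by the LightningDoc documentation.
OPTIONAL = {"trocr": "transformers", "easyocr": "easyocr"}


def available_backends() -> dict[str, bool]:
    """Report which optional OCR backends appear to be installed."""
    return {backend: has_module(module) for backend, module in OPTIONAL.items()}


print(available_backends())
```

Checking with `find_spec` avoids paying the import cost of heavy model libraries just to find out whether they are present.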

License

Apache 2.0 — see LICENSE for details.

