High-performance 2-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, math/table support, and header/footer stripping.
LightningDoc ⚡
High-performance 2-stage PDF → Markdown extraction engine.
LightningDoc extracts clean, structured Markdown from any PDF — native text, scanned documents, handwritten forms, or mixed. It combines layout-aware parsing with multi-strategy OCR into a single pip install.
✨ Features
- 2-Stage Pipeline: Layout Detection → Text Extraction
- ~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
- Font-based math detection: inline $...$ and display $$...$$ LaTeX equations
- Table extraction: PyMuPDF line detection + heuristic borderless table fallback
- Header/footer stripping: pattern + font-size aware margin detection
- Multi-strategy OCR: Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
- Column-aware reading order: correct left→right ordering for multi-column layouts
- Zero API keys: everything runs 100% offline
- Built-in web UI: interactive viewer with bounding-box overlay, extraction dashboard
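To make the column-aware reading order concrete, here is a minimal sketch of one simple heuristic: group blocks into columns by their left edge, then read each column top to bottom. This is not LightningDoc's implementation. The `Block` type, the `reading_order` function, and the `gap` threshold are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: top-left corner plus its text."""
    x0: float
    y0: float
    text: str


def reading_order(blocks: list[Block], gap: float = 50.0) -> list[Block]:
    """Order blocks column by column (left to right), top to bottom within a column.

    A block joins the current column when its left edge lies within `gap`
    points of the left edge of that column's first block.
    """
    columns: list[list[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 - columns[-1][0].x0 <= gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: list[Block] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


# A two-column page whose blocks arrive in arbitrary order.
page = [
    Block(310, 100, "right-top"),
    Block(50, 400, "left-bottom"),
    Block(52, 100, "left-top"),
    Block(312, 400, "right-bottom"),
]
ordered_texts = [b.text for b in reading_order(page)]
```

A plain top-to-bottom sort would interleave the two columns here. Grouping by left edge first keeps each column's text contiguous, although full-width elements such as titles would need extra handling.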
🚀 Quick Start
Installation
pip install lightningdoc
With TrOCR handwriting support:
pip install "lightningdoc[llm]"
System requirement: Tesseract OCR must be installed separately:
- macOS: brew install tesseract
- Ubuntu: sudo apt install tesseract-ocr
- Windows: installer
Python API
from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing
# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)
# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout): {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
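The same API can be wrapped for a whole folder. The sketch below assumes `extract_text_from_pdf` returns the Markdown as a `str`, as the example above suggests. The `extract_folder` helper is a hypothetical name, not part of the package. The extractor is passed in as a callable, so the helper itself has no hard dependency on LightningDoc.

```python
from pathlib import Path
from typing import Callable


def extract_folder(
    src: Path,
    dst: Path,
    extract: Callable[[Path], str],
) -> list[Path]:
    """Run `extract` on every *.pdf in `src` and write one .md file per PDF into `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / (pdf.stem + ".md")
        out.write_text(extract(pdf), encoding="utf-8")
        written.append(out)
    return written


# Usage, assuming lightningdoc is installed:
#   from lightningdoc import extract_text_from_pdf
#   extract_folder(Path("pdfs"), Path("output"), extract_text_from_pdf)
```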
CLI
# Extract a PDF to Markdown
lightningdoc report.pdf
# Batch extract
lightningdoc *.pdf -o ./output
# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr
# With EasyOCR fusion
lightningdoc scanned.pdf --easyocr
Web Viewer
lightningdoc --serve
# Open http://127.0.0.1:5050
Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.
🏗 Architecture
PDF ──→ Stage 1: Layout Detection (PyMuPDF, ~2ms/page)
├─ Page structure & bboxes
├─ Font metadata & columns
└─ Image positions & reading order
──→ Stage 2: Text Extraction (parallel, ~10ms/page)
├─ Native text → Markdown (math, tables, headings)
├─ Ligature & encoding repair
├─ Header/footer stripping
├─ Multi-strategy Tesseract OCR (scanned pages)
├─ TrOCR handwriting (optional)
├─ EasyOCR fusion (optional)
└─ Embedded image OCR (concurrent)
──→ Clean Markdown output
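As an illustration of the header/footer stripping step in Stage 2, the sketch below covers only the pattern half of the idea: a first or last line that recurs on most pages, once digits are collapsed, is treated as a running header or footer. It is not the package's code. `strip_repeated_margins` and `min_share` are hypothetical names, and the font-size aware part is not modelled.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Collapse digits so 'Page 3' and 'Page 12' count as the same pattern."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_margins(
    pages: list[list[str]], min_share: float = 0.6
) -> list[list[str]]:
    """Drop first/last lines whose normalized form recurs on at least `min_share` of pages."""
    counts: Counter = Counter()
    for lines in pages:
        if lines:
            # A set keeps a one-line page from counting its line twice.
            counts.update({_normalize(lines[0]), _normalize(lines[-1])})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if key and n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _normalize(body[0]) in repeated:
            body.pop(0)
        if body and _normalize(body[-1]) in repeated:
            body.pop()
        cleaned.append(body)
    return cleaned


pages = [
    ["ACME Annual Report", "Revenue grew.", "Page 1"],
    ["ACME Annual Report", "Costs fell.", "Page 2"],
    ["ACME Annual Report", "Outlook is stable.", "Page 3"],
]
cleaned = strip_repeated_margins(pages)
```

In this sample the title line and the "Page N" line are removed from every page, leaving only the body sentences.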
📦 Package Structure
lightningdoc/
├── types.py # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py # Pipeline coordinator
├── cli.py # CLI entry point
├── server.py # Flask web viewer
├── pipeline/
│ ├── stage1_layout.py # Layout detection (pure PyMuPDF)
│ ├── stage2_extract.py # Extraction orchestrator
│ ├── math.py # Math font detection & LaTeX conversion
│ ├── tables.py # Table extraction (PyMuPDF + heuristic)
│ ├── headers.py # Header/footer detection & stripping
│ ├── ocr.py # Multi-strategy OCR
│ └── markdown.py # Block-to-Markdown conversion
├── preprocessing/
│ ├── ligatures.py # Unicode ligature repair
│ └── ocr_cleanup.py # OCR text cleanup & fusion
└── models/
└── trocr.py # TrOCR handwriting model (lazy-loaded)
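For a sense of what a module like `preprocessing/ligatures.py` might do, here is a minimal sketch of Unicode ligature repair. It maps the Latin presentation-form ligatures (U+FB00 to U+FB06) back to plain letters. The `LIGATURES` table and `repair_ligatures` function are illustrative names, and the real module may cover more cases.

```python
# Latin presentation-form ligatures and their plain spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Replace single-codepoint ligatures with their multi-letter equivalents."""
    return text.translate(_TABLE)


sample = "The e\ufb03cient work\ufb02ow was \ufb01nal."
repaired = repair_ligatures(sample)  # "The efficient workflow was final."
```

An explicit table leaves all other characters untouched, whereas a blanket Unicode compatibility normalization would also rewrite superscripts and similar symbols.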
⚡ Performance
| Document Type | Pages/sec | Method |
|---|---|---|
| Native-text PDF | 80+ pages/sec | Layout parsing |
| Scanned PDF | ~1 page/sec | Tesseract OCR (parallel workers) |
| Handwritten form | ~0.15 pages/sec | TrOCR + Tesseract hybrid |
- CPU-first — no CUDA required
- Apple Silicon MPS acceleration for TrOCR
- Parallel OCR workers for scanned pages
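The throughput figures above translate into a rough time estimate for a mixed document. The sketch below is plain arithmetic over the table's numbers, taking "80+" as 80, so the native-text part is an upper bound. `estimate_seconds` and the category keys are hypothetical names, and real timings will vary with hardware and content.

```python
# Approximate throughput from the table above, in pages per second.
PAGES_PER_SEC = {"native": 80.0, "scanned": 1.0, "handwritten": 0.15}


def estimate_seconds(page_counts: dict[str, int]) -> float:
    """Sum pages / throughput over each document type."""
    return sum(n / PAGES_PER_SEC[kind] for kind, n in page_counts.items())


# 200 native pages (2.5 s) + 30 scanned pages (30 s) + 3 handwritten pages (20 s)
estimate = estimate_seconds({"native": 200, "scanned": 30, "handwritten": 3})
```

With these inputs the estimate comes to roughly 52.5 seconds, and the OCR pages account for almost all of it.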
🔧 Optional Dependencies
| Extra | What it adds |
|---|---|
| lightningdoc[llm] | TrOCR handwriting recognition |
| lightningdoc[easyocr] | EasyOCR fusion for scanned pages |
| lightningdoc[all] | Everything |
License
Apache 2.0 — see LICENSE for details.
File details
Details for the file lightningdoc-2.0.0.tar.gz.
File metadata
- Download URL: lightningdoc-2.0.0.tar.gz
- Upload date:
- Size: 54.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 90082fbee234f07f72d758c354db588c0aeaa045aa96d5383c829986fda9175c |
| MD5 | 5b2c720f25998310b8ca0f1753eb199a |
| BLAKE2b-256 | 19c5f7ae5ec8a8fb3fe507a397bf6dbe55fd93f207c568ba5fb5b6029c325917 |
File details
Details for the file lightningdoc-2.0.0-py3-none-any.whl.
File metadata
- Download URL: lightningdoc-2.0.0-py3-none-any.whl
- Upload date:
- Size: 58.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e7a93633a9be8d037fbcdba55bd463ad5c55d0680baf89a6b862f493ae55313 |
| MD5 | 31e9691fa98ef8bc05a2fbd0f693a660 |
| BLAKE2b-256 | 60443be84c77041f23a2527b021710e35580d5e7ffc58f8b8f5b0aed0be619eb |
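A downloaded file can be checked against the digests listed above. The sketch below uses only the standard library `hashlib`. The helper names `sha256_of` and `matches` are illustrative, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for lightningdoc-2.0.0.tar.gz.
EXPECTED_SDIST_SHA256 = "90082fbee234f07f72d758c354db588c0aeaa045aa96d5383c829986fda9175c"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.lower()


# Usage, assuming the sdist was saved to the current directory:
#   matches(Path("lightningdoc-2.0.0.tar.gz"), EXPECTED_SDIST_SHA256)
```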