High-performance 3-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, and NLP post-processing.
LightningDoc ⚡
High-performance 3-stage PDF → Markdown extraction engine.
LightningDoc extracts clean, structured Markdown from any PDF — native text, scanned documents, handwritten forms, or mixed. It combines layout-aware parsing, multi-strategy OCR, and optional AI post-processing into a single pip install.
✨ Features
- 3-Stage Pipeline — Layout Detection → Text Extraction → NLP Post-Processing
- ~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
- Multi-strategy OCR — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
- GLM-OCR Vision Judge — 0.9B param multimodal model re-reads scanned pages for higher accuracy
- SmolLM2 NLP — 360M param local LLM for OCR correction and document classification
- Zero API keys — everything runs 100% offline after first model download
- Apple Silicon optimised — MPS acceleration for all neural models
- Built-in web UI — interactive viewer with bounding-box overlay, upload, extraction dashboard
🚀 Quick Start
Installation
pip install lightningdoc
With AI models (OCR correction, document classification, GLM-OCR):
pip install lightningdoc[llm]
System requirement: Tesseract OCR must be installed separately:
- macOS: brew install tesseract
- Ubuntu: sudo apt install tesseract-ocr
- Windows: installer
Python API
from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing
# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)
# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout): {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
print(f"Stage 3 (NLP): {result.stage3_ms:.0f}ms")
print(f"Document type: {result.doc_type}")
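For a folder of PDFs, the same call can be wrapped in a small loop. This is a minimal sketch, assuming `extract_text_from_pdf` returns the Markdown as a string, as the example above suggests. `convert_folder` and its `extract` parameter are hypothetical names introduced here for illustration and are not part of LightningDoc.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_folder(
    src: Path,
    out: Path,
    extract: Optional[Callable[[Path], str]] = None,
) -> list[Path]:
    """Write one .md file per PDF found in `src`; return the paths written."""
    if extract is None:
        # Imported lazily so the helper can also run with a stand-in extractor.
        from lightningdoc import extract_text_from_pdf as extract

    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        target = out / f"{pdf.stem}.md"
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder(Path("./pdfs"), Path("./output"))` would then use the real extractor, while passing `extract=` lets you try the loop without any models installed.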
CLI
# Extract a PDF to Markdown
lightningdoc report.pdf
# Batch extract
lightningdoc *.pdf -o ./output
# With GLM-OCR vision judge (for scanned docs)
lightningdoc scanned.pdf --glm-ocr
# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr
# Skip AI (rules-only, fastest)
lightningdoc report.pdf --no-llm
Web Viewer
lightningdoc --serve
# Open http://127.0.0.1:5050
Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.
🏗 Architecture
PDF ──→ Stage 1: Layout Detection (PyMuPDF, ~2ms/page)
├─ Page structure & bboxes
├─ Font metadata & columns
└─ Image positions & reading order
──→ Stage 2: Text Extraction (parallel, ~10ms/page)
├─ Native text → Markdown
├─ Ligature & encoding repair
├─ Multi-strategy Tesseract OCR
├─ TrOCR handwriting (optional)
├─ EasyOCR fusion (optional)
└─ Embedded image OCR (concurrent)
──→ Stage 3: NLP Post-Processing (rules + AI)
├─ Rule-based OCR corrections
├─ GLM-OCR vision judge (optional)
├─ SmolLM2 field extraction (fallback)
└─ Document classification
──→ Clean Markdown output
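The flow above can be pictured as three functions chained together, each one timed separately. The sketch below is an illustration only and not LightningDoc's orchestrator: `TimedResult`, `run_pipeline`, and the three stage callables are hypothetical names, and the `stage1_ms` / `stage2_ms` / `stage3_ms` fields simply mirror the ones printed in the Python API example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TimedResult:
    markdown: str
    stage1_ms: float
    stage2_ms: float
    stage3_ms: float

    @property
    def total_ms(self) -> float:
        return self.stage1_ms + self.stage2_ms + self.stage3_ms


def _timed(fn: Callable[[Any], Any], arg: Any) -> tuple[Any, float]:
    """Call fn(arg) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, (time.perf_counter() - start) * 1000.0


def run_pipeline(
    pdf_bytes: bytes,
    detect_layout: Callable[[Any], Any],
    extract_text: Callable[[Any], Any],
    post_process: Callable[[Any], str],
) -> TimedResult:
    layout, t1 = _timed(detect_layout, pdf_bytes)  # Stage 1: layout detection
    raw, t2 = _timed(extract_text, layout)         # Stage 2: text extraction
    markdown, t3 = _timed(post_process, raw)       # Stage 3: post-processing
    return TimedResult(markdown, t1, t2, t3)
```

Each stage only sees the output of the previous one, which is what makes it possible to report a per-stage timing breakdown.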
📦 Package Structure
lightningdoc/
├── types.py # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py # Pipeline coordinator
├── cli.py # CLI entry point
├── server.py # Flask web viewer
├── pipeline/
│ ├── stage1_layout.py # Layout detection
│ ├── stage2_extract.py # Text extraction + OCR
│ └── stage3_nlp.py # NLP post-processing
├── preprocessing/
│ ├── ligatures.py # Unicode ligature repair
│ └── ocr_cleanup.py # Numeric fix, medical forms
├── models/
│ ├── trocr.py # TrOCR handwriting model
│ └── glm_ocr.py # GLM-OCR vision model
└── llm/
├── engine.py # SmolLM2-360M inference
├── correction.py # OCR post-correction
└── classifier.py # Document classification
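To give a feel for what a step like Unicode ligature repair involves, here is a minimal sketch. It is an assumption about the general technique, not the contents of `ligatures.py`: `LIGATURES` and `repair_ligatures` are hypothetical names, and the mapping covers only the Latin presentation-form ligatures U+FB00 to U+FB06.

```python
# Latin presentation-form ligatures (U+FB00..U+FB06) and their plain spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Replace single-codepoint ligatures with their multi-letter spellings."""
    return text.translate(_TABLE)


print(repair_ligatures("e\ufb03cient \ufb01le handling"))  # efficient file handling
```

An explicit table is used here rather than full NFKC normalization, which would also rewrite characters such as superscripts that a Markdown extractor may want to keep.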
⚡ Performance
| Document Type | Pages/sec | Method |
|---|---|---|
| Native-text PDF | 80+ pages/sec | Layout parsing |
| Scanned PDF | ~1 page/sec | Tesseract OCR (parallel workers) |
| Handwritten form | ~0.15 pages/sec | TrOCR + Tesseract hybrid |
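To turn these rates into wall-clock estimates, divide the page count by the throughput. The sketch below uses only the figures from the table; `RATES`, `seconds_for`, and `ms_per_page` are hypothetical helper names, and the results are rough estimates rather than measurements.

```python
# Throughput figures copied from the table above (pages per second).
RATES = {"native": 80.0, "scanned": 1.0, "handwritten": 0.15}


def seconds_for(pages: int, pages_per_sec: float) -> float:
    """Estimated wall-clock seconds to process `pages` at a given rate."""
    return pages / pages_per_sec


def ms_per_page(pages_per_sec: float) -> float:
    """Convert a throughput into a per-page latency in milliseconds."""
    return 1000.0 / pages_per_sec


for kind, rate in RATES.items():
    print(f"{kind}: 200 pages in about {seconds_for(200, rate):.0f} s "
          f"({ms_per_page(rate):.1f} ms/page)")
```

For a 200-page document this gives about 2.5 s for native text, 200 s for scanned pages, and roughly 22 minutes for handwritten forms. Note that 80 pages/sec works out to 12.5 ms/page, half the ~25 ms/page quoted in the feature list; the source does not say what accounts for the difference, so treat both as approximate.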
- CPU-first — no CUDA required
- Apple Silicon MPS acceleration for neural models
- Parallel OCR workers for scanned pages
- Background model preloading (overlaps Stage 1+2)
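Background preloading, the last item above, can be illustrated with a thread pool: the model load is submitted first, the fast stages run in the meantime, and the result is awaited only when it is needed. This is a sketch of the general pattern under stated assumptions, not LightningDoc's code; `load_model`, `fast_stages`, and `extract` are hypothetical stand-ins that use `time.sleep` to simulate work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_model():
    """Stand-in for a slow model load (hypothetical)."""
    time.sleep(0.2)
    return lambda text: text.strip()


def fast_stages(pdf_name: str) -> str:
    """Stand-in for Stage 1 + Stage 2 (hypothetical)."""
    time.sleep(0.2)
    return f"  raw text of {pdf_name}  "


def extract(pdf_name: str) -> str:
    with ThreadPoolExecutor(max_workers=1) as pool:
        model_future = pool.submit(load_model)  # starts loading immediately
        raw = fast_stages(pdf_name)             # runs while the model loads
        model = model_future.result()           # blocks only if still loading
    return model(raw)


print(extract("report.pdf"))  # raw text of report.pdf
```

With this arrangement the total time is close to the longer of the two tasks rather than their sum, provided the load spends its time in I/O or native code that releases the GIL.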
🔧 Optional Dependencies
| Extra | What it adds |
|---|---|
| lightningdoc[llm] | SmolLM2 OCR correction + document classification + GLM-OCR vision judge |
| lightningdoc[easyocr] | EasyOCR fusion for scanned pages |
| lightningdoc[all] | Everything |
License
Apache 2.0 — see LICENSE for details.
File details
Details for the file lightningdoc-1.0.0.tar.gz.
File metadata
- Download URL: lightningdoc-1.0.0.tar.gz
- Upload date:
- Size: 53.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 545884dd573d0d7947571010a20691cf13cee80b7c87c62469c8e8838bb91e02 |
| MD5 | cb1ecf9cec6146ec46dc973aa55a2c4c |
| BLAKE2b-256 | 464be4a3040091320838d6b7af006686712e3e79222f0a291c34696eedc983e6 |
File details
Details for the file lightningdoc-1.0.0-py3-none-any.whl.
File metadata
- Download URL: lightningdoc-1.0.0-py3-none-any.whl
- Upload date:
- Size: 59.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25d21aecba96ae08c5ad336b4624c38a7d7fcec9373273191276eb11ee92b403 |
| MD5 | 8af279ec7e5d2d01c0d616205527d3c2 |
| BLAKE2b-256 | 30eec745348e1487d823b36c171f4a11a64f00c3508ff2d3ad21eb318aacb020 |