High-performance 3-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, and NLP post-processing.

LightningDoc ⚡

High-performance 3-stage PDF → Markdown extraction engine.

LightningDoc extracts clean, structured Markdown from any PDF: native text, scanned documents, handwritten forms, or mixed content. It combines layout-aware parsing, multi-strategy OCR, and optional AI post-processing in a single pip-installable package.

✨ Features

3-Stage Pipeline: Layout Detection → Text Extraction → NLP Post-Processing
~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
Multi-strategy OCR: Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
GLM-OCR Vision Judge: a 0.9B-parameter multimodal model re-reads scanned pages for higher accuracy
SmolLM2 NLP: a 360M-parameter local LLM for OCR correction and document classification
Zero API keys: everything runs fully offline after the first model download
Apple Silicon optimised: MPS acceleration for all neural models
Built-in web UI: interactive viewer with bounding-box overlay, upload, and extraction dashboard
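
Otsu's method, one half of the CLAHE+OTSU strategy listed above, picks a global threshold that best separates dark ink from light paper before OCR. The following is a minimal pure-Python sketch of the standard algorithm, for illustration only; it is not LightningDoc's own code, and `otsu_threshold` and `binarize` are hypothetical names.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    `pixels` is a flat sequence of 8-bit grey values. Values <= t form one
    class, values > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0      # number of pixels at or below the candidate threshold
    sum_low = 0    # sum of their grey values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break             # upper class is empty from here on
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance (up to a constant factor of 1/total^2).
        var = w_low * w_high * (mean_low - mean_high) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (at or below threshold) or 255 (above it)."""
    return [255 if p > threshold else 0 for p in pixels]
```

In practice an image library would do this on whole arrays; the loop form is kept here only to make the variance criterion visible.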

🚀 Quick Start

Installation

pip install lightningdoc

With AI models (OCR correction, document classification, GLM-OCR):

pip install "lightningdoc[llm]"

System requirement: Tesseract OCR must be installed separately:

macOS: brew install tesseract

Ubuntu: sudo apt install tesseract-ocr

Windows: installer

Python API

from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
print(f"Stage 3 (NLP):     {result.stage3_ms:.0f}ms")
print(f"Document type:     {result.doc_type}")
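
For a whole folder of PDFs, the single-file call above can be wrapped in a small loop. This is a sketch rather than part of the package: `batch_extract` is a hypothetical helper, and it assumes `extract_text_from_pdf` returns the Markdown as a `str`, as the example above suggests. The extractor is passed in as an argument so the helper can be exercised without the library installed.

```python
from pathlib import Path
from typing import Callable


def batch_extract(
    pdf_dir: Path,
    out_dir: Path,
    extract: Callable[[Path], str],
) -> list[Path]:
    """Run `extract` on every *.pdf in pdf_dir and write one .md file each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".md")
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written


# Usage, assuming lightningdoc is installed:
#   from lightningdoc import extract_text_from_pdf
#   batch_extract(Path("./pdfs"), Path("./output"), extract_text_from_pdf)
```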

CLI

# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With GLM-OCR vision judge (for scanned docs)
lightningdoc scanned.pdf --glm-ocr

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# Skip AI (rules-only, fastest)
lightningdoc report.pdf --no-llm
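
When the CLI is driven from a script, building the argument list in one place keeps the flags consistent. The sketch below uses only the options shown above (`-o`, `--glm-ocr`, `--trocr`, `--no-llm`); `build_command` and `run` are hypothetical helpers, and whether the flags can be combined freely is an assumption not confirmed here.

```python
import subprocess


def build_command(pdfs, output_dir=None, glm_ocr=False, trocr=False, no_llm=False):
    """Assemble a lightningdoc command line as a list of arguments."""
    cmd = ["lightningdoc", *(str(p) for p in pdfs)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if glm_ocr:
        cmd.append("--glm-ocr")   # GLM-OCR vision judge for scanned docs
    if trocr:
        cmd.append("--trocr")     # TrOCR handwriting recognition
    if no_llm:
        cmd.append("--no-llm")    # rules-only, skips AI post-processing
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


# Usage, assuming the lightningdoc CLI is on PATH:
#   run(build_command(["scanned.pdf"], glm_ocr=True))
```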

Web Viewer

lightningdoc --serve
# Open http://127.0.0.1:5050

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.

🏗 Architecture

PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown
         ├─ Ligature & encoding repair
         ├─ Multi-strategy Tesseract OCR
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Stage 3: NLP Post-Processing  (rules + AI)
         ├─ Rule-based OCR corrections
         ├─ GLM-OCR vision judge (optional)
         ├─ SmolLM2 field extraction (fallback)
         └─ Document classification

     ──→ Clean Markdown output
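
The staged flow in the diagram, together with per-stage timings like those reported by `extract_with_timing`, can be pictured as a chain of functions with a timer around each. This is a generic sketch of that pattern, not the code in `orchestrator.py`; `TimedResult`, `run_pipeline`, and the stand-in stage functions are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TimedResult:
    output: object = None
    stage_ms: dict = field(default_factory=dict)

    @property
    def total_ms(self):
        return sum(self.stage_ms.values())


def run_pipeline(data, stages):
    """Feed `data` through each (name, function) pair, timing every stage."""
    result = TimedResult(output=data)
    for name, fn in stages:
        start = time.perf_counter()
        result.output = fn(result.output)
        result.stage_ms[name] = (time.perf_counter() - start) * 1000.0
    return result


# Stand-in stages operating on a list of page strings.
def detect_layout(pages):
    return [{"page": i + 1, "text": t} for i, t in enumerate(pages)]


def extract_text(layouts):
    return "\n\n".join(block["text"] for block in layouts)


def post_process(markdown):
    return markdown.strip()


STAGES = [("layout", detect_layout), ("extract", extract_text), ("nlp", post_process)]
```

Calling `run_pipeline(["# Title ", "Body"], STAGES)` returns the final text in `output` and one timing entry per stage in `stage_ms`.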

📦 Package Structure

lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection
│   ├── stage2_extract.py # Text extraction + OCR
│   └── stage3_nlp.py     # NLP post-processing
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # Numeric fix, medical forms
├── models/
│   ├── trocr.py          # TrOCR handwriting model
│   └── glm_ocr.py        # GLM-OCR vision model
└── llm/
    ├── engine.py         # SmolLM2-360M inference
    ├── correction.py     # OCR post-correction
    └── classifier.py     # Document classification
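
As an illustration of what a ligature-repair step such as `preprocessing/ligatures.py` might do, the sketch below expands the Unicode Latin ligature code points (U+FB00 to U+FB06) that PDF text layers often contain. The table and the name `repair_ligatures` are assumptions for this example; the module's actual contents are not shown in this document.

```python
# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block,
# mapped to the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",   # long s + t
    "\ufb06": "st",
}

_TABLE = str.maketrans(LIGATURES)


def repair_ligatures(text: str) -> str:
    """Replace single-code-point ligatures with their component letters."""
    return text.translate(_TABLE)
```

An explicit table is used instead of `unicodedata.normalize("NFKC", ...)` because NFKC also rewrites many unrelated characters, such as superscripts and full-width forms.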

⚡ Performance

Document type	Throughput	Method
Native-text PDF	80+ pages/sec	Layout parsing
Scanned PDF	~1 page/sec	Tesseract OCR (parallel workers)
Handwritten form	~0.15 pages/sec	TrOCR + Tesseract hybrid

CPU-first — no CUDA required
Apple Silicon MPS acceleration for neural models
Parallel OCR workers for scanned pages
Background model preloading (overlaps Stage 1+2)
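
The throughput figures in this document are quoted in two units, ms/page and pages/sec. The short conversion below relates them; it only restates the numbers already given and makes no new measurements. Note that 80 pages/sec corresponds to 12.5 ms/page, close to the ~2 ms + ~10 ms shown for Stages 1 and 2, while the ~25 ms/page quoted under Features corresponds to 40 pages/sec. The figures may have been measured under different conditions.

```python
def pages_per_second(ms_per_page: float) -> float:
    """Convert a per-page latency in milliseconds to throughput."""
    return 1000.0 / ms_per_page


def ms_per_page(pages_per_sec: float) -> float:
    """Convert throughput in pages per second to per-page latency."""
    return 1000.0 / pages_per_sec


def seconds_for(pages: int, ms_each: float) -> float:
    """Total wall-clock seconds for `pages` pages at `ms_each` ms per page."""
    return pages * ms_each / 1000.0


print(ms_per_page(80))        # 12.5  (native-text row of the table)
print(pages_per_second(25))   # 40.0  (the ~25 ms/page figure)
print(seconds_for(200, 25))   # 5.0   (200 pages at ~25 ms/page)
```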

🔧 Optional Dependencies

Extra	What it adds
`lightningdoc[llm]`	SmolLM2 OCR correction + document classification + GLM-OCR vision judge
`lightningdoc[easyocr]`	EasyOCR fusion for scanned pages
`lightningdoc[all]`	Everything
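
A script that relies on an optional extra can check up front whether the underlying modules are importable. The sketch below uses the standard `importlib.util.find_spec`; the module names in the usage comment (`easyocr`, `torch`, `transformers`) are assumptions about what the extras install, not something this document states, and `has_module` and `missing_modules` are hypothetical helpers.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def missing_modules(names):
    """Return the subset of `names` that is not importable, in order."""
    return [n for n in names if not has_module(n)]


# Usage (module names are assumptions, adjust to the actual dependencies):
#   missing = missing_modules(["easyocr", "torch", "transformers"])
#   if missing:
#       print("Install the matching extras first; missing:", ", ".join(missing))
```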

License

Apache 2.0 — see LICENSE for details.

Release history

2.0.0

Feb 17, 2026

This version

1.0.0

Feb 12, 2026
