High-performance 3-stage PDF to Markdown extraction engine with layout detection, multi-strategy OCR, and NLP post-processing.

LightningDoc ⚡

High-performance 3-stage PDF → Markdown extraction engine.

LightningDoc extracts clean, structured Markdown from any PDF — native text, scanned documents, handwritten forms, or mixed. It combines layout-aware parsing, multi-strategy OCR, and optional AI post-processing into a single pip install.

Python 3.10+ · License: Apache 2.0


✨ Features

  • 3-Stage Pipeline — Layout Detection → Text Extraction → NLP Post-Processing
  • ~25 ms/page for native-text PDFs (200 pages in < 5 seconds)
  • Multi-strategy OCR — Tesseract (CLAHE+OTSU, contrast+sharpen), TrOCR handwriting, EasyOCR fusion
  • GLM-OCR Vision Judge — 0.9B param multimodal model re-reads scanned pages for higher accuracy
  • SmolLM2 NLP — 360M param local LLM for OCR correction and document classification
  • Zero API keys — everything runs 100% offline after first model download
  • Apple Silicon optimised — MPS acceleration for all neural models
  • Built-in web UI — interactive viewer with bounding-box overlay, upload, extraction dashboard
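
As background on the "CLAHE+OTSU" strategy named in the feature list: Otsu's method picks a global binarisation threshold by maximising the between-class variance of the grey-level histogram. The sketch below is a pure-Python illustration of that idea only. It is not LightningDoc's code, the function names `otsu_threshold` and `binarize` are made up for this example, and the CLAHE contrast step is omitted.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximises between-class variance.

    `pixels` is a flat sequence of integers in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue          # no background pixels yet
        if w_fg == 0:
            break             # everything is background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1/total**2.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

In practice this step is usually done with an image library rather than by hand; the point here is only to show what the threshold selection computes.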

🚀 Quick Start

Installation

pip install lightningdoc

With AI models (OCR correction, document classification, GLM-OCR):

pip install "lightningdoc[llm]"

System requirement: Tesseract OCR must be installed separately:

  • macOS: brew install tesseract
  • Ubuntu: sudo apt install tesseract-ocr
  • Windows: run a Tesseract installer for Windows
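
Because Tesseract is a separate system dependency, it can help to confirm that it is on the PATH before running an extraction. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and not part of LightningDoc.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the `tesseract` executable, or None if absent."""
    return shutil.which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    proc = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    text = proc.stdout or proc.stderr
    return text.splitlines()[0] if text else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("Tesseract not found on PATH; install it before running OCR.")
    else:
        print(f"Found {exe}: {tesseract_version(exe)}")
```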

Python API

from pathlib import Path
from lightningdoc import extract_text_from_pdf, extract_with_timing

# Simple text extraction
text = extract_text_from_pdf(Path("report.pdf"))
print(text)

# With full timing breakdown
result = extract_with_timing(Path("report.pdf"))
print(f"{result.word_count} words in {result.total_ms:.0f}ms")
print(f"Stage 1 (layout):  {result.stage1_ms:.0f}ms")
print(f"Stage 2 (extract): {result.stage2_ms:.0f}ms")
print(f"Stage 3 (NLP):     {result.stage3_ms:.0f}ms")
print(f"Document type:     {result.doc_type}")
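
For converting a whole folder from Python, a small wrapper along these lines may be useful. It is a sketch, not part of the package: `convert_folder` is a hypothetical helper, and it assumes `extract_text_from_pdf` returns the Markdown as a `str`, as the example above suggests. The extractor is passed in as an argument so the helper can be tried without LightningDoc installed.

```python
from pathlib import Path
from typing import Callable


def convert_folder(
    src: Path, dst: Path, extract: Callable[[Path], str]
) -> list[Path]:
    """Write one .md file into `dst` for every *.pdf found directly in `src`.

    `extract` is any function mapping a PDF path to Markdown text, for
    example lightningdoc.extract_text_from_pdf (assumed to return str).
    """
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        out = dst / (pdf.stem + ".md")
        out.write_text(extract(pdf), encoding="utf-8")
        written.append(out)
    return written


# Example usage (requires lightningdoc to be installed):
#   from lightningdoc import extract_text_from_pdf
#   convert_folder(Path("pdfs"), Path("output"), extract_text_from_pdf)
```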

CLI

# Extract a PDF to Markdown
lightningdoc report.pdf

# Batch extract
lightningdoc *.pdf -o ./output

# With GLM-OCR vision judge (for scanned docs)
lightningdoc scanned.pdf --glm-ocr

# With TrOCR handwriting recognition
lightningdoc form.pdf --trocr

# Skip AI (rules-only, fastest)
lightningdoc report.pdf --no-llm
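
The same commands can be driven from a Python script. The sketch below only assembles an argument list from the flags shown above (`-o`, `--glm-ocr`, `--trocr`, `--no-llm`); `build_command` and `run` are hypothetical helpers, and whether the flags may be combined freely is an assumption that this page does not confirm.

```python
import subprocess


def build_command(pdfs, output_dir=None, glm_ocr=False, trocr=False, no_llm=False):
    """Assemble a `lightningdoc` command line as a list of arguments."""
    cmd = ["lightningdoc", *[str(p) for p in pdfs]]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if glm_ocr:
        cmd.append("--glm-ocr")   # vision judge for scanned documents
    if trocr:
        cmd.append("--trocr")     # handwriting recognition
    if no_llm:
        cmd.append("--no-llm")    # rules-only, fastest
    return cmd


def run(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)
```

Note that `*.pdf` in the shell example is expanded by the shell, not by the tool. When no shell is involved, expand it yourself, for instance with `sorted(Path(".").glob("*.pdf"))`.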

Web Viewer

lightningdoc --serve
# Open http://127.0.0.1:5050

Upload PDFs, view page images with bounding-box overlays, extract with one click, and see per-stage timing breakdowns.
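
When the viewer is started from a script, it may take a moment before it accepts connections. A minimal standard-library sketch for waiting until the port is open follows; `wait_for_port` is an illustrative helper, and the defaults simply mirror the address shown above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5050, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example usage (requires lightningdoc to be installed):
#   import subprocess
#   server = subprocess.Popen(["lightningdoc", "--serve"])
#   if wait_for_port():
#       print("Viewer is up at http://127.0.0.1:5050")
```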


🏗 Architecture

PDF ──→ Stage 1: Layout Detection     (PyMuPDF, ~2ms/page)
         ├─ Page structure & bboxes
         ├─ Font metadata & columns
         └─ Image positions & reading order

     ──→ Stage 2: Text Extraction      (parallel, ~10ms/page)
         ├─ Native text → Markdown
         ├─ Ligature & encoding repair
         ├─ Multi-strategy Tesseract OCR
         ├─ TrOCR handwriting (optional)
         ├─ EasyOCR fusion (optional)
         └─ Embedded image OCR (concurrent)

     ──→ Stage 3: NLP Post-Processing  (rules + AI)
         ├─ Rule-based OCR corrections
         ├─ GLM-OCR vision judge (optional)
         ├─ SmolLM2 field extraction (fallback)
         └─ Document classification

     ──→ Clean Markdown output
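
The staged flow above can be pictured as a chain of functions, each timed separately. The toy sketch below is not the actual orchestrator: all function names are hypothetical, the "pages" are plain strings, and ligature repair is approximated with Unicode NFKC normalisation, which also rewrites other compatibility characters and is therefore broader than a targeted ligature table.

```python
import time
import unicodedata
from dataclasses import dataclass, field


@dataclass
class StageTimings:
    stage_ms: dict = field(default_factory=dict)

    @property
    def total_ms(self):
        return sum(self.stage_ms.values())


def detect_layout(pages):
    # Stage 1 stand-in: attach a page number to every page's raw text.
    return [{"page": i + 1, "text": text} for i, text in enumerate(pages)]


def extract_text(blocks):
    # Stage 2 stand-in: NFKC turns ligatures such as U+FB01 into "fi".
    return [unicodedata.normalize("NFKC", b["text"]) for b in blocks]


def postprocess(paragraphs):
    # Stage 3 stand-in: one rule-based cleanup (collapse runs of whitespace).
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


STAGES = [
    ("stage1_layout", detect_layout),
    ("stage2_extract", extract_text),
    ("stage3_nlp", postprocess),
]


def run_pipeline(document, stages=STAGES):
    """Feed `document` through each stage in order, recording ms per stage."""
    timings = StageTimings()
    value = document
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        timings.stage_ms[name] = (time.perf_counter() - start) * 1000.0
    return value, timings
```

Calling `run_pipeline(["\ufb01rst  page", "second"])` returns the cleaned text together with a per-stage timing record, which mirrors the shape of the timing breakdown exposed by `extract_with_timing`.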

📦 Package Structure

lightningdoc/
├── types.py              # TextSpan, TextBlock, PageLayout, ExtractionResult
├── orchestrator.py       # Pipeline coordinator
├── cli.py                # CLI entry point
├── server.py             # Flask web viewer
├── pipeline/
│   ├── stage1_layout.py  # Layout detection
│   ├── stage2_extract.py # Text extraction + OCR
│   └── stage3_nlp.py     # NLP post-processing
├── preprocessing/
│   ├── ligatures.py      # Unicode ligature repair
│   └── ocr_cleanup.py    # Numeric fix, medical forms
├── models/
│   ├── trocr.py          # TrOCR handwriting model
│   └── glm_ocr.py        # GLM-OCR vision model
└── llm/
    ├── engine.py          # SmolLM2-360M inference
    ├── correction.py      # OCR post-correction
    └── classifier.py      # Document classification
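
To give a feel for the data types listed for types.py, here is a hedged sketch of what such containers could look like. Only the attribute names `word_count`, `total_ms`, `stage1_ms`, `stage2_ms`, `stage3_ms`, and `doc_type` come from the API example earlier on this page; every other field (`bbox`, `font`, `size`, `spans`, `number`, `width`, `height`, `blocks`, `markdown`) is an assumption made for illustration, and the real definitions may differ.

```python
from dataclasses import dataclass, field

# (x0, y0, x1, y1) in page coordinates; the layout is assumed, not documented.
BBox = tuple[float, float, float, float]


@dataclass
class TextSpan:
    text: str
    bbox: BBox
    font: str = ""       # hypothetical font metadata
    size: float = 0.0


@dataclass
class TextBlock:
    spans: list[TextSpan] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(span.text for span in self.spans)


@dataclass
class PageLayout:
    number: int
    width: float
    height: float
    blocks: list[TextBlock] = field(default_factory=list)


@dataclass
class ExtractionResult:
    markdown: str                 # hypothetical field name
    stage1_ms: float = 0.0
    stage2_ms: float = 0.0
    stage3_ms: float = 0.0
    doc_type: str = "unknown"

    @property
    def word_count(self) -> int:
        return len(self.markdown.split())

    @property
    def total_ms(self) -> float:
        return self.stage1_ms + self.stage2_ms + self.stage3_ms
```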

⚡ Performance

Document Type       Throughput          Method
Native-text PDF     80+ pages/sec       Layout parsing
Scanned PDF         ~1 page/sec         Tesseract OCR (parallel workers)
Handwritten form    ~0.15 pages/sec     TrOCR + Tesseract hybrid

  • CPU-first — no CUDA required
  • Apple Silicon MPS acceleration for neural models
  • Parallel OCR workers for scanned pages
  • Background model preloading (overlaps Stage 1+2)
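
The last two points, parallel OCR workers and model preloading that overlaps with the earlier stages, follow a common concurrency pattern. The sketch below illustrates that pattern with `concurrent.futures`; it is not LightningDoc's implementation, and `load_model`, `ocr_page`, and `process` are placeholders standing in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_model():
    # Placeholder for a slow model load (for example, reading weights from disk).
    time.sleep(0.05)
    return "model"


def ocr_page(page):
    # Placeholder for per-page OCR; real OCR would call an external engine.
    return page.upper()


def process(pages, workers=4):
    """OCR pages in parallel while a model loads in the background."""
    # One extra thread so the preload does not take a slot from the OCR workers.
    with ThreadPoolExecutor(max_workers=workers + 1) as pool:
        model_future = pool.submit(load_model)    # starts immediately
        texts = list(pool.map(ocr_page, pages))   # results keep page order
        model = model_future.result()             # waits only if still loading
    return texts, model
```

Threads suit work that releases the GIL, such as waiting on an external OCR process; for CPU-bound pure-Python work, `ProcessPoolExecutor` would be the usual substitute.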

🔧 Optional Dependencies

Extra                    What it adds
lightningdoc[llm]        SmolLM2 OCR correction + document classification + GLM-OCR vision judge
lightningdoc[easyocr]    EasyOCR fusion for scanned pages
lightningdoc[all]        Everything
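
To see which optional features are usable in the current environment, one option is to probe for the underlying modules without importing them. This is a sketch under assumptions: `easyocr` is the import name of the EasyOCR package, but the modules pulled in by the `llm` extra are not listed on this page, so `transformers` and `torch` below are guesses.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def available_features():
    """Map feature names to availability (module names partly assumed)."""
    return {
        "easyocr": has_module("easyocr"),
        # Assumption: the llm extra relies on transformers and torch.
        "llm": has_module("transformers") and has_module("torch"),
    }


if __name__ == "__main__":
    for feature, ok in available_features().items():
        print(f"{feature}: {'available' if ok else 'missing'}")
```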

License

Apache 2.0 — see LICENSE for details.
