LLM-only, agentic layout & text extraction to Markdown/Text/Layout JSON
LayoutScribe
LLM-powered layout & text extraction for PDFs, slides, and Word docs
LLM-only, agentic parser that converts PDF / PPTX / DOCX into clean Markdown, plain text, and layout JSON (with normalized bounding boxes).
Built with LangGraph (agent orchestration), LiteLLM (provider-agnostic multimodal calls), and MLflow (tracing).
No OCR engines, no heuristic parsers. Rendering to images is allowed; all structure and text understanding is done by a multimodal LLM.
Features (0.1)
- Inputs: PDF, PPTX, DOCX (rendered pages/slides as images)
- Outputs:
  - Markdown (headings, lists, tables, captions)
  - Plain text
  - Layout JSON (blocks with type, bbox [0..1], text, conf)
- Agentic pipeline: planner → page_vision (async) → reviewer (validate/re-ask) → composer
- Robustness:
  - Re-ask on schema/geometry violations (IoU/coverage checks)
  - Fallback injection when the LLM returns empty content, so Markdown is never blank
- Provider-agnostic via LiteLLM (OpenAI, Azure OpenAI, Claude, Gemini)
- MLflow tracing for params, metrics, artifacts
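The geometry checks mentioned under Robustness can be pictured with a minimal sketch. It assumes each block carries a `bbox` in `(x0, y0, x1, y1)` order normalized to [0, 1]; that ordering, the helper names, and the 0.5 overlap threshold are illustrative assumptions, not LayoutScribe's actual reviewer logic.

```python
from typing import Sequence

BBox = Sequence[float]  # assumed order: (x0, y0, x1, y1), normalized to [0, 1]


def is_valid_bbox(bbox: BBox) -> bool:
    """Return True if the box has four in-range coordinates and positive area."""
    if len(bbox) != 4:
        return False
    x0, y0, x1, y1 = bbox
    in_range = all(0.0 <= v <= 1.0 for v in bbox)
    return in_range and x0 < x1 and y0 < y1


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def needs_reask(blocks: list[dict], max_iou: float = 0.5) -> bool:
    """Flag a page when any bbox is malformed or two blocks overlap heavily."""
    boxes = [blk.get("bbox", ()) for blk in blocks]
    if not all(is_valid_bbox(b) for b in boxes):
        return True
    return any(
        iou(boxes[i], boxes[j]) > max_iou
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```

A page whose blocks trip `needs_reask` would be a candidate for sending back to the model with a corrective prompt.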
Status
0.1 (alpha) released — see CHANGELOG.md and docs/ROADMAP.md.
Quick Links
- docs/ARCHITECTURE.md – modules & flow
- docs/PROMPTS_AND_SCHEMA.md – prompt rules and schema notes
- docs/schema/layout_page.schema.json – formal JSON Schema (Draft 2020-12)
- docs/CONFIGURATION.md – env vars, provider-specific setup, .env example
- docs/API_SPEC.md / docs/CLI_SPEC.md – contracts & examples
- docs/BENCHMARKS.md – datasets & metrics
- docs/TESTING_STRATEGY.md – testing plan & commands
- docs/PROVIDERS.md – model matrix & concurrency guidance
- docs/SECURITY.md – keys, artifacts, and vulnerability reporting
- docs/ROADMAP.md – milestones
- CONTRIBUTING.md – how to help
- CHANGELOG.md – notable changes
Installation
Requires Python 3.10+.
pip install layoutscribe
Optional extras:
# Office file support (PPTX/DOCX rendering via python-pptx / python-docx)
pip install "layoutscribe[office]"
# Development tools (ruff, black, pytest)
pip install "layoutscribe[dev]"
Runtime notes:
- PDF rendering: PyMuPDF (included)
- PPTX/DOCX support: python-pptx, python-docx (install with [office])
Getting Started
Set provider keys as environment variables (see docs/CONFIGURATION.md). Example .env:
OPENAI_API_KEY=sk-...
LAYOUTSCRIBE_DPI=180
Quickstart
CLI
layoutscribe parse ./samples/report.pdf \
--llm openai/gpt-4o \
--outputs markdown text layout_json \
--output-dir ./artifacts/report \
--dpi 180 --parallel-pages 6 --budget-usd 0.50
Python API
import asyncio

from layoutscribe.api import parse as ls_parse


async def main() -> None:
    doc = await ls_parse(
        path="samples/report.pdf",
        outputs=["markdown", "text", "layout_json"],
        llm="openai/gpt-4o",
        dpi=180,
        parallel_pages=6,
        budget_usd=0.50,
        save_intermediate=True,
    )
    print(doc.metadata)
    print(doc.markdown[:1000])


if __name__ == "__main__":
    asyncio.run(main())
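To give a feel for what a setting like `parallel_pages=6` implies, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. The names `process_pages` and `worker` are hypothetical; this is one common way to cap in-flight page requests, not necessarily how LayoutScribe schedules them.

```python
import asyncio


async def process_pages(pages, worker, parallel_pages=6):
    """Run `worker(page)` for every page with at most `parallel_pages` in flight."""
    sem = asyncio.Semaphore(parallel_pages)

    async def guarded(page):
        # The semaphore blocks here once `parallel_pages` workers are active.
        async with sem:
            return await worker(page)

    # gather preserves input order, so results line up with `pages`.
    return await asyncio.gather(*(guarded(p) for p in pages))
```

Lower values reduce the chance of hitting provider rate limits; higher values shorten wall-clock time on long documents.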
Outputs & Artifacts
./artifacts/report/
  document.md
  document.txt
  layout.json
  overlays/
    page-0001.png
    page-0002.png
  intermediate/
    page-0001.json
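Because bounding boxes are normalized, drawing them on a rendered page means scaling by the page's pixel size. The sketch below assumes `(x0, y0, x1, y1)` ordering, which this README does not confirm, and uses a hypothetical helper name; check docs/schema/layout_page.schema.json for the real field layout.

```python
def to_pixels(bbox, width, height):
    """Scale a normalized (x0, y0, x1, y1) box to integer pixel coordinates."""
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * width),
        round(y0 * height),
        round(x1 * width),
        round(y1 * height),
    )


# A US Letter page (8.5 x 11 in) rendered at 180 DPI is 1530 x 1980 pixels.
page_width, page_height = round(8.5 * 180), round(11 * 180)

# Illustrative block; the values are made up for the example.
block = {"type": "heading", "bbox": (0.1, 0.05, 0.9, 0.12), "text": "Report"}
pixel_box = to_pixels(block["bbox"], page_width, page_height)
```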
Configuration
See docs/CONFIGURATION.md for provider-specific env vars, defaults, and precedence. MLflow tracing is opt-in via --trace-mlflow.
LiteLLM provider setup
LiteLLM reads provider keys from environment variables. Set only those you need:
# OpenAI
OPENAI_API_KEY=sk-...
# Azure OpenAI
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-02-15-preview
# Anthropic
ANTHROPIC_API_KEY=...
# Google (Gemini)
GOOGLE_API_KEY=...
Use --llm to pick a model via LiteLLM:
--llm openai/gpt-4o
--llm azure/<deployment_name>
--llm anthropic/claude-3.5-sonnet
--llm google/gemini-1.5-pro
Notes:
- For Azure, ensure the deployment name references a vision-capable model and that your endpoint/API version are set.
- Keep temperature low (0–0.2) for consistent JSON.
- Respect provider rate limits; we use retries with exponential backoff.
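For readers wrapping their own provider calls, retrying with exponential backoff can be sketched as follows. The function name, defaults, and jitter amount are assumptions for illustration and do not describe LayoutScribe's internal retry policy.

```python
import asyncio
import random


async def with_backoff(call, retries=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=asyncio.sleep):
    """Await `call()`; on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries):
        try:
            return await call()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps concurrent workers from retrying in lockstep.
            await sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to rate-limit and transient network errors so that genuine failures such as invalid keys surface immediately.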
Limitations (0.1)
- No OCR engines; relies entirely on a multimodal LLM
- Basic tables only (CSV-like); no complex rowspan/colspan recovery
- No handwriting support; language translation out of scope
- Confidence scores (if present) are heuristic and not calibrated
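To make "CSV-like" concrete: a table without merged cells is just a rectangular grid of strings, which maps directly onto a Markdown pipe table. The helper below is a hypothetical sketch of that mapping, not the composer's actual code.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown pipe table (first row is the header)."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad short rows and escape pipes so cell text cannot break the table.
    norm = [
        [str(c).replace("|", "\\|") for c in r] + [""] * (width - len(r))
        for r in rows
    ]
    header, *body = norm
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```

Cells spanning several rows or columns have no place in this grid, which is why rowspan/colspan recovery needs a richer representation.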
Community & Support
- Open issues and discussions on GitHub
- For security concerns, follow docs/SECURITY.md (use private advisories)
License
Apache-2.0 (see LICENSE).