Convert PDFs and Office documents to LLM-friendly markdown, with diagram extraction to Mermaid.
pagespeak
Convert PDF, Word, and other office formats to LLM-friendly markdown — with diagram extraction to Mermaid.
Why
pagespeak began as the ingestion step for a retrieval system built from a large, diverse body of real documents — equipment and software manuals, textbooks, how-to guides, and HTML help sites. Every off-the-shelf converter produced markdown that looked clean and retrieved badly: heading hierarchy flattened so it couldn't be split into coherent chunks, wide tables collapsed into one cell, diagrams reduced to image refs with no text, HTML entities left undecoded. The extraction looked finished; the output wasn't usable.
Nothing in the mainstream stack closes that gap — the usual answer is to ignore structure and split on a fixed token window. So pagespeak became the layer that does. It delegates extraction to Marker, Docling, or MarkItDown and adds the passes that prepare the output for LLM/RAG use — each one a fix for a specific way a real document broke, found by converting it and reading the output, one corpus defect at a time:
- Diagrams → Mermaid, illustrations → labeled captions. Extractors leave an embedded image with no text for retrieval to match. pagespeak sends each to a vision model: structural graphics (flowcharts, signal flow, state and sequence diagrams) become a tagged Mermaid block an LLM can read and edit; morphological figures where the picture itself is the information (anatomical illustrations, micrographs, charts) instead get a dense caption that transcribes the visible labels. The original image is always kept beside it.
- Heading repair, then section split. It renormalizes flattened heading levels and splits the document into per-section files with in-text breadcrumbs, so each chunk carries its parent context.
- Decoration stripping, content-keyed caching, cost controls, and an output audit.
pagespeak audit scans converted markdown for the defects above and reports them; it runs on any markdown tree, not just pagespeak's.
Extraction stays delegated — pagespeak does not try to out-parse Marker or Docling. Wide tables are the main weak spot: when Marker collapses a multi-column spec table into a single cell, pagespeak repair-tables re-reads just that page with Docling and splices the clean grid back into the Marker output, rather than re-converting the whole document.
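To make the audit idea concrete, here is a minimal sketch of a read-only scan over a markdown string. It is not pagespeak's implementation: the check names `undecoded_entity` and `image_without_text` are hypothetical, and while `collapsed_table` is the defect name the docs list uses, the single-long-cell heuristic and the `max_cell_chars` threshold are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # fence marker, built this way so the sketch nests safely in markdown
ENTITY_RE = re.compile(r"&(?:[a-zA-Z]+|#\d+|#x[0-9a-fA-F]+);")
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<src>[^)]+)\)")


@dataclass
class Finding:
    line_no: int
    check: str   # hypothetical check name, see the lead-in
    detail: str


def audit_markdown(text: str, max_cell_chars: int = 400) -> list[Finding]:
    """Scan markdown text and return one Finding per suspected defect."""
    findings: list[Finding] = []
    in_fence = False
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # code and Mermaid bodies are not prose; skip them
        # HTML entities that were never decoded, e.g. "&amp;"
        for match in ENTITY_RE.finditer(line):
            findings.append(Finding(line_no, "undecoded_entity", match.group(0)))
        # Image references with empty alt text give retrieval nothing to match
        for match in IMAGE_RE.finditer(line):
            if not match.group("alt").strip():
                findings.append(Finding(line_no, "image_without_text", match.group("src")))
        # A table row with exactly one very long cell suggests a collapsed grid
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if len(cells) == 1 and len(cells[0]) > max_cell_chars:
                findings.append(
                    Finding(line_no, "collapsed_table", f"{len(cells[0])} chars in one cell")
                )
    return findings
```

Because the scan only reads text and returns findings, it can run over any markdown tree without touching the files, which is the property the audit command is described as having.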
That structure is the payoff. It lets a compound, cross-document question — one needing the right sections from several manuals at once — be answered from a few thousand relevant tokens, every step traceable to the manual it came from, instead of from documents too large to load.
It is new as a public project, not as code: these passes were hardened one real document at a time — convert, read the output, fix what broke — against a large working corpus well before this release. The worked examples below are the receipts.
See it on real documents — docs/worked-examples.md. One 18-page textbook chapter run through raw Marker, raw Docling, and pagespeak side by side (the heading-repair before/after, a real diagram→Mermaid, and a retrieval query the figure answers but the prose can't), then a 68-manual / ~6.1M-token library where one compound question is answered from ~1,600 retrieved tokens across three manuals — with the dated models and the honest misses reported, not hidden. It's the empirical case for everything above, and the best way to judge whether this fits your corpus.
Scope — where pagespeak fits
pagespeak is the ingestion and structuring stage of a retrieval pipeline, not a whole one. It converts documents to clean, per-section markdown with breadcrumbs and provenance — and stops there. It does not do embeddings, vector storage, retrieval, or a query/chat layer; pair it with your own vector DB and retrieval framework (LlamaIndex, LangChain, Haystack, or hand-rolled).
One thing to know going in: the section split is structural, not size-based. Sections are cut at heading boundaries — there is no token budget, no max-chunk size, and no overlap. A long section stays one file; a near-empty one is dropped (min_body_chars). That makes each file a coherent, self-locating unit — which is what you want feeding an embedder — but if your retrieval needs uniformly-sized chunks, add a token-aware splitter downstream. The structure pagespeak recovers (correct heading levels, in-text breadcrumbs) is exactly what makes that downstream chunking clean.
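If you do need size-bounded chunks, a downstream splitter can be quite small. The sketch below is one possible approach, not part of pagespeak: it assumes the first line of a section file carries the heading or breadcrumb (the real breadcrumb layout may differ), approximates tokens by whitespace-separated words, and packs whole paragraphs so tables and lists are never cut mid-block. The function name `split_section` is hypothetical.

```python
def split_section(text: str, max_tokens: int = 300) -> list[str]:
    """Pack a section's paragraphs into chunks of roughly max_tokens words.

    The first line is repeated at the top of every chunk so each chunk
    stays self-locating. A single paragraph larger than the budget is
    kept whole rather than cut mid-paragraph.
    """
    header, _, body = text.partition("\n")
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paragraphs:
        return [text]  # nothing to pack; return the section untouched

    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        size = len(para.split())  # crude token estimate: whitespace words
        if current and used + size > max_tokens:
            chunks.append("\n\n".join([header, *current]))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join([header, *current]))
    return chunks
```

For a production pipeline you would likely swap the word count for your embedder's real tokenizer; the packing logic stays the same.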
Install
Pre-PyPI: the first PyPI release is pending, so pip install pagespeak is not live yet. Until it lands, install from source: pip install "pagespeak @ git+https://github.com/phierceweb/pagespeak", adding extras as "pagespeak[pdf] @ git+https://github.com/phierceweb/pagespeak". Once published, the commands below work as written.
pip install pagespeak # DOCX/PPTX/XLSX/HTML/CSV/JSON/...
pip install pagespeak[pdf] # adds Marker for PDF (default)
pip install pagespeak[pdf-docling] # adds Docling for PDF (accuracy-first)
pip install pagespeak[pdf,pdf-docling] # both — pick at call time
pip install pagespeak[docx-structured] # adds python-docx for structure-faithful DOCX
pip install pagespeak[pdf,docx-structured] # PDF + structure-faithful DOCX
pip install pagespeak[tophat] # adds the Top Hat quiz-export backend (light; pypdfium2)
pip install pagespeak[web] # localhost web console (FastAPI + uvicorn)
pagespeak builds on pf-core for its LLM clients (Anthropic / Claude Code / OpenRouter), structured logging, pipeline manifest helpers, CLI subcommand factories, and atomic-write utilities. pf-core[image-phash] is pulled in transitively, so no separate install step is required.
Quickstart
from pagespeak import to_markdown
result = to_markdown("manual.pdf", output_dir="./out", diagrams=True)
result.markdown # final markdown with mermaid blocks embedded
result.images # list[Path] of extracted images
result.diagrams # list[Diagram(image_path, caption, mermaid, diagram_type)]
# One command: ingest + Phase 3 (cleanup, normalize, vision, split)
pagespeak convert manual.pdf -o ./out
pagespeak convert report.docx -o ./out --no-diagrams
# Two commands: backend phase separately, iterate Phase 3
pagespeak ingest thick.pdf -o ./out --workers 4 # chunked-parallel Marker
pagespeak convert ./out --normalize-headings # Phase 3 on existing raw.md
For RAG-shaped output (split into per-section files with sensible defaults):
pagespeak convert manual.pdf -o ./out --preset rag-default
rag-default's heading mode is heuristic, which suits cleanly numbered documents (textbooks, specs with 1.1/1.2 sections). On un-numbered manuals the heuristic skips the repair and leaves the flattened hierarchy in place. For those (most consumer-electronics / AV / software manuals), add the LLM heading-repair pass; this is the canonical recipe for a real manual corpus:
pagespeak convert manual.pdf -o ./out --preset rag-default --normalize-headings-mode llm_full
See docs/presets.md for the five built-in presets and docs/choosing-defaults.md for the per-document-type triage (when to add llm_full, --device cpu, page-ranging, and more).
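To see why numbered documents are the easy case, consider a sketch of numbering-based level repair. This is an illustration of the general idea, not pagespeak's heuristic mode: the function name `renumber_heading_levels` and its `base_level` parameter are hypothetical. It derives each heading's level from the depth of its section number (1 is level 1, 1.1 is level 2, and so on) and leaves un-numbered headings untouched, which is exactly where a heuristic like this has nothing to work with.

```python
import re

FENCE = "`" * 3  # fence marker, built this way so the sketch nests safely in markdown
NUMBERED_HEADING = re.compile(r"^#{1,6}\s+(\d+(?:\.\d+)*\.?)\s+(.*)$")


def renumber_heading_levels(markdown: str, base_level: int = 1) -> str:
    """Rewrite heading levels from section-number depth; skip un-numbered headings."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else NUMBERED_HEADING.match(line)
        if match:
            number, title = match.groups()
            depth = number.rstrip(".").count(".") + 1  # "1.2.3" -> depth 3
            level = min(base_level + depth - 1, 6)     # markdown stops at level 6
            line = f"{'#' * level} {number} {title}"
        out.append(line)
    return "\n".join(out)
```

One caveat of this toy: a heading that merely starts with a number, such as a year, would be treated as a section number, so a real pass needs more context than a single regex.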
Output shape
For each diagram detected, the caption goes in the image's alt text and a tagged Mermaid block follows:

```mermaid pagespeak-image="images/_page_30_Figure_4.jpeg"
flowchart TD
A["Body temperature exceeds 37°C"]
B["Nerve cells in skin and brain"]
C["Temperature regulatory center in brain"]
D["Sweat glands throughout body"]
A --> B
B --> C
C --> D
D -.-> A
```
- Captions live in alt text — extractable without parsing prose, read by screen readers.
- Mermaid blocks tag their source image with pagespeak-image="<path>" on the fenced-block info string. Renderers ignore the tag; parsers can pair Mermaid with the image it was generated from.
- Non-structural images — photos, screenshots, and morphological figures (anatomy, micrographs, chemical structures, charts) — get a caption instead of Mermaid; label-bearing illustrations get a caption that transcribes their visible labels.
- Repeated decorations (page headers, footer logos) are detected via perceptual-hash clustering and stripped from the consolidated markdown.
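A downstream parser can recover the Mermaid-to-image pairing from the info string alone. The following is a minimal sketch that assumes the tag appears exactly as in the example above (`mermaid`, whitespace, then `pagespeak-image="<path>"`); the function name `pair_mermaid_with_images` is hypothetical and not part of pagespeak's API.

```python
import re

FENCE = "`" * 3  # fence marker, built this way so the sketch nests safely in markdown
FENCE_OPEN = re.compile(r'^`{3}mermaid\s+pagespeak-image="(?P<image>[^"]+)"\s*$')


def pair_mermaid_with_images(markdown: str) -> list[tuple[str, str]]:
    """Return (image_path, mermaid_source) for every tagged Mermaid block."""
    pairs: list[tuple[str, str]] = []
    image = None            # path of the block being read, or None outside a block
    body: list[str] = []
    for line in markdown.splitlines():
        if image is None:
            match = FENCE_OPEN.match(line.strip())
            if match:
                image, body = match.group("image"), []
        elif line.strip() == FENCE:
            pairs.append((image, "\n".join(body)))
            image = None
        else:
            body.append(line)
    return pairs
```

Untagged Mermaid blocks and other fenced code are ignored, so the function is safe to run over a whole converted document.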
See it on a real document: docs/worked-examples.md runs one chapter of a CC-BY textbook through raw Marker, raw Docling, and pagespeak — the heading-repair before/after, a diagram→Mermaid, and a retrieval query the figure answers but the prose can't.
Vision backends
| Backend | When to use | Auth |
|---|---|---|
| claude_code (default) | $0/call via a Claude Code subscription | claude binary on PATH |
| anthropic | Direct API; fastest | ANTHROPIC_API_KEY |
| openrouter | Multi-provider unified billing (Gemini, Llama vision, …) | OPENROUTER_API_KEY |
The default model is Claude Haiku 4.5 — $0 on the default claude_code backend, or typically $0.001–$0.005 per image on a paid backend. See docs/diagrams.md for backend mechanics, prompt versioning, and failure handling.
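For budgeting on a paid backend, the per-image range above turns into a simple estimate. This sketch only multiplies the quoted range by an image count; the function name is hypothetical, and it assumes every image triggers exactly one paid call, ignoring any savings from caching or stripped decorations.

```python
def estimate_vision_cost(
    num_images: int, low: float = 0.001, high: float = 0.005
) -> tuple[float, float]:
    """Return a (low, high) dollar range for one vision call per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return (num_images * low, num_images * high)


# A manual with 400 extracted figures: roughly $0.40 to $2.00
low_usd, high_usd = estimate_vision_cost(400)
print(f"${low_usd:.2f} to ${high_usd:.2f}")
```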
Vision output is best-effort. A diagram's Mermaid is a model's reading of the image: usually faithful for clean structural figures, but it can be approximate or wrong on dense, hand-drawn, or low-resolution ones, and a confidently wrong caption is worse than none. pagespeak keeps the original image beside every block and biases the prompt toward caption-only when a figure isn't cleanly structural. For critical content, spot-check the Mermaid against the source rather than trusting it blindly. The worked examples report the real per-figure hit rate (e.g. organs named for 7 of 11 body systems), not a perfect one.
Format support
| Format | Backend |
|---|---|
| .pdf | Marker (default, fast) or Docling (accuracy-first). See docs/backends.md. |
| .docx, .pptx, .xlsx, .html, .htm, .csv, .json, .xml, .epub | MarkItDown |
| Canvas QTI quiz export (directory or .imscc) | Built-in QTI backend → one markdown file per quiz with the answer key. See docs/canvas-quizzes.md. |
| Top Hat quiz-export PDF | --pdf-backend tophat → one ## Question N block per question, correct answer marked when revealed, embedded figures extracted + captioned. See docs/tophat-quizzes.md. |
How it relates to other tools
pagespeak is not a parser — it wraps existing extractors and runs cleanup, structuring, and diagram passes around their output.
| Tool | What it is | How pagespeak relates |
|---|---|---|
| MarkItDown, Marker, Docling | Open-source document → markdown extractors | Used as pagespeak's backends; pagespeak runs heading repair, section splitting, decoration stripping, and diagram→Mermaid on their output |
| LlamaParse, Reducto, Mathpix | Hosted, paid extraction APIs for complex documents | Different model — pagespeak runs locally on the extractors above, with optional $0 vision via Claude Code |
| Unstructured | Partitions documents into typed elements for RAG frameworks | Different output — pagespeak emits per-section markdown files with breadcrumbs and embedded Mermaid |
Why a layer at all — heading fidelity
The hardest part of PDF→markdown for RAG is the heading tree, because that's what chunking splits on. PDFs don't store semantic heading levels — only font sizes — so every extractor guesses, and each flattens or mis-levels real documents in a different way:
| Backend | Heading hierarchy | Tables | Figures / formulas |
|---|---|---|---|
| Marker (default) | 4-level pyramid in single-shot; flattens in the chunked pipeline (per-chunk font stats disagree). MPS crash on Apple Silicon → --device cpu | occasionally collapses a multi-column table into one cell | — |
| Docling | capped at 2 levels by design — its layout model labels every section heading level=1 | well-formed, TableFormer-grade | ~25% more figures on textbooks; formula → LaTeX |
So no backend simply gets structure right, and "just use Marker" or "just use Docling" inherits that backend's specific failure. Recovering the structure is pagespeak's reason to exist: an optional LLM heading-renormalization stage rebuilds a flattened hierarchy, deterministic post-passes repair levels at $0, and repair-tables re-reads just a collapsed table's page through Docling and splices the clean grid back into Marker's output — one page, not a second full conversion. Pick the backend for its strengths; pagespeak patches its known weakness. The full trade-off and recipes: docs/backends.md and docs/choosing-defaults.md.
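A toy model shows why font-size guessing is fragile and why per-chunk statistics can disagree. The sketch below is not Marker's or Docling's algorithm; it is a deliberately simple stand-in that ranks the distinct heading font sizes in whatever scope it is given and uses the rank as the level. The sample headings and sizes are made up for illustration.

```python
def levels_from_font_sizes(headings: list[tuple[str, float]]) -> list[tuple[str, int]]:
    """Rank distinct font sizes, largest first, and use the rank as the level."""
    sizes = sorted({size for _, size in headings}, reverse=True)
    rank = {size: i + 1 for i, size in enumerate(sizes)}
    return [(title, rank[size]) for title, size in headings]


# (title, font size in points) for a made-up manual
doc = [
    ("Amplifier", 24.0),
    ("Inputs", 18.0),
    ("Phono", 14.0),
    ("Outputs", 18.0),
    ("Speakers", 14.0),
]

# One pass over the whole document sees all three sizes
whole = levels_from_font_sizes(doc)

# Two independent chunks: the second never sees the 24pt title,
# so its 18pt heading is promoted to the top level
chunked = levels_from_font_sizes(doc[:3]) + levels_from_font_sizes(doc[3:])

print(dict(whole)["Outputs"], dict(chunked)["Outputs"])
```

In the whole-document pass "Outputs" lands at level 2, while the chunked pass puts it at level 1. The same heading gets a different level depending on what else happened to share its chunk, which is the kind of inconsistency a renormalization stage has to undo.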
Docs
- docs/pipeline.md — stage-by-stage walkthrough of what every command runs (spine)
- docs/worked-examples.md — end-to-end before/after on real documents: extraction, repair, retrieval, and the cross-document payoff
- docs/usage.md — library + CLI examples, kwargs, env vars, common recipes
- docs/choosing-defaults.md — pre-ingest triage: canonical recipe, vendor patterns, when to deviate
- docs/presets.md — config presets and <output>/.pagespeak-run.json
- docs/architecture.md — module layout, data flow
- docs/diagrams.md — vision pass, prompt versioning
- docs/cleanup.md — cleanup levels, cross-refs, section splitting
- docs/normalize-headings.md — heading-level renormalization
- docs/audit.md — pagespeak audit: scan converted output for conversion defects (read-only, $0)
- docs/repair-tables.md — pagespeak repair-tables: splice Docling's clean grid into Marker-collapsed tables (the fix for the audit's collapsed_table)
- docs/caching.md — cache layers, --rerun-from, baselines, diff
- docs/backends.md — Marker vs Docling for PDF
- docs/docx-backends.md — MarkItDown vs python-docx for DOCX
- docs/ingest.md — pagespeak ingest, chunked-parallel workers, resume semantics
- docs/format-support.md — per-format quirks
- docs/canvas-quizzes.md — Canvas QTI quiz exports → one markdown file per quiz
- docs/tophat-quizzes.md — Top Hat quiz-export PDFs → per-question markdown (--pdf-backend tophat)
- docs/operations.md — sandbox / ProcessPoolExecutor gotchas
- docs/web.md — web console: upload/queue, per-phase cockpit, cost gate, LLM observability
Security
SECURITY.md covers vulnerability reporting and safe-usage notes for shared environments (the console has no auth; remote-image fetching is SSRF-guarded).
License
MIT.