Convert PDFs and Office documents to LLM-friendly markdown, with diagram extraction to Mermaid.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

phierceweb

These details have not been verified by PyPI

Project description

pagespeak

Convert PDF, Word, and other office formats to LLM-friendly markdown — with diagram extraction to Mermaid.

Why

pagespeak began as the ingestion step for a retrieval system built from a large, diverse body of real documents — equipment and software manuals, textbooks, how-to guides, and HTML help sites. Every off-the-shelf converter produced markdown that looked clean and retrieved badly: heading hierarchy flattened so it couldn't be split into coherent chunks, wide tables collapsed into one cell, diagrams reduced to image refs with no text, HTML entities left undecoded. The extraction looked finished; the output wasn't usable.

Nothing in the mainstream stack closes that gap — the usual answer is to ignore structure and split on a fixed token window. So pagespeak became the layer that does. It delegates extraction to Marker, Docling, or MarkItDown and adds the passes that prepare the output for LLM/RAG use — each one a fix for a specific way a real document broke, found by converting it and reading the output, one corpus defect at a time:

Diagrams → Mermaid, illustrations → labeled captions. Extractors leave an embedded image with no text for retrieval to match. pagespeak sends each to a vision model: structural graphics (flowcharts, signal flow, state and sequence diagrams) become a tagged Mermaid block an LLM can read and edit; morphological figures where the picture itself is the information (anatomical illustrations, micrographs, charts) instead get a dense caption that transcribes the visible labels. The original image is always kept beside it.
Heading repair, then section split. It renormalizes flattened heading levels and splits the document into per-section files with in-text breadcrumbs, so each chunk carries its parent context.
Decoration stripping, content-keyed caching, cost controls, and an output audit. pagespeak audit scans converted markdown for the defects above and reports them; it runs on any markdown tree, not just pagespeak's.

Extraction stays delegated — pagespeak does not try to out-parse Marker or Docling. Wide tables are the main weak spot: when Marker collapses a multi-column spec table into a single cell, pagespeak repair-tables re-reads just that page with Docling and splices the clean grid back into the Marker output, rather than re-converting the whole document.

That structure is the payoff. It lets a compound, cross-document question — one needing the right sections from several manuals at once — be answered from a few thousand relevant tokens, every step traceable to the manual it came from, instead of from documents too large to load.

It is new as a public project, not as code: these passes were hardened one real document at a time — convert, read the output, fix what broke — against a large working corpus well before this release. The worked examples below are the receipts.

See it on real documents — docs/worked-examples.md. One 18-page textbook chapter run through raw Marker, raw Docling, and pagespeak side by side (the heading-repair before/after, a real diagram→Mermaid, and a retrieval query the figure answers but the prose can't), then a 68-manual / ~6.1M-token library where one compound question is answered from ~1,600 retrieved tokens across three manuals — with the dated models and the honest misses reported, not hidden. It's the empirical case for everything above, and the best way to judge whether this fits your corpus.

Scope — where pagespeak fits

pagespeak is the ingestion and structuring stage of a retrieval pipeline, not a whole one. It converts documents to clean, per-section markdown with breadcrumbs and provenance — and stops there. It does not do embeddings, vector storage, retrieval, or a query/chat layer; pair it with your own vector DB and retrieval framework (LlamaIndex, LangChain, Haystack, or hand-rolled).

One thing to know going in: the section split is structural, not size-based. Sections are cut at heading boundaries — there is no token budget, no max-chunk size, and no overlap. A long section stays one file; a near-empty one is dropped (min_body_chars). That makes each file a coherent, self-locating unit — which is what you want feeding an embedder — but if your retrieval needs uniformly-sized chunks, add a token-aware splitter downstream. The structure pagespeak recovers (correct heading levels, in-text breadcrumbs) is exactly what makes that downstream chunking clean.

Install

Pre-PyPI: the first PyPI release is pending, so pip install pagespeak is not live yet. Until it lands, install from source — pip install "pagespeak @ git+https://github.com/phierceweb/pagespeak", adding extras as "pagespeak[pdf] @ git+https://github.com/phierceweb/pagespeak". Once published, the commands below work as written.

pip install pagespeak                       # DOCX/PPTX/XLSX/HTML/CSV/JSON/...
pip install pagespeak[pdf]                  # adds Marker for PDF (default)
pip install pagespeak[pdf-docling]          # adds Docling for PDF (accuracy-first)
pip install pagespeak[pdf,pdf-docling]      # both — pick at call time
pip install pagespeak[docx-structured]      # adds python-docx for structure-faithful DOCX
pip install pagespeak[pdf,docx-structured]  # PDF + structure-faithful DOCX
pip install pagespeak[tophat]               # adds the Top Hat quiz-export backend (light; pypdfium2)
pip install pagespeak[web]                  # localhost web console (FastAPI + uvicorn)

Pagespeak builds on pf-core for its LLM clients (Anthropic / Claude Code / OpenRouter), structured logging, pipeline manifest helpers, CLI subcommand factories, and atomic-write utilities. pf-core[image-phash] is pulled in transitively — no separate install step required.

Quickstart

from pagespeak import to_markdown

result = to_markdown("manual.pdf", output_dir="./out", diagrams=True)

result.markdown   # final markdown with mermaid blocks embedded
result.images     # list[Path] of extracted images
result.diagrams   # list[Diagram(image_path, caption, mermaid, diagram_type)]

# One command: ingest + Phase 3 (cleanup, normalize, vision, split)
pagespeak convert manual.pdf -o ./out
pagespeak convert report.docx -o ./out --no-diagrams

# Two commands: backend phase separately, iterate Phase 3
pagespeak ingest thick.pdf -o ./out --workers 4   # chunked-parallel Marker
pagespeak convert ./out --normalize-headings      # Phase 3 on existing raw.md

For RAG-shaped output (split into per-section files with sensible defaults):

pagespeak convert manual.pdf -o ./out --preset rag-default

rag-default's heading mode is heuristic, which is right for cleanly-numbered documents (textbooks, specs with 1.1/1.2 sections) — but it skips on un-numbered manuals, where it leaves the flattened hierarchy in place. For those (most consumer-electronics / AV / software manuals), add the LLM heading-repair pass — this is the canonical recipe for a real manual corpus:

pagespeak convert manual.pdf -o ./out --preset rag-default --normalize-headings-mode llm_full

See docs/presets.md for the five built-in presets and docs/choosing-defaults.md for the per-document-type triage (when to add llm_full, --device cpu, page-ranging, and more).

Output shape

For each diagram detected, the caption goes in the image's alt text and a tagged Mermaid block follows:

![Two-panel figure showing negative feedback control loops. Panel (a) presents the general model: a stimulus activates a sensor, which signals a control center that directs an effector, with a feedback loop returning to the stimulus. Panel (b) demonstrates thermoregulation: elevated body temperature is detected by nerve cells, processed by the brain's temperature regulatory center, and triggers sweating to cool the body.](images/_page_30_Figure_4.jpeg)

```mermaid pagespeak-image="images/_page_30_Figure_4.jpeg"
flowchart TD
    A["Body temperature exceeds 37°C"]
    B["Nerve cells in skin and brain"]
    C["Temperature regulatory center in brain"]
    D["Sweat glands throughout body"]
    A --> B
    B --> C
    C --> D
    D -.-> A
```

Captions live in alt text — extractable without parsing prose, read by screen readers.
Mermaid blocks tag their source image with pagespeak-image="<path>" on the fenced-block info string. Renderers ignore the tag; parsers can pair Mermaid with the image it was generated from.
Non-structural images — photos, screenshots, and morphological figures (anatomy, micrographs, chemical structures, charts) — get a caption instead of Mermaid; label-bearing illustrations get a caption that transcribes their visible labels.
Repeated decorations (page headers, footer logos) are detected via perceptual-hash clustering and stripped from the consolidated markdown.

See it on a real document: docs/worked-examples.md runs one chapter of a CC-BY textbook through raw Marker, raw Docling, and pagespeak — the heading-repair before/after, a diagram→Mermaid, and a retrieval query the figure answers but the prose can't.

Vision backends

Backend	When to use	Auth
`claude_code` (default)	$0/call via a Claude Code subscription	`claude` binary on PATH
`anthropic`	Direct API; fastest	`ANTHROPIC_API_KEY`
`openrouter`	Multi-provider unified billing (Gemini, Llama vision, …)	`OPENROUTER_API_KEY`

The default model is Claude Haiku 4.5 — $0 on the default claude_code backend, or typically $0.001–$0.005 per image on a paid backend. See docs/diagrams.md for backend mechanics, prompt versioning, and failure handling.

Vision output is best-effort. A diagram's Mermaid is a model's reading of the image — usually faithful for clean structural figures, but it can be approximate or wrong on dense, hand-drawn, or low-resolution ones, and a confidently-wrong caption is worse than none. pagespeak keeps the original image beside every block, biases the prompt toward caption-only when a figure isn't cleanly structural, and for critical content you should spot-check the Mermaid against the source rather than trust it blindly. The worked examples report the real per-figure hit rate (e.g. organs named for 7 of 11 body systems), not a perfect one.

Format support

Format	Backend
`.pdf`	Marker (default, fast) or Docling (accuracy-first). See docs/backends.md.
`.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`, `.csv`, `.json`, `.xml`, `.epub`	MarkItDown
Canvas QTI quiz export (directory or `.imscc`)	Built-in QTI backend → one markdown file per quiz with the answer key. See docs/canvas-quizzes.md.
Top Hat quiz-export PDF	`--pdf-backend tophat` → one `## Question N` block per question, correct answer marked when revealed, embedded figures extracted + captioned. See docs/tophat-quizzes.md.

How it relates to other tools

pagespeak is not a parser — it wraps existing extractors and runs cleanup, structuring, and diagram passes around their output.

Tool	What it is	How pagespeak relates
MarkItDown, Marker, Docling	Open-source document → markdown extractors	Used as pagespeak's backends; pagespeak runs heading repair, section splitting, decoration stripping, and diagram→Mermaid on their output
LlamaParse, Reducto, Mathpix	Hosted, paid extraction APIs for complex documents	Different model — pagespeak runs locally on the extractors above, with optional $0 vision via Claude Code
Unstructured	Partitions documents into typed elements for RAG frameworks	Different output — pagespeak emits per-section markdown files with breadcrumbs and embedded Mermaid

Why a layer at all — heading fidelity

The hardest part of PDF→markdown for RAG is the heading tree, because that's what chunking splits on. PDFs don't store semantic heading levels — only font sizes — so every extractor guesses, and each flattens or mis-levels real documents in a different way:

	Heading hierarchy	Tables	Figures / formulas
Marker (default)	4-level pyramid in single-shot; flattens in the chunked pipeline (per-chunk font stats disagree). MPS crash on Apple Silicon → `--device cpu`	occasionally collapses a multi-column table into one cell	—
Docling	capped at 2 levels by design — its layout model labels every section heading `level=1`	well-formed, TableFormer-grade	~25% more figures on textbooks; formula → LaTeX

So no backend simply gets structure right, and "just use Marker" or "just use Docling" inherits that backend's specific failure. Recovering the structure is pagespeak's reason to exist: an optional LLM heading-renormalization stage rebuilds a flattened hierarchy, deterministic post-passes repair levels at $0, and repair-tables re-reads just a collapsed table's page through Docling and splices the clean grid back into Marker's output — one page, not a second full conversion. Pick the backend for its strengths; pagespeak patches its known weakness. The full trade-off and recipes: docs/backends.md and docs/choosing-defaults.md.

Docs

docs/pipeline.md — stage-by-stage walkthrough of what every command runs (spine)
docs/worked-examples.md — end-to-end before/after on real documents: extraction, repair, retrieval, and the cross-document payoff
docs/usage.md — library + CLI examples, kwargs, env vars, common recipes
docs/choosing-defaults.md — pre-ingest triage: canonical recipe, vendor patterns, when to deviate
docs/presets.md — config presets and <output>/.pagespeak-run.json
docs/architecture.md — module layout, data flow
docs/diagrams.md — vision pass, prompt versioning
docs/cleanup.md — cleanup levels, cross-refs, section splitting
docs/normalize-headings.md — heading-level renormalization
docs/audit.md — pagespeak audit: scan converted output for conversion defects (read-only, $0)
docs/repair-tables.md — pagespeak repair-tables: splice Docling's clean grid into Marker-collapsed tables (the fix for the audit's collapsed_table)
docs/caching.md — cache layers, --rerun-from, baselines, diff
docs/backends.md — Marker vs Docling for PDF
docs/docx-backends.md — MarkItDown vs python-docx for DOCX
docs/ingest.md — pagespeak ingest, chunked-parallel workers, resume semantics
docs/format-support.md — per-format quirks
docs/canvas-quizzes.md — Canvas QTI quiz exports → one markdown file per quiz
docs/tophat-quizzes.md — Top Hat quiz-export PDFs → per-question markdown (--pdf-backend tophat)
docs/operations.md — sandbox / ProcessPoolExecutor gotchas
docs/web.md — web console: upload/queue, per-phase cockpit, cost gate, LLM observability

Security

SECURITY.md covers vulnerability reporting and safe-usage notes for shared environments (the console has no auth; remote-image fetching is SSRF-guarded).

License

MIT.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

phierceweb

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

Jun 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pagespeak-0.1.0.tar.gz (426.6 kB view details)

Uploaded Jun 17, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pagespeak-0.1.0-py3-none-any.whl (296.5 kB view details)

Uploaded Jun 17, 2026 Python 3

File details

Details for the file pagespeak-0.1.0.tar.gz.

File metadata

Download URL: pagespeak-0.1.0.tar.gz
Upload date: Jun 17, 2026
Size: 426.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pagespeak-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1b9aff08f5245b8be60d542c1925604d8800f6d798cfd1c14d047fc878951aef`
MD5	`492b7a8a51016384d551a6136d654da3`
BLAKE2b-256	`50afc9c50736ea34d21fc219d0b6982926195e9d95a65da89eb076f2b462ebe5`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pagespeak-0.1.0.tar.gz:

Publisher: publish.yml on phierceweb/pagespeak

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pagespeak-0.1.0.tar.gz
- Subject digest: 1b9aff08f5245b8be60d542c1925604d8800f6d798cfd1c14d047fc878951aef
- Sigstore transparency entry: 1848521414
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: phierceweb/pagespeak@90d5abcbd78dba51b35b35e6ec907afb268a4774
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/phierceweb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@90d5abcbd78dba51b35b35e6ec907afb268a4774
- Trigger Event: push

File details

Details for the file pagespeak-0.1.0-py3-none-any.whl.

File metadata

Download URL: pagespeak-0.1.0-py3-none-any.whl
Upload date: Jun 17, 2026
Size: 296.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pagespeak-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7101e968626bb2ba9b8fc0ead50807b69d7d218d539a2ec76ffbd36c44429a9`
MD5	`dee963ca243d71345dbacc579fcb5f62`
BLAKE2b-256	`ccec6ac280bec63959b459399fc4e4990fa6ac7e4844ddf4ead2222667c9a4f2`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pagespeak-0.1.0-py3-none-any.whl:

Publisher: publish.yml on phierceweb/pagespeak

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pagespeak-0.1.0-py3-none-any.whl
- Subject digest: b7101e968626bb2ba9b8fc0ead50807b69d7d218d539a2ec76ffbd36c44429a9
- Sigstore transparency entry: 1848521536
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: phierceweb/pagespeak@90d5abcbd78dba51b35b35e6ec907afb268a4774
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/phierceweb
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@90d5abcbd78dba51b35b35e6ec907afb268a4774
- Trigger Event: push

pagespeak 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

pagespeak

Why

Scope — where pagespeak fits

Install

Quickstart

Output shape

Vision backends

Format support

How it relates to other tools

Why a layer at all — heading fidelity

Docs

Security

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance