Convert PDFs (especially Claude Web UI research exports) to Markdown with hyperlinks and visual structure preserved.

claude-pdf2md

Convert PDFs — especially Claude Web UI research exports — into Markdown with hyperlinks and visual structure preserved, and with a built-in visual diff to measure how close the rendered Markdown actually looks to the PDF.

Motivation

Claude Web UI's research feature produces two artefacts per report:

- a PDF with working hyperlinks on every citation, and
- a .md file with the same prose but all URLs stripped.

That asymmetry makes the .md output nearly useless for downstream work: quotes without sources, references without destinations, reports you can re-read but not verify. The PDF keeps the URLs as real PDF link annotations (rectangles drawn over the glyph run), so the information is present — it just isn't in the text layer.

claude-pdf2md rebuilds the Markdown from the PDF, not from Claude's textual export, and uses those link annotations as the source of truth for every citation.

What it does

End-to-end PDF → Markdown with:

- 100% link recall via character-level overlay of the PDF's link annotation rectangles onto the glyph bounding boxes, so every [text](url) reflects what the PDF actually linked.
- Structure detection: headings (by font size relative to body text), bullet and numbered lists, paragraph reflow across wrapped lines, tables (PyMuPDF's page.find_tables()), embedded images (dumped to an assets/ dir).
- Citation-pill absorption: Claude's research PDFs render each citation as a small-font "pill" floating inline; these would otherwise break up the surrounding sentence. The pill blocks are detected by font size and appended to the preceding paragraph.
- Visual diff as a quality gate: each page of the source PDF is rendered to PNG, the generated Markdown is re-rendered to PDF via WeasyPrint and to PNG via PyMuPDF, and the two are compared with SSIM. Side-by-side diff images and a JSON report are written for manual inspection.
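
The character-level overlay in the first item comes down to a rectangle-coverage test per character. The following is a minimal sketch of that idea, not the library's code: it assumes plain (x0, y0, x1, y1) boxes and a 50% coverage threshold, and Rect, overlap_area and url_for_char are illustrative names.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two boxes, 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def url_for_char(char_box: Rect, links, min_coverage: float = 0.5) -> Optional[str]:
    """Return the URL of the link rectangle covering most of the character.

    `links` is a list of (Rect, url) pairs. A link only counts if it covers
    at least `min_coverage` of the character's own area.
    """
    if char_box.area == 0:
        return None
    best_url, best_cov = None, 0.0
    for rect, url in links:
        cov = overlap_area(char_box, rect) / char_box.area
        if cov >= min_coverage and cov > best_cov:
            best_url, best_cov = url, cov
    return best_url


links = [(Rect(100, 10, 150, 22), "https://example.com/report")]
inside = Rect(100, 10, 106, 22)   # fully inside the link rectangle
half = Rect(97, 10, 103, 22)      # exactly half covered
outside = Rect(96, 10, 102, 22)   # only a third covered
print(url_for_char(inside, links), url_for_char(outside, links))
```

Measuring coverage against the character's own area (rather than the link's) keeps a wide link rectangle from swallowing neighbouring glyphs that it only grazes.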

Installation

pip install claude-pdf2md            # core conversion
pip install 'claude-pdf2md[diff]'    # + WeasyPrint-based visual diff

Python 3.10+ required. Core depends on PyMuPDF, NumPy, Pillow. The [diff] extra adds markdown-it-py and weasyprint (which in turn needs Pango and its companion system libraries on the host; see WeasyPrint's install notes).

Windows

# 1. Python 3.10+ from python.org (tick "Add Python to PATH" during install)
py -m venv .venv
.venv\Scripts\Activate.ps1

# 2. Core install: the package is pure Python and its dependencies ship pre-built wheels, so no system deps
pip install claude-pdf2md

# 3. (optional) [diff] extra — needs GTK for WeasyPrint
#    Install "GTK3 runtime" from https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer/releases
#    then:
pip install 'claude-pdf2md[diff]'

Verify:

claude-pdf2md --help

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
pip install claude-pdf2md              # or 'claude-pdf2md[diff]'

On Linux, the [diff] extra additionally needs libpango-1.0-0, libpangoft2-1.0-0, libharfbuzz0b, and fonts-dejavu (Debian/Ubuntu names; see WeasyPrint's docs for other distributions).

CLI

claude-pdf2md input.pdf -o output.md --assets assets/

# with visual diff against the source pages
claude-pdf2md input.pdf -o output.md --assets assets/ --diff diff/

The --diff directory will contain one page_NNN.diff.png per compared page (source on the left, rendered Markdown on the right), plus a report.json:

{
  "pdf_pages": 15,
  "md_pages":  13,
  "compared":  13,
  "mean_ssim": 0.47,
  "pages": [ { "page": 1, "ssim": 0.47, "pixel_diff_ratio": 0.17, ... } ]
}
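
A report in this shape can be checked from a script, for example to list the pages that deserve a manual look. The sketch below is an assumption-laden example rather than part of the package: the sample values are made up, weak_pages is a hypothetical helper, and the 0.4 cut-off is only a guess based on the SSIM range mentioned under Limitations.

```python
import json

# Made-up sample in the shape shown above; a real run would load diff/report.json.
SAMPLE = """
{
  "pdf_pages": 2,
  "md_pages": 2,
  "compared": 2,
  "mean_ssim": 0.34,
  "pages": [
    {"page": 1, "ssim": 0.47, "pixel_diff_ratio": 0.17},
    {"page": 2, "ssim": 0.21, "pixel_diff_ratio": 0.42}
  ]
}
"""


def weak_pages(report, min_ssim=0.4):
    """Return page numbers whose SSIM falls below min_ssim (an assumed threshold)."""
    return [p["page"] for p in report["pages"] if p["ssim"] < min_ssim]


report = json.loads(SAMPLE)
print(weak_pages(report))  # [2]
```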

Python API

from claude_pdf2md import convert

md = convert(
    "report.pdf",
    output="report.md",
    assets_dir="assets",
)

Plugins via `enrichers=`

As of 0.1.2, convert() / convert_to_string() accept an enrichers list. Each enricher is a lightweight Protocol implementation:

from typing import Protocol

class PageEnricher(Protocol):
    def enrich(self, mu_page, page) -> None: ...  # mutate page in place

Enrichers run once per page right after text extraction and before tables / images / structure / emit, so they can mutate page.blocks (add recognised OCR lines, mark up signatures, drop boilerplate, …) and every downstream pass treats the result exactly like native text.

The canonical use of this hook is claude-pdf2md-ocr, which turns scanned PDFs into Markdown by feeding Tesseract output through the enricher.
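
As an illustration of the hook, here is a small enricher that drops boilerplate blocks. It is a sketch under stated assumptions: StubPage is a hypothetical stand-in that models only the blocks attribute, the blocks are plain strings here (the library's real Block objects will look different), and DropBoilerplate is not shipped with the package.

```python
from dataclasses import dataclass, field


@dataclass
class StubPage:
    """Hypothetical stand-in for the library's Page model; only `blocks` is assumed."""
    blocks: list = field(default_factory=list)


class DropBoilerplate:
    """Enricher that removes blocks a caller-supplied predicate flags as boilerplate."""

    def __init__(self, is_boilerplate):
        self.is_boilerplate = is_boilerplate

    def enrich(self, mu_page, page) -> None:
        # Mutate the list in place so later passes see the filtered blocks.
        page.blocks[:] = [b for b in page.blocks if not self.is_boilerplate(b)]


page = StubPage(blocks=["Confidential - do not distribute", "Findings", "Page 3 of 15"])
enricher = DropBoilerplate(lambda b: b.startswith(("Confidential", "Page ")))
enricher.enrich(mu_page=None, page=page)
print(page.blocks)  # ['Findings']
```

With the real package, an instance like this would be passed through the enrichers list, e.g. convert("report.pdf", output="report.md", enrichers=[...]), with a predicate written against the library's actual block objects.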

Structural validation

convert(validate=True) runs a post-pipeline sanity pass and emits a PdfStructureWarning for each issue it finds:

import warnings
from claude_pdf2md import PdfStructureWarning, convert

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    convert("scan.pdf", output="scan.md", validate=True)

for w in caught:
    if issubclass(w.category, PdfStructureWarning):
        print(w.message)

Current checks: empty page, image-only page (likely un-OCR'd scan — nudges toward claude-pdf2md-ocr), heading-level jumps (e.g. H1 → H3 with no H2 in between). Validation is opt-in and silent on a clean native PDF.
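
The heading-level check is easy to picture in isolation. The following sketch is not the library's implementation: it assumes headings arrive as a list of integer levels in document order, and StructureWarning is a local stand-in for PdfStructureWarning.

```python
import warnings


class StructureWarning(UserWarning):
    """Local stand-in for the library's PdfStructureWarning."""


def heading_jumps(levels):
    """Return (index, previous_level, level) for each heading that skips a level."""
    jumps = []
    for i in range(1, len(levels)):
        prev, cur = levels[i - 1], levels[i]
        if cur > prev + 1:  # e.g. H1 followed directly by H3
            jumps.append((i, prev, cur))
    return jumps


def warn_on_jumps(levels):
    """Emit one warning per skipped heading level."""
    for index, prev, cur in heading_jumps(levels):
        warnings.warn(f"heading {index} jumps from H{prev} to H{cur}", StructureWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_jumps([1, 3, 2, 4])

for w in caught:
    print(w.message)
```

Moving back up the hierarchy (H3 to H1) is not reported; only downward steps of more than one level count as jumps in this sketch.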

Alternatives

If you need PDF → Markdown but want a different set of trade-offs:

pdf2md-claude — runs the PDF through Claude's native-PDF API in ~10-page chunks with context carry-over, rebuilds tables with extended thinking, and reinjects figures from bounding boxes. Higher fidelity on complex layouts, but requires an Anthropic API key, costs per conversion, and doesn't work offline. claude-pdf2md is the opposite trade: local, deterministic, free, fast — and keeps 100% hyperlink recall by reading the PDF's link annotations directly rather than asking a model to re-discover them.

How it works

The pipeline (one pass through the PDF, no external OCR):

1. Extract. Text is read with PyMuPDF's rawdict extraction so every character arrives with its own bounding box, font, size, flags and colour.
2. Link overlay. For each character, intersect its bbox with the page's get_links() rectangles; the winning rectangle (≥50% coverage) tags the character with a URL. The tag is carried into the span-merge step, so a sentence that includes a partial-word link like "See the report here for details" comes out with the link on the exact word, not the whole line.
3. Merge characters to spans. Adjacent characters with identical (font, size, colour, flags, url) are coalesced; a gap wider than 0.6 × fontsize inserts a literal space.
4. Structure. Body-text size is the char-weighted modal font size; sizes ≥ 1.10 × body become heading buckets (H1/H2/H3 by rank). List items are detected by their first-line prefix (1., •, -, …). A continuation pass then joins wrapped lines that share the previous block's x-indent and have no heading/list marker. Citation pills (single-line blocks whose every URL-bearing span is ≥10% smaller than body) are folded into the previous paragraph.
5. Tables & images. page.find_tables() regions become pipe-tables and consume the text blocks inside them; embedded images are written to assets/ and referenced with ![alt](path).
6. Emit. Each block is rendered to Markdown, with consecutive same-URL spans collapsed into a single [...](url) and adjacent bold runs merged around whitespace so the output reads **new investigations** rather than **new** **investigations**.
7. Visual diff. claude-pdf2md ... --diff renders both sides at the same page size and DPI, and reports per-page SSIM + pixel-diff ratio.
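
The character-to-span merge can be sketched in a few lines. This is a simplified example, not the code in extract.py: Char, Span and merge_chars are hypothetical names, colour and flags are left out of the style key for brevity, and placing the gap space at the end of the previous span is an assumption about where the literal space should go.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Char:
    text: str
    x0: float
    x1: float
    font: str
    size: float
    url: Optional[str] = None


@dataclass
class Span:
    text: str
    font: str
    size: float
    url: Optional[str]
    x1: float  # right edge of the last character merged so far


def merge_chars(chars, gap_factor=0.6):
    """Coalesce same-style characters; a wide horizontal gap becomes a space."""
    spans = []
    for ch in chars:
        last = spans[-1] if spans else None
        if last is not None and ch.x0 - last.x1 > gap_factor * ch.size:
            last.text += " "  # gap wider than 0.6 x font size
        if last is not None and (last.font, last.size, last.url) == (ch.font, ch.size, ch.url):
            last.text += ch.text
            last.x1 = ch.x1
        else:
            spans.append(Span(ch.text, ch.font, ch.size, ch.url, ch.x1))
    return spans


URL = "https://example.com"
chars = [
    Char("s", 0, 5, "F", 10.0),
    Char("e", 5, 10, "F", 10.0),
    Char("e", 10, 15, "F", 10.0),
    Char("i", 22, 27, "F", 10.0, URL),  # 7-unit gap, wider than 0.6 * 10
    Char("t", 27, 32, "F", 10.0, URL),
]
print([(s.text, s.url) for s in merge_chars(chars)])
```

Because the URL is part of the style key, a link boundary always starts a new span, which is what lets the emit step wrap only the linked characters.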

Repository layout

claude_pdf2md/
  __init__.py         # public API
  model.py            # BBox, Span, Line, Block, Page, Doc dataclasses
  extract.py          # PyMuPDF → model, char merge, linkage
  links.py            # link-rect → character tagging
  structure.py        # headings, lists, citation & continuation merging
  tables.py           # fitz.find_tables → Markdown tables
  images.py           # embedded-image extraction
  emit.py             # model → Markdown string
  converter.py        # pipeline orchestration
  cli.py              # `claude-pdf2md` entry point
  rendering.py        # MD → HTML → PDF → PNG, page side-by-side
  diff.py             # numpy-only SSIM + coarse pixel diff
tests/
  conftest.py         # synthetic-PDF fixture, optional Bulgaria Watch PDF
  test_links.py       # link recall, no spurious links, partial-word links
  test_structure.py   # heading + list detection
  test_emit.py        # table rendering, bold-run merging
  test_diff.py        # SSIM/pixel-diff sanity checks
  test_integration.py # end-to-end on the Bulgaria Watch research PDF

Limitations (v0.1)

- Typographic fidelity is structural, not pixel-level. The diff uses a neutral serif font for the Markdown side, so SSIM against the original Georgia/Type3 PDF sits around 0.4–0.5 even when the content lines up correctly. Treat the SSIM score as "layout preserved" vs "layout broken", not "visual match".
- No OCR. Scanned PDFs without a text layer produce empty Markdown.
- Heading levels cap at H3 based on the three largest heading sizes in the document. Deeper hierarchies are flattened.
- Footnotes / endnotes aren't split out into a footnote section; they appear inline at the position they occur.
- Two-column layouts are read in PyMuPDF's default order, which is usually top-to-bottom-per-column but not guaranteed.
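
To make the heading cap concrete, here is a rough sketch of size-based bucketing. It is an illustration only: body_size and heading_levels are hypothetical names, spans are reduced to (text, size) pairs, and mapping every size beyond the third largest to H3 is one possible reading of "flattened".

```python
from collections import Counter


def body_size(spans):
    """Char-weighted modal font size over (text, size) pairs."""
    weights = Counter()
    for text, size in spans:
        weights[size] += len(text)
    return weights.most_common(1)[0][0]


def heading_levels(spans, ratio=1.10, max_level=3):
    """Map each heading-sized font size to a level, largest first, capped at max_level."""
    body = body_size(spans)
    sizes = sorted({size for _, size in spans if size >= ratio * body}, reverse=True)
    return {size: min(rank, max_level) for rank, size in enumerate(sizes, start=1)}


spans = [
    ("Title", 24.0),
    ("Section", 18.0),
    ("Subsection", 14.0),
    ("Minor heading", 12.0),
    ("Body text " * 20, 10.0),
]
print(heading_levels(spans))
```

In this example the 12 pt headings share H3 with the 14 pt ones, which is the flattening the limitation describes.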

License

MIT — see LICENSE.

Note that PyMuPDF, this project's primary runtime dependency, is AGPL-3.0 (or commercial). Distributing claude-pdf2md together with PyMuPDF binaries subjects the combined distribution to AGPL obligations. The claude-pdf2md source itself remains MIT.

Project details

Maintainers

skippdot

Release history

- 0.1.3 (this version): Apr 21, 2026
- 0.1.2: Apr 21, 2026
- 0.1.1: Apr 21, 2026

File details

Details for the file claude_pdf2md-0.1.3.tar.gz.

File metadata

Download URL: claude_pdf2md-0.1.3.tar.gz
Upload date: Apr 21, 2026
Size: 43.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claude_pdf2md-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`5d42d10bdaf4ae47f72cd1cfd4aa06d7a9a82eea54794a2ab887ef103e8ec576`
MD5	`0f6fea111a4f8264b3e6b2e4a0a8ffbf`
BLAKE2b-256	`a3bf8f0d4f26b25afbffaee854f859c472a7c65c38e6cea74383ad12bbcb61c4`

Provenance

The following attestation bundles were made for claude_pdf2md-0.1.3.tar.gz:

Publisher: release.yml on skippdot/claude-pdf2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md-0.1.3.tar.gz
- Subject digest: 5d42d10bdaf4ae47f72cd1cfd4aa06d7a9a82eea54794a2ab887ef103e8ec576
- Sigstore transparency entry: 1352567582
- Sigstore integration time: Apr 21, 2026
Source repository:
- Permalink: skippdot/claude-pdf2md@ceff586f7095393f3b0dd431aac72db8d5f23820
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/skippdot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ceff586f7095393f3b0dd431aac72db8d5f23820
- Trigger Event: push

File details

Details for the file claude_pdf2md-0.1.3-py3-none-any.whl.

File metadata

Download URL: claude_pdf2md-0.1.3-py3-none-any.whl
Upload date: Apr 21, 2026
Size: 28.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claude_pdf2md-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1502db5f2aa9d68b52f96c91166be2ce65bfd05e52a2c27522601d8c05b01144`
MD5	`8edef9ca213d2564aeb3f666286d86cd`
BLAKE2b-256	`18f05eb5d86909143e974e83b2de83f5ca9e350b2eff1db9138e04151f6fc44d`

Provenance

The following attestation bundles were made for claude_pdf2md-0.1.3-py3-none-any.whl:

Publisher: release.yml on skippdot/claude-pdf2md

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claude_pdf2md-0.1.3-py3-none-any.whl
- Subject digest: 1502db5f2aa9d68b52f96c91166be2ce65bfd05e52a2c27522601d8c05b01144
- Sigstore transparency entry: 1352567701
- Sigstore integration time: Apr 21, 2026
Source repository:
- Permalink: skippdot/claude-pdf2md@ceff586f7095393f3b0dd431aac72db8d5f23820
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/skippdot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@ceff586f7095393f3b0dd431aac72db8d5f23820
- Trigger Event: push
