claude-pdf2md

Convert PDFs — especially Claude Web UI research exports — into Markdown with hyperlinks and visual structure preserved, and with a built-in visual diff to measure how close the rendered Markdown actually looks to the PDF.

Motivation

Claude Web UI's research feature produces two artefacts per report:

  • a PDF with working hyperlinks on every citation, and
  • a .md file with the same prose but all URLs stripped.

That asymmetry makes the .md output nearly useless for downstream work: quotes without sources, references without destinations, reports you can re-read but not verify. The PDF keeps the URLs as real PDF link annotations (rectangles drawn over the glyph run), so the information is present — it just isn't in the text layer.

claude-pdf2md rebuilds the Markdown from the PDF, not from Claude's textual export, and uses those link annotations as the source of truth for every citation.

What it does

End-to-end PDF → Markdown with:

  • 100% link recall via character-level overlay of the PDF's link annotation rectangles onto the glyph bounding boxes, so every [text](url) reflects what the PDF actually linked.
  • Structure detection: headings (by font size relative to the body text), bullet and numbered lists, paragraph reflow across wrapped lines, tables (PyMuPDF's page.find_tables()), embedded images (dumped to an assets/ dir).
  • Citation-pill absorption: Claude's research PDFs render each citation as a small-font "pill" floating inline; these would otherwise break up the surrounding sentence. The pill blocks are detected by font-size and appended to the preceding paragraph.
  • Visual diff as a quality gate: each page of the source PDF is rendered to PNG, the generated Markdown is re-rendered to PDF via WeasyPrint and to PNG via PyMuPDF, and the two are compared with SSIM. Side-by-side diff images and a JSON report are written for manual inspection.
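
The character-level link overlay from the first bullet can be sketched in plain Python. This is a minimal illustration rather than the package's code: the `Rect` tuple, the function names `coverage` and `url_for_char`, and the sample coordinates are all assumptions made for the example. The 50% threshold is the one quoted under "How it works", read here as the fraction of the character's box covered by the link rectangle.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    # Degenerate (non-overlapping) rectangles get area 0.
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def coverage(char_box: Rect, link_box: Rect) -> float:
    """Fraction of the character box that lies inside the link rectangle."""
    inter = Rect(
        max(char_box.x0, link_box.x0),
        max(char_box.y0, link_box.y0),
        min(char_box.x1, link_box.x1),
        min(char_box.y1, link_box.y1),
    )
    char_area = area(char_box)
    return area(inter) / char_area if char_area > 0 else 0.0


def url_for_char(char_box: Rect, links, min_coverage: float = 0.5) -> Optional[str]:
    """Return the URL of the best-covering link rectangle, or None."""
    best_url, best_cov = None, 0.0
    for link_box, url in links:
        cov = coverage(char_box, link_box)
        if cov >= min_coverage and cov > best_cov:
            best_url, best_cov = url, cov
    return best_url


# Hypothetical line "See here": each glyph is 6 units wide, only "here" is linked.
chars = [(ch, Rect(i * 6.0, 0.0, (i + 1) * 6.0, 10.0)) for i, ch in enumerate("See here")]
links = [(Rect(24.0, 0.0, 48.0, 10.0), "https://example.com")]

tagged = [(ch, url_for_char(box, links)) for ch, box in chars]
linked_text = "".join(ch for ch, url in tagged if url)
```

Tagging per character rather than per line is what lets a link that covers only part of a line stay attached to exactly those glyphs.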

Installation

pip install claude-pdf2md            # core conversion
pip install 'claude-pdf2md[diff]'    # + WeasyPrint-based visual diff

Requires Python 3.10+. The core depends on PyMuPDF, NumPy, and Pillow. The [diff] extra adds markdown-it-py and WeasyPrint, which in turn needs cairo/pango on the host (see WeasyPrint's install notes).

CLI

claude-pdf2md input.pdf -o output.md --assets assets/

# with visual diff against the source pages
claude-pdf2md input.pdf -o output.md --assets assets/ --diff diff/

The --diff directory will contain one page_NNN.diff.png per compared page (source on the left, rendered Markdown on the right), plus a report.json:

{
  "pdf_pages": 15,
  "md_pages":  13,
  "compared":  13,
  "mean_ssim": 0.47,
  "pages": [ { "page": 1, "ssim": 0.47, "pixel_diff_ratio": 0.17, ... } ]
}
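
One way to use the report as a gate is to flag the pages whose SSIM falls below a cut-off. The sketch below assumes only the keys visible in the sample above (`pages`, `page`, `ssim`); the function name `low_ssim_pages`, the 0.35 threshold, and the second page entry are made up for illustration.

```python
import json


def low_ssim_pages(report: dict, threshold: float) -> list[int]:
    """Return the page numbers whose SSIM is below the given threshold."""
    return [p["page"] for p in report.get("pages", []) if p["ssim"] < threshold]


# Illustrative report: page 1 mirrors the sample above, page 2 is invented.
sample = json.loads("""
{
  "pdf_pages": 15,
  "md_pages": 13,
  "compared": 13,
  "mean_ssim": 0.47,
  "pages": [
    {"page": 1, "ssim": 0.47, "pixel_diff_ratio": 0.17},
    {"page": 2, "ssim": 0.21, "pixel_diff_ratio": 0.40}
  ]
}
""")

flagged = low_ssim_pages(sample, threshold=0.35)
```

In practice the dictionary would come from `json.load()` on the report.json file inside the --diff directory.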

Python API

from claude_pdf2md import convert

md = convert(
    "report.pdf",
    output="report.md",
    assets_dir="assets",
)
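
A quick sanity check on the result is to list the links that made it into the Markdown. The sketch below works on any Markdown string, so it applies whether the text is read back from report.md or taken from the return value of convert() (assuming that value is the Markdown text). The regex and the name `markdown_links` are illustrative and deliberately simple: URLs containing parentheses or whitespace are not handled.

```python
import re

# [text](url), skipping image references of the form ![alt](path).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_links(markdown: str) -> list[tuple[str, str]]:
    """Return (text, url) pairs for inline links, ignoring images."""
    return LINK_RE.findall(markdown)


example = "See [the report](https://example.com/a) and ![fig](assets/img.png)."
links = markdown_links(example)
```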

How it works

The pipeline (one pass through the PDF, no external OCR):

  1. Extract. Each page is read through PyMuPDF's rawdict text extraction, so every character arrives with its own bounding box, font, size, flags and colour.
  2. Link overlay. For each character, intersect its bbox with the page's get_links() rectangles; the winning rectangle (≥50% coverage) tags the character with a URL. The tag is carried into the span-merge step, so a sentence in which only part of the line is linked, like "See the report here for details", comes out with the link on the exact words, not the whole line.
  3. Merge characters to spans. Adjacent characters with identical (font, size, colour, flags, url) are coalesced; a gap wider than 0.6 × fontsize inserts a literal space.
  4. Structure. Body-text size is the char-weighted modal font size; sizes ≥ 1.10 × body become heading buckets (H1/H2/H3 by rank). List items are detected by their first-line prefix (1., •, -, …). A continuation pass then joins wrapped lines that share the previous block's x-indent and have no heading/list marker. Citation pills (single-line blocks whose every URL-bearing span is ≥10% smaller than body) are folded into the previous paragraph.
  5. Tables & images. page.find_tables() regions become pipe-tables and consume the text blocks inside them; embedded images are written to assets/ and referenced with ![alt](path).
  6. Emit. Each block is rendered to Markdown, with consecutive same-URL spans collapsed into a single [...](url) and adjacent bold runs merged around whitespace so the output reads **new investigations** rather than **new** **investigations**.
  7. Visual diff. claude-pdf2md ... --diff renders both sides at the same page size and DPI, and reports per-page SSIM + pixel-diff ratio.
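
Steps 3 and 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the tuple layouts, the function names, and the sample values are hypothetical; the style key is reduced to (size, url) instead of the full (font, size, colour, flags, url); and sizes ranked below the third heading size are assumed to flatten to H3, in line with the H3 cap listed under Limitations.

```python
from collections import Counter


def merge_chars(chars, gap_factor=0.6):
    """Coalesce (ch, x0, x1, size, url) characters into (text, size, url) spans.

    A horizontal gap wider than gap_factor * size inserts a literal space.
    """
    spans = []
    prev_x1 = None
    for ch, x0, x1, size, url in chars:
        gap = prev_x1 is not None and (x0 - prev_x1) > gap_factor * size
        piece = (" " if gap else "") + ch
        if spans and spans[-1][1:] == (size, url):
            # Same style as the previous span: extend it.
            spans[-1] = (spans[-1][0] + piece, size, url)
        else:
            spans.append((piece, size, url))
        prev_x1 = x1
    return spans


def body_size(spans):
    """Char-weighted modal font size: each span votes with its character count."""
    votes = Counter()
    for text, size, _url in spans:
        votes[round(size, 1)] += len(text)
    return votes.most_common(1)[0][0]


def heading_levels(spans, ratio=1.10, max_level=3):
    """Map each font size >= ratio * body to a heading level, largest size first."""
    body = body_size(spans)
    sizes = sorted({round(s, 1) for _t, s, _u in spans if s >= ratio * body}, reverse=True)
    return {size: min(rank, max_level) for rank, size in enumerate(sizes, start=1)}


# Hypothetical glyph run "ab cd": the 8-unit gap before "c" exceeds 0.6 * 10.
chars = [
    ("a", 0.0, 5.0, 10.0, None),
    ("b", 5.0, 10.0, 10.0, None),
    ("c", 18.0, 23.0, 10.0, None),
    ("d", 23.0, 28.0, 10.0, None),
]
merged = merge_chars(chars)

# Hypothetical document: 11 pt body text dominates the character count.
doc_spans = [
    ("Report title", 24.0, None),
    ("Section", 18.0, None),
    ("Subsection", 14.0, None),
    ("Minor heading", 12.5, None),
    ("Body text " * 20, 11.0, None),
    ("source.com", 9.0, "https://example.com"),
]
levels = heading_levels(doc_spans)
```

Weighting the mode by character count keeps a handful of large headings or small citation pills from being mistaken for the body size.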

Repository layout

claude_pdf2md/
  __init__.py         # public API
  model.py            # BBox, Span, Line, Block, Page, Doc dataclasses
  extract.py          # PyMuPDF → model, char merge, linkage
  links.py            # link-rect → character tagging
  structure.py        # headings, lists, citation & continuation merging
  tables.py           # fitz.find_tables → Markdown tables
  images.py           # embedded-image extraction
  emit.py             # model → Markdown string
  converter.py        # pipeline orchestration
  cli.py              # `claude-pdf2md` entry point
  rendering.py        # MD → HTML → PDF → PNG, page side-by-side
  diff.py             # numpy-only SSIM + coarse pixel diff
tests/
  conftest.py         # synthetic-PDF fixture, optional Bulgaria Watch PDF
  test_links.py       # link recall, no spurious links, partial-word links
  test_structure.py   # heading + list detection
  test_emit.py        # table rendering, bold-run merging
  test_diff.py        # SSIM/pixel-diff sanity checks
  test_integration.py # end-to-end on the Bulgaria Watch research PDF

Limitations (v0.1)

  • Typographic fidelity is structural, not pixel-level. The diff uses a neutral serif font for the Markdown side, so SSIM against the original Georgia/Type3 PDF sits around 0.4–0.5 even when the content lines up correctly. Treat the SSIM score as "layout preserved" vs "layout broken", not "visual match".
  • No OCR. Scanned PDFs without a text layer produce empty Markdown.
  • Heading levels cap at H3 based on the three largest heading sizes in the document. Deeper hierarchies are flattened.
  • Footnotes / endnotes aren't split out into a footnote section; they appear inline at the position they occur.
  • Two-column layouts are read in PyMuPDF's default order, which is usually top-to-bottom-per-column but not guaranteed.
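
To make the SSIM caveat concrete, here is a minimal NumPy sketch of a global (single-window) SSIM and a coarse pixel-diff ratio on grayscale arrays. It is an approximation for illustration only: the package may well use a windowed variant, and the function names, the `tol` parameter, and the random test image are assumptions. The constants are the conventional SSIM defaults (K1 = 0.01, K2 = 0.03).

```python
import numpy as np


def global_ssim(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """SSIM computed once over the whole image instead of over sliding windows."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


def pixel_diff_ratio(a: np.ndarray, b: np.ndarray, tol: int = 16) -> float:
    """Fraction of pixels whose absolute difference exceeds tol grey levels."""
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    return float((diff > tol).mean())


# Synthetic 32x32 grayscale "page" and its photographic negative.
rng = np.random.default_rng(0)
page = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
inverted = 255 - page

same_score = global_ssim(page, page)
inverted_score = global_ssim(page, inverted)
```

An identical pair scores 1, while structurally opposite images score far lower, which is why the number is better read as "layout preserved" versus "layout broken" than as a measure of typographic match.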

License

MIT — see LICENSE.

Note that PyMuPDF, this project's primary runtime dependency, is AGPL-3.0 (or commercial). Distributing claude-pdf2md together with PyMuPDF binaries subjects the combined distribution to AGPL obligations. The claude-pdf2md source itself remains MIT.
