claude-pdf2md
Convert PDFs — especially Claude Web UI research exports — into Markdown with hyperlinks and visual structure preserved, and with a built-in visual diff to measure how close the rendered Markdown actually looks to the PDF.
Motivation
Claude Web UI's research feature produces two artefacts per report:
- a PDF with working hyperlinks on every citation, and
- a .md file with the same prose but all URLs stripped.
That asymmetry makes the .md output nearly useless for downstream work:
quotes without sources, references without destinations, reports you can
re-read but not verify. The PDF keeps the URLs as real PDF link annotations
(rectangles drawn over the glyph run), so the information is present — it just
isn't in the text layer.
claude-pdf2md rebuilds the Markdown from the PDF, not from Claude's
textual export, and uses those link annotations as the source of truth for
every citation.
What it does
End-to-end PDF → Markdown with:
- 100% link recall via character-level overlay of the PDF's link annotation rectangles onto the glyph bounding boxes, so every [text](url) reflects what the PDF actually linked.
- Structure detection: headings (by font-size percentile), bullet and numbered lists, paragraph reflow across wrapped lines, tables (fitz.find_tables), embedded images (dumped to an assets/ dir).
- Citation-pill absorption: Claude's research PDFs render each citation as a small-font "pill" floating inline; these would otherwise break up the surrounding sentence. The pill blocks are detected by font-size and appended to the preceding paragraph.
- Visual diff as a quality gate: each page of the source PDF is rendered to PNG, the generated Markdown is re-rendered to PDF via WeasyPrint and to PNG via PyMuPDF, and the two are compared with SSIM. Side-by-side diff images and a JSON report are written for manual inspection.
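The character-level link overlay can be illustrated with a small sketch. This is not the project's code: the `Rect` class, the `url_for_char` helper and the `min_coverage` parameter are hypothetical names chosen for the example, and the rule shown (the link rectangle covering the largest share of the glyph box wins, provided it covers at least half) is one plausible reading of the overlay described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in PDF points (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection_area(self, other: "Rect") -> float:
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


def url_for_char(char_box: Rect, links: list[tuple[Rect, str]],
                 min_coverage: float = 0.5) -> Optional[str]:
    """Return the URL of the link rectangle that covers most of the glyph box.

    A rectangle only qualifies if it covers at least `min_coverage` of the
    glyph's own area (assumed threshold for this sketch).
    """
    area = char_box.area()
    if area == 0.0:
        return None
    best_url, best_cov = None, 0.0
    for rect, url in links:
        coverage = char_box.intersection_area(rect) / area
        if coverage >= min_coverage and coverage > best_cov:
            best_url, best_cov = url, coverage
    return best_url
```

Measuring coverage relative to the glyph box rather than the link rectangle means a long link cannot "swallow" neighbouring characters that only touch its edge.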
Installation
pip install claude-pdf2md # core conversion
pip install 'claude-pdf2md[diff]' # + WeasyPrint-based visual diff
Python 3.10+ required. Core depends on PyMuPDF, NumPy, Pillow. The [diff]
extra adds markdown-it-py and weasyprint (which in turn needs cairo/pango
on the host; see WeasyPrint's install notes).
Windows
# 1. Python 3.10+ from python.org (tick "Add Python to PATH" during install)
py -m venv .venv
.venv\Scripts\Activate.ps1
# 2. Core install — pure Python with pre-built wheels, no system deps
pip install claude-pdf2md
# 3. (optional) [diff] extra — needs GTK for WeasyPrint
# Install "GTK3 runtime" from https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer/releases
# then:
pip install 'claude-pdf2md[diff]'
Verify:
claude-pdf2md --help
macOS / Linux
python3 -m venv .venv
source .venv/bin/activate
pip install claude-pdf2md # or 'claude-pdf2md[diff]'
On Linux, the [diff] extra additionally needs libpango-1.0-0,
libpangoft2-1.0-0, libharfbuzz0b, and fonts-dejavu (Debian/Ubuntu names;
see WeasyPrint's docs for other distributions).
CLI
claude-pdf2md input.pdf -o output.md --assets assets/
# with visual diff against the source pages
claude-pdf2md input.pdf -o output.md --assets assets/ --diff diff/
The --diff directory will contain one page_NNN.diff.png per compared page
(source on the left, rendered Markdown on the right), plus a report.json:
{
  "pdf_pages": 15,
  "md_pages": 13,
  "compared": 13,
  "mean_ssim": 0.47,
  "pages": [ { "page": 1, "ssim": 0.47, "pixel_diff_ratio": 0.17, ... } ]
}
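For scripted checks, the report can be read back with the standard library. The sketch below relies only on the keys shown in the sample above (`pdf_pages`, `compared`, `pages`, `page`, `ssim`); the function names and the 0.3 threshold are assumptions made for illustration, not part of the tool.

```python
import json
from pathlib import Path


def summarize(report: dict, threshold: float = 0.3) -> dict:
    """Flag pages whose SSIM is below `threshold` (an arbitrary example value)."""
    low = [p["page"] for p in report["pages"] if p["ssim"] < threshold]
    return {
        # Source pages that had no rendered counterpart to compare against.
        "uncompared_pages": report["pdf_pages"] - report["compared"],
        "low_ssim_pages": low,
    }


def summarize_file(path: str, threshold: float = 0.3) -> dict:
    """Load a report.json from disk and summarize it."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    return summarize(report, threshold)
```

A call such as `summarize_file("diff/report.json")` would then list the pages worth opening in the side-by-side images first.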
Python API
from claude_pdf2md import convert
md = convert(
    "report.pdf",
    output="report.md",
    assets_dir="assets",
)
Plugins via enrichers=
As of 0.1.2, convert() / convert_to_string() accept an enrichers
list. Each enricher is a lightweight Protocol implementation:
from typing import Protocol

class PageEnricher(Protocol):
    def enrich(self, mu_page, page) -> None: ...
Enrichers run once per page right after text extraction and before tables /
images / structure / emit, so they can mutate page.blocks (add recognised
OCR lines, mark up signatures, drop boilerplate, …) and every downstream
pass treats the result exactly like native text.
The canonical use of this hook is claude-pdf2md-ocr,
which turns scanned PDFs into Markdown by feeding Tesseract output through
the enricher.
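As an illustration of the hook, here is a minimal sketch of an enricher that drops boilerplate blocks. The only things taken from the description above are the `enrich(mu_page, page)` signature and the fact that `page.blocks` may be mutated. The `DropBoilerplate` class and its `text_of` accessor are hypothetical: the block model's attributes are not documented here, so the caller supplies a function that turns a block into its text. The sketch also assumes `page.blocks` is a plain list.

```python
from typing import Callable


class DropBoilerplate:
    """Remove blocks whose text exactly matches a known boilerplate phrase."""

    def __init__(self, phrases: set[str], text_of: Callable[[object], str]):
        self.phrases = {p.strip().lower() for p in phrases}
        # Hypothetical accessor: how to read the text of one block.
        self.text_of = text_of

    def enrich(self, mu_page, page) -> None:
        kept = [
            block for block in page.blocks
            if self.text_of(block).strip().lower() not in self.phrases
        ]
        # Mutate in place so any other reference to the list sees the change.
        page.blocks[:] = kept
```

An instance would be passed through the `enrichers` list, for example `enrichers=[DropBoilerplate({"confidential"}, text_of=...)]`, with `text_of` written against the real block type.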
How it works
The pipeline (one pass through the PDF, no external OCR):
- Extract. PyMuPDF is read with rawdict so every character arrives with its own bounding box, font, size, flags and colour.
- Link overlay. For each character, intersect its bbox with the page's get_links() rectangles; the winning rectangle (≥50% coverage) tags the character with a URL. The tag is carried into the span-merge step, so a sentence that includes a partial-word link like "See the report here for details" comes out with the link on the exact word, not the whole line.
- Merge characters to spans. Adjacent characters with identical (font, size, colour, flags, url) are coalesced; a gap wider than 0.6 × fontsize inserts a literal space.
- Structure. Body-text size is the char-weighted modal font size; sizes ≥ 1.10 × body become heading buckets (H1/H2/H3 by rank). List items are detected by their first-line prefix (1., •, -, …). A continuation pass then joins wrapped lines that share the previous block's x-indent and have no heading/list marker. Citation pills (single-line blocks whose every URL-bearing span is ≥10% smaller than body) are folded into the previous paragraph.
- Tables & images. page.find_tables() regions become pipe-tables and consume the text blocks inside them; embedded images are written to assets/ and referenced with Markdown image syntax.
- Emit. Each block is rendered to Markdown, with consecutive same-URL spans collapsed into a single [...](url) and adjacent bold runs merged around whitespace so the output reads **new investigations** rather than **new** **investigations**.
- Visual diff. claude-pdf2md ... --diff renders both sides at the same page size and DPI, and reports per-page SSIM + pixel-diff ratio.
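The character-to-span merge step can be sketched as follows. This is a simplified illustration, not the project's `extract.py`: the `Char` and `Span` classes are hypothetical, colour and flags are left out of the merge key for brevity, all characters are assumed to sit on one line in reading order, and attaching the inserted space to the preceding span is a choice made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Char:
    text: str
    x0: float  # left edge of the glyph box
    x1: float  # right edge of the glyph box
    font: str
    size: float
    url: Optional[str] = None


@dataclass
class Span:
    text: str
    font: str
    size: float
    url: Optional[str]


def merge_chars(chars: list[Char], gap_factor: float = 0.6) -> list[Span]:
    """Coalesce same-style characters; wide horizontal gaps become spaces."""
    spans: list[Span] = []
    prev_x1: Optional[float] = None
    for ch in chars:
        # A gap wider than gap_factor * font size is treated as a word break.
        if spans and prev_x1 is not None and ch.x0 - prev_x1 > gap_factor * ch.size:
            spans[-1].text += " "
        key = (ch.font, ch.size, ch.url)
        if spans and (spans[-1].font, spans[-1].size, spans[-1].url) == key:
            spans[-1].text += ch.text
        else:
            spans.append(Span(ch.text, ch.font, ch.size, ch.url))
        prev_x1 = ch.x1
    return spans
```

Because the URL is part of the merge key, a linked word always starts a new span, which is what keeps the link on the exact word rather than the whole line.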
Repository layout
claude_pdf2md/
    __init__.py          # public API
    model.py             # BBox, Span, Line, Block, Page, Doc dataclasses
    extract.py           # PyMuPDF → model, char merge, linkage
    links.py             # link-rect → character tagging
    structure.py         # headings, lists, citation & continuation merging
    tables.py            # fitz.find_tables → Markdown tables
    images.py            # embedded-image extraction
    emit.py              # model → Markdown string
    converter.py         # pipeline orchestration
    cli.py               # `claude-pdf2md` entry point
    rendering.py         # MD → HTML → PDF → PNG, page side-by-side
    diff.py              # numpy-only SSIM + coarse pixel diff
tests/
    conftest.py          # synthetic-PDF fixture, optional Bulgaria Watch PDF
    test_links.py        # link recall, no spurious links, partial-word links
    test_structure.py    # heading + list detection
    test_emit.py         # table rendering, bold-run merging
    test_diff.py         # SSIM/pixel-diff sanity checks
    test_integration.py  # end-to-end on the Bulgaria Watch research PDF
Limitations (v0.1)
- Typographic fidelity is structural, not pixel-level. The diff uses a neutral serif font for the Markdown side, so SSIM against the original Georgia/Type3 PDF sits around 0.4–0.5 even when the content lines up correctly. Treat the SSIM score as "layout preserved" vs "layout broken", not "visual match".
- No OCR. Scanned PDFs without a text layer produce empty Markdown.
- Heading levels cap at H3 based on the three largest heading sizes in the document. Deeper hierarchies are flattened.
- Footnotes / endnotes aren't split out into a footnote section; they appear inline at the position they occur.
- Two-column layouts are read in PyMuPDF's default order, which is usually top-to-bottom-per-column but not guaranteed.
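To make the H3 cap concrete, here is a minimal sketch of heading bucketing by font size. It follows the rules stated in this README (char-weighted modal body size, headings at ≥ 1.10 × body, levels by rank), but the function name, its input shape, and the choice to map every heading size beyond the third-largest to H3 are assumptions for illustration.

```python
from collections import Counter


def heading_levels(spans: list[tuple[str, float]], ratio: float = 1.10,
                   max_levels: int = 3) -> dict[float, int]:
    """Map heading font sizes to levels 1..max_levels.

    `spans` is a list of (text, font_size) pairs (hypothetical input shape).
    """
    # Weight each font size by the number of characters set in it.
    weights: Counter = Counter()
    for text, size in spans:
        weights[size] += len(text)
    if not weights:
        return {}
    body = weights.most_common(1)[0][0]  # char-weighted modal size
    heading_sizes = sorted((s for s in weights if s >= ratio * body), reverse=True)
    levels = {size: rank for rank, size in enumerate(heading_sizes[:max_levels], start=1)}
    # Assumed flattening: smaller heading sizes share the deepest level.
    for size in heading_sizes[max_levels:]:
        levels[size] = max_levels
    return levels
```

With four distinct heading sizes in a document, the two smallest would both come out as H3 under this sketch, which is the flattening described above.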
License
MIT — see LICENSE.
Note that PyMuPDF, this project's primary runtime dependency, is AGPL-3.0 (or
commercial). Distributing claude-pdf2md together with PyMuPDF binaries
subjects the combined distribution to AGPL obligations. The
claude-pdf2md source itself remains MIT.