pagetomd
Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.
Why
- AI-ready by default. Output is NFC-normalised UTF-8, single H1, monotonic heading hierarchy, no zero-width junk, no tracking chrome — drops straight into a vector store or LLM prompt.
- Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".
- Static fast, JS-capable when needed. Default httpx fetcher is sub-second; opt-in playwright extra (or --fetcher auto) handles SPA shells without bloating the install for everyone else.
- Stable, scriptable CLI. Typer-built, full env-var precedence (PAGETOMD_*), stable exit codes (0/2/3/4/5/64/130), structured logs (--log-json), and a --no-fetched-at switch for byte-deterministic output.
- Not pandoc or curl + sed. pandoc doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled curl | html2md pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. pagetomd is one command for the whole pipeline.
Install
With pipx (recommended for CLI use)
pipx install pagetomd
# optional: enable JS rendering for SPAs
pipx inject pagetomd playwright && playwright install chromium
With uv
uv tool install pagetomd
With pip
pip install pagetomd # core
pip install 'pagetomd[playwright]' # + SPA support
Quick start
# Default: derives output filename from the page title
pagetomd https://example.com/blog/post
# Stream to stdout (pipe into LLMs, etc.)
pagetomd https://example.com/blog/post -o -
# Deterministic output (omits fetched_at — good for snapshot tests / RAG ingestion)
pagetomd https://example.com/blog/post --no-fetched-at -o post.md
# Auto-detect SPA pages and fall back to headless Chromium
pagetomd https://my-spa.example.com -o - --fetcher auto
Cookbook
Pipe into an LLM
-o - writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:
pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"
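The same stdout/stderr split can be used from a script. The sketch below is illustrative only: fetch_markdown and its command parameter are hypothetical names, and the only pagetomd behaviour it relies on is the documented "-o -" flag and the logs-on-stderr convention.

```python
import subprocess
import sys
from typing import Sequence


def fetch_markdown(url: str, command: Sequence[str] = ("pagetomd",)) -> str:
    """Run `pagetomd <url> -o -` and return the Markdown written to stdout.

    `command` is a hypothetical hook so the executable can be swapped out
    (for example in tests); it is not part of pagetomd itself.
    """
    proc = subprocess.run(
        [*command, url, "-o", "-"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Logs arrive on stderr, so forward them without touching the Markdown.
    sys.stderr.write(proc.stderr)
    if proc.returncode != 0:
        raise RuntimeError(f"pagetomd exited with code {proc.returncode}")
    return proc.stdout
```

Because stdout carries only the document, the returned string can be passed straight to a prompt template or a chunker.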
Batch-convert from a file
while read -r url; do
pagetomd "$url"
done < urls.txt
Each successful conversion exits 0; any non-zero exit leaves the loop
running but is visible in stderr (see Exit codes below).
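When the failures need to be collected rather than read off stderr, a small wrapper can record each non-zero exit. This is a sketch under assumptions: run_pagetomd and convert_all are hypothetical helper names, and only the plain "pagetomd <url>" invocation shown above is used.

```python
import subprocess
from typing import Callable, Dict, Iterable


def run_pagetomd(url: str) -> int:
    """Hypothetical helper: run one conversion and return its exit code."""
    return subprocess.run(["pagetomd", url], check=False).returncode


def convert_all(
    urls: Iterable[str],
    runner: Callable[[str], int] = run_pagetomd,
) -> Dict[str, int]:
    """Convert every URL, keep going after failures, and return {url: exit_code}
    for the conversions that did not exit 0."""
    failures: Dict[str, int] = {}
    for raw in urls:
        url = raw.strip()
        if not url:  # tolerate blank lines in urls.txt
            continue
        code = runner(url)
        if code != 0:
            failures[url] = code
    return failures
```

A typical call would be convert_all(open("urls.txt")), after which the returned mapping can be checked against the exit-code table.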
Output shape
Running pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o - against the blog.html fixture prints (first ~15 lines shown):
---
url: http://127.0.0.1:8765/blog.html
final_url: http://127.0.0.1:8765/blog.html
title: Why We Rewrote Our Build System in Rust
author: Jane Doe
date: '2024-08-14'
description: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.
site_name: Example Engineering Blog
language: en
tool: pagetomd
tool_version: 0.1.0
---
# Why We Rewrote Our Build System in Rust
Three years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...
When fetched_at is enabled (the default), an extra fetched_at: '2026-06-15T12:34:56Z' line is included in the frontmatter. Fields whose value cannot be detected (e.g. language, author) are omitted from the YAML.
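For downstream ingestion, the frontmatter can be separated from the body before indexing. The following is a deliberately naive sketch: split_frontmatter is a hypothetical helper that only understands flat "key: value" lines like the ones shown above, and a real YAML parser would be the safer choice for anything beyond that.

```python
from typing import Dict, Tuple


def split_frontmatter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Assumes the frontmatter is delimited by lines that are exactly '---' and
    contains only flat scalar fields; returns ({}, markdown) otherwise.
    """
    lines = markdown.splitlines()
    if not lines or lines[0] != "---":
        return {}, markdown
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, markdown

    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    body = "\n".join(lines[end + 1:])
    return meta, body
```

Since undetected fields are omitted from the YAML, callers should use meta.get("author") rather than indexing directly.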
Common options
A compact overview — see pagetomd --help for the full list.
| Flag | Default | Description |
|---|---|---|
| --output / -o | derived from title | Output path, or - for stdout. |
| --overwrite | false | Replace an existing destination file. |
| --follow-symlinks / --no-follow-symlinks | false | Allow writes to a symlinked destination. Off by default so --overwrite cannot be tricked into clobbering a file outside the intended directory via a symlink. |
| --fetcher | httpx | httpx, playwright, or auto. |
| --timeout | 30.0 | Per-request HTTP timeout (seconds). |
| --retries | 3 | Retry attempts on transient failures. |
| --user-agent | pagetomd/<ver> | Override the outbound User-Agent. |
| --no-verify-ssl | false | Disable TLS certificate verification (for corporate proxies that re-sign HTTPS). |
| --respect-robots / --no-respect-robots | true | Honour robots.txt (relaxed for loopback/RFC 1918). |
| --max-redirects | 10 | Cap on the redirect chain length. |
| --include-comments / --no-include-comments | false | Preserve HTML comments in the extracted document. |
| --include-images / --no-include-images | true | Keep image syntax in output. |
| --include-links / --no-include-links | true | Keep link URLs in output. |
| --heading-style | atx | atx (#) or setext (===). |
| --code-fences / --no-code-fences | true | Use fenced code blocks instead of indented ones. |
| --wide-tables | kv | Wide-table strategy: kv, html, or drop. |
| --no-fetched-at | false | Omit fetched_at for byte-deterministic output. |
| --log-level | info | debug, info, warning, error. |
| --log-json | false | Emit logs as JSON lines on stderr. |
| --debug | false | Shortcut for --log-level=debug + tracebacks on error. |
| --playwright-idle-ms | 500 | Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only). |
| --version | n/a | Print the installed version and exit. |
Environment variables
Every flag has a PAGETOMD_<UPPER_NAME> equivalent. For example:
PAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com
CLI flags always override env vars; env vars override the built-in defaults.
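The precedence rule can be pictured as a three-step lookup. The sketch below is not pagetomd's own code: resolve_option is a hypothetical function, and the mapping of hyphens to underscores in the variable name is an assumption about how PAGETOMD_<UPPER_NAME> is derived.

```python
import os
from typing import Mapping, Optional


def resolve_option(
    name: str,
    cli_value: Optional[str],
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the effective value for one option: CLI flag, then env var, then default."""
    if cli_value is not None:
        return cli_value  # an explicit flag always wins
    # Assumption: --max-redirects would map to PAGETOMD_MAX_REDIRECTS.
    env_key = "PAGETOMD_" + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key]
    return default
```

With PAGETOMD_TIMEOUT=60 set and no --timeout flag, this lookup yields "60"; passing --timeout 5 yields "5" regardless of the environment.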
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Unexpected internal error. |
| 2 | Fetch failure (DNS, HTTP, robots.txt, redirect cap). |
| 3 | Extraction or conversion failure (empty body, malformed HTML). |
| 4 | Output write failure (permissions, disk, atomic-rename clash). |
| 5 | Missing optional dependency (e.g. playwright not installed). |
| 64 | Usage or configuration error (bad flag, invalid value). |
| 130 | Interrupted by user (Ctrl-C). |
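Scripts that wrap the CLI can mirror this table in an enum so that branching on a result stays readable. The member names and both helper functions below are hypothetical; only the numeric values come from the table. The retry rule is an illustrative choice, not something pagetomd prescribes.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Numeric values from the exit-code table; the names are illustrative."""
    SUCCESS = 0
    INTERNAL_ERROR = 1
    FETCH_FAILURE = 2
    EXTRACTION_FAILURE = 3
    WRITE_FAILURE = 4
    MISSING_DEPENDENCY = 5
    USAGE_ERROR = 64
    INTERRUPTED = 130


def describe_exit(code: int) -> str:
    """Return a short human-readable label for an exit code."""
    try:
        return ExitCode(code).name.replace("_", " ").lower()
    except ValueError:
        return f"unknown exit code {code}"


def worth_retrying(code: int) -> bool:
    """Assumption: only fetch failures are likely to be transient."""
    return code == ExitCode.FETCH_FAILURE
```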
How it works
The pipeline runs five stages in sequence:
URL ──► Fetcher ───► Extractor ────► Converter ───► Postprocess ───► Writer
        (httpx /     (BS4 clean      (markdownify   (NFC, heading    (atomic
        playwright)  + trafilatura)  + GFM tables)  hierarchy,       file +
                                                    URL absolutise)  YAML)
The fetcher (httpx by default, playwright for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to trafilatura to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised markdownify subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).
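To make the postprocess stage more concrete, here is a rough sketch of two of its guarantees: NFC normalisation with zero-width stripping, and a heading hierarchy that never skips a level. It is not the project's implementation. The function names, the exact set of stripped characters, and the reading of "monotonic" as "no skipped levels" are all assumptions.

```python
import re
import unicodedata

# Assumed set of zero-width characters to drop; the real list may differ.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
HEADING = re.compile(r"^(#{1,6})(\s+.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example fence-safe


def normalise_text(text: str) -> str:
    """Apply NFC normalisation, then delete zero-width characters."""
    return unicodedata.normalize("NFC", text).translate(ZERO_WIDTH)


def fix_heading_levels(markdown: str) -> str:
    """Renumber ATX headings so each one is at most one level deeper than its parent."""
    out = []
    stack = []  # original levels of the currently open heading chain
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # leave '#' lines inside code blocks alone
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            line = "#" * len(stack) + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Under these assumptions, a page whose headings jump from H1 to H3 comes out with the H3 sections promoted to H2, while sibling headings stay siblings.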
Security
pagetomd is a public-URL-only tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default — and there is no flag to override that. Treat output files as having the same sensitivity as the URL they were fetched from.
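The address classes listed above correspond closely to properties exposed by Python's standard ipaddress module. The check below is a sketch of that idea, not pagetomd's actual guard: is_public_address is a hypothetical name, and a complete guard would also need to resolve hostnames and re-check every address after each redirect.

```python
import ipaddress


def is_public_address(ip: str) -> bool:
    """Return True only if `ip` falls in none of the refused address classes."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local   # also covers 169.254.169.254 (cloud metadata)
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )
```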
Quality gates
CI enforces both a project-wide test coverage floor of 85% and a per-module floor of 90% (line + branch combined) on the four critical modules — extractor, converter, writer, and postprocess. These four carry the AI-readiness contract, so they get the strictest coverage bar.
Contributing
git clone https://github.com/gs202/pagetomd.git
cd pagetomd
uv sync --extra dev --extra playwright
pre-commit install
uv run pytest
See CONTRIBUTING.md for the full contributor workflow.
License
Business Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See LICENSE for full terms.