pagetomd

Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.

Why

  • AI-ready by default. Output is NFC-normalised UTF-8, single H1, monotonic heading hierarchy, no zero-width junk, no tracking chrome — drops straight into a vector store or LLM prompt.
  • Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".
  • Static fast, JS-capable when needed. Default httpx fetcher is sub-second; opt-in playwright extra (or --fetcher auto) handles SPA shells without bloating the install for everyone else.
  • Stable, scriptable CLI. Typer-built, full env-var precedence (PAGETOMD_*), stable exit codes (0/2/3/4/5/64/130), structured logs (--log-json), and a --no-fetched-at switch for byte-deterministic output.
  • Not pandoc or curl + sed. pandoc doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled curl | html2md pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. pagetomd is one command for the whole pipeline.

Install

With pipx (recommended for CLI use)

pipx install pagetomd
# optional: enable JS rendering for SPAs
# (--include-apps exposes the injected playwright CLI on PATH)
pipx inject --include-apps pagetomd playwright && playwright install chromium

With uv

uv tool install pagetomd

With pip

pip install pagetomd                 # core
pip install 'pagetomd[playwright]'   # + SPA support

Quick start

# Default: derives output filename from the page title
pagetomd https://example.com/blog/post

# Stream to stdout (pipe into LLMs, etc.)
pagetomd https://example.com/blog/post -o -

# Deterministic output (omits fetched_at — good for snapshot tests / RAG ingestion)
pagetomd https://example.com/blog/post --no-fetched-at -o post.md

# Auto-detect SPA pages and fall back to headless Chromium
pagetomd https://my-spa.example.com -o - --fetcher auto

Cookbook

Pipe into an LLM

-o - writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:

pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"

Batch-convert from a file

while read -r url; do
  pagetomd "$url"
done < urls.txt

Each successful conversion exits 0. A failed conversion exits non-zero and reports the error on stderr, but the loop carries on with the next URL (see Exit codes below).
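
To record which URLs failed, the same loop can be driven from Python. This is a minimal sketch and not part of pagetomd: convert_all and run_pagetomd are hypothetical helper names, and the only behaviour assumed of the CLI is what this README states, namely that it exits 0 on success and non-zero otherwise.

```python
import subprocess
from typing import Callable, Dict, Iterable


def run_pagetomd(url: str) -> int:
    """Run the pagetomd CLI for one URL and return its exit code."""
    return subprocess.run(["pagetomd", url], check=False).returncode


def convert_all(
    urls: Iterable[str],
    run: Callable[[str], int] = run_pagetomd,
) -> Dict[str, int]:
    """Convert every URL and return a {url: exit_code} map of the failures."""
    failures: Dict[str, int] = {}
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines in the input file
        code = run(url)
        if code != 0:
            failures[url] = code
    return failures
```

Passing an open file works as well, for example convert_all(open("urls.txt")). The run parameter exists only so the loop can be exercised without touching the network.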

Output shape

Running pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o - against the blog.html fixture prints (first ~15 lines shown):

---
url: http://127.0.0.1:8765/blog.html
final_url: http://127.0.0.1:8765/blog.html
title: Why We Rewrote Our Build System in Rust
author: Jane Doe
date: '2024-08-14'
description: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.
site_name: Example Engineering Blog
language: en
tool: pagetomd
tool_version: 0.1.0
---

# Why We Rewrote Our Build System in Rust

Three years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...

When fetched_at is enabled (the default), an extra fetched_at: '2026-06-15T12:34:56Z' line is included in the frontmatter. Fields whose value cannot be detected (e.g. language, author) are omitted from the YAML.
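
A downstream consumer usually wants the metadata and the body separately. The sketch below splits the two using only the standard library. It assumes the flat key: value layout shown above and is not a general YAML parser; split_frontmatter is a hypothetical helper name, and a real YAML library is the safer choice for anything beyond simple scalar fields.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a document into (metadata, body).

    Only handles flat `key: value` lines between two `---` fences.
    Returns ({}, text) when no complete frontmatter block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text

    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    return {}, text  # opening fence without a closing one
```

Because absent fields are omitted from the YAML, callers should read optional keys with meta.get("author") rather than indexing directly.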

Common options

A compact overview — see pagetomd --help for the full list.

| Flag | Default | Description |
| --- | --- | --- |
| --output / -o | derived from title | Output path, or - for stdout. |
| --overwrite | false | Replace an existing destination file. |
| --follow-symlinks / --no-follow-symlinks | false | Allow writes to a symlinked destination. Off by default so --overwrite cannot be tricked into clobbering a file outside the intended directory via a symlink. |
| --fetcher | httpx | httpx, playwright, or auto. |
| --timeout | 30.0 | Per-request HTTP timeout (seconds). |
| --retries | 3 | Retry attempts on transient failures. |
| --user-agent | pagetomd/<ver> | Override the outbound User-Agent. |
| --no-verify-ssl | false | Disable TLS certificate verification (for corporate proxies that re-sign HTTPS). |
| --respect-robots / --no-respect-robots | true | Honour robots.txt (relaxed for loopback/RFC 1918). |
| --max-redirects | 10 | Cap on the redirect chain length. |
| --include-comments / --no-include-comments | false | Preserve HTML comments in the extracted document. |
| --include-images / --no-include-images | true | Keep image syntax in output. |
| --include-links / --no-include-links | true | Keep link URLs in output. |
| --heading-style | atx | atx (#) or setext (===). |
| --code-fences / --no-code-fences | true | Use fenced code blocks instead of indented ones. |
| --wide-tables | kv | Wide-table strategy: kv, html, or drop. |
| --no-fetched-at | false | Omit fetched_at for byte-deterministic output. |
| --log-level | info | debug, info, warning, error. |
| --log-json | false | Emit logs as JSON lines on stderr. |
| --debug | false | Shortcut for --log-level=debug + tracebacks on error. |
| --playwright-idle-ms | 500 | Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only). |
| --version | | Print the installed version and exit. |

Environment variables

Every flag has a PAGETOMD_<UPPER_NAME> equivalent. For example:

PAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com

CLI flags always override env vars; env vars override the built-in defaults.
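
The precedence rule can be pictured as a three-step lookup. The sketch below illustrates the rule only and is not pagetomd's own code: resolve_option is a hypothetical name, and the mapping from a flag name to its variable (upper-case, dashes turned into underscores) is an assumption based on the PAGETOMD_TIMEOUT and PAGETOMD_FETCHER examples above.

```python
import os
from typing import Mapping, Optional


def resolve_option(
    name: str,
    cli_value: Optional[str],
    default: str,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the effective value: CLI flag, then env var, then default."""
    env = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value  # an explicit flag always wins
    # Assumed naming scheme: --max-redirects -> PAGETOMD_MAX_REDIRECTS
    env_key = "PAGETOMD_" + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key]
    return default
```

With PAGETOMD_TIMEOUT=60 set and no --timeout flag, the lookup yields "60"; adding --timeout 10 on the command line yields "10".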

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success. |
| 1 | Unexpected internal error. |
| 2 | Fetch failure (DNS, HTTP, robots.txt, redirect cap). |
| 3 | Extraction or conversion failure (empty body, malformed HTML). |
| 4 | Output write failure (permissions, disk, atomic-rename clash). |
| 5 | Missing optional dependency (e.g. playwright not installed). |
| 64 | Usage or configuration error (bad flag, invalid value). |
| 130 | Interrupted by user (Ctrl-C). |
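
Scripts that wrap the CLI may find it convenient to give these codes names. The enum below mirrors the table above; the member names and the describe_exit helper are hypothetical and chosen for illustration, since pagetomd is not documented here as exporting any such type.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the table above (names are illustrative)."""

    SUCCESS = 0
    INTERNAL_ERROR = 1
    FETCH_FAILURE = 2
    EXTRACTION_FAILURE = 3
    WRITE_FAILURE = 4
    MISSING_DEPENDENCY = 5
    USAGE_ERROR = 64
    INTERRUPTED = 130


def describe_exit(code: int) -> str:
    """Turn a numeric exit code into a short human-readable label."""
    try:
        return ExitCode(code).name.lower().replace("_", " ")
    except ValueError:
        return f"unknown exit code {code}"
```

A wrapper might, for instance, retry only on ExitCode.FETCH_FAILURE and stop immediately on ExitCode.USAGE_ERROR; that policy is a design choice for the caller, not something the tool prescribes.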

How it works

The pipeline runs in five stages:

URL ──► Fetcher ──► Extractor ──► Converter ──► Postprocess ──► Writer
       (httpx /     (BS4 clean    (markdownify    (NFC, heading   (atomic
        playwright)  + trafilatura) + GFM tables)  hierarchy,      file +
                                                  URL absolutise)  YAML)

The fetcher (httpx by default, playwright for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to trafilatura to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised markdownify subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).
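
To make the postprocess stage concrete, here is a rough standard-library sketch of two of its guarantees: NFC normalisation with zero-width stripping, and a monotonic heading hierarchy. It is not pagetomd's implementation. The function names are hypothetical, the set of stripped code points is an assumption, and "monotonic" is read here as "no heading sits more than one level below the heading before it".

```python
import re
import unicodedata

# Assumed set: zero-width space, word joiner, BOM / zero-width no-break space.
ZERO_WIDTH = {ord(ch): None for ch in "\u200b\u2060\ufeff"}
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def normalise_text(text: str) -> str:
    """Compose to NFC, then delete the zero-width code points listed above."""
    return unicodedata.normalize("NFC", text).translate(ZERO_WIDTH)


def clamp_headings(markdown: str) -> str:
    """Pull ATX headings up so no level is skipped on the way down."""
    out = []
    prev_level = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # leave fenced code untouched
        elif not in_fence:
            match = HEADING.match(line)
            if match:
                level = min(len(match.group(1)), prev_level + 1)
                prev_level = level
                line = "#" * level + " " + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Given "# A", "### B", "## C" on successive lines, clamp_headings rewrites the second heading to "## B" and leaves the others alone. Lines that start with # inside a fenced block, such as shell comments, are not treated as headings.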

Security

pagetomd is a public-URL-only tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default — and there is no flag to override that. Treat output files as having the same sensitivity as the URL they were fetched from.
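
The address classes named above map closely onto predicates in Python's ipaddress module. The sketch below shows one way such a check could look; it is an illustration rather than pagetomd's own logic, and is_public_address is a hypothetical name. It takes an IP literal, so a hostname would first need to be resolved and every resolved address checked.

```python
import ipaddress


def is_public_address(ip_text: str) -> bool:
    """Return True only for addresses outside the blocked classes."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, a common metadata address
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

For example, is_public_address("8.8.8.8") is true, while "127.0.0.1", "10.0.0.1", "169.254.169.254", and "::1" are all rejected.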

Quality gates

CI enforces both a project-wide test coverage floor of 85% and a per-module floor of 90% (line + branch combined) on the four critical modules — extractor, converter, writer, and postprocess. These four carry the AI-readiness contract, so they get the strictest coverage bar.

Contributing

git clone https://github.com/gs202/pagetomd.git
cd pagetomd
uv sync --extra dev --extra playwright
pre-commit install
uv run pytest

See CONTRIBUTING.md for the full contributor workflow.

License

Business Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See LICENSE for full terms.
