Convert any webpage URL into clean, LLM-ready Markdown with frontmatter.

Maintainers

gs202

Project description

pagetomd

Why

AI-ready by default. Output is NFC-normalised UTF-8, single H1, monotonic heading hierarchy, no zero-width junk, no tracking chrome — drops straight into a vector store or LLM prompt.
Full-fidelity metadata. Every file ships with a YAML frontmatter block containing canonical URL, final URL after redirects, title, author, date, description, site name, language, and tool identity. No more "where did this Markdown come from?".
Static fast, JS-capable when needed. Default httpx fetcher is sub-second; opt-in playwright extra (or --fetcher auto) handles SPA shells without bloating the install for everyone else.
Stable, scriptable CLI. Typer-built, full env-var precedence (PAGETOMD_*), stable exit codes (0/1/2/3/4/5/64/130), structured logs (--log-json), and a --no-fetched-at switch for byte-deterministic output.
Not pandoc or curl + sed. pandoc doesn't fetch, doesn't strip boilerplate, and doesn't emit frontmatter. Hand-rolled curl | html2md pipelines re-invent extraction, mojibake handling, robots.txt, redirect caps, and atomic writes. pagetomd is one command for the whole pipeline.

Install

With pipx (recommended for CLI use)

pipx install pagetomd
# optional: enable JS rendering for SPAs
pipx inject --include-apps pagetomd playwright && playwright install chromium

With uv

uv tool install pagetomd

With pip

pip install pagetomd                 # core
pip install 'pagetomd[playwright]'   # + SPA support

Quick start

# Default: derives output filename from the page title
pagetomd https://example.com/blog/post

# Stream to stdout (pipe into LLMs, etc.)
pagetomd https://example.com/blog/post -o -

# Deterministic output (omits fetched_at — good for snapshot tests / RAG ingestion)
pagetomd https://example.com/blog/post --no-fetched-at -o post.md

# Auto-detect SPA pages and fall back to headless Chromium
pagetomd https://my-spa.example.com -o - --fetcher auto

Cookbook

Pipe into an LLM

-o - writes the Markdown to stdout. All logs go to stderr, so the stream is safe to pipe:

pagetomd https://example.com/blog/post -o - | llm "summarise this article in five bullet points"

Batch-convert from a file

while read -r url; do
  pagetomd "$url"
done < urls.txt

Each successful conversion exits 0. A failed conversion exits non-zero and reports the error on stderr, but the loop carries on with the next URL (see Exit codes below).
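When the per-URL exit codes need to be collected rather than just logged, the same loop can be sketched in Python. This is a minimal sketch under assumptions: `convert_all` and its `run` parameter are hypothetical names, not part of pagetomd, and the default runner simply shells out to the `pagetomd` command used above.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Optional


def convert_all(
    urls: Iterable[str],
    run: Optional[Callable[[List[str]], int]] = None,
) -> Dict[str, int]:
    """Convert each URL in turn and return a {url: exit_code} mapping."""
    if run is None:
        # Default runner (assumption): call the real CLI and report its exit status.
        def run(argv: List[str]) -> int:
            return subprocess.run(argv).returncode

    results: Dict[str, int] = {}
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines such as a trailing newline in urls.txt
        results[url] = run(["pagetomd", url])
    return results
```

Calling `convert_all(open("urls.txt"))` would then return a mapping whose non-zero values identify the URLs that failed. Passing a custom `run` makes the loop testable without network access.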

Output shape

Running pagetomd http://127.0.0.1:8765/blog.html --no-fetched-at -o - against the blog.html fixture prints (first ~15 lines shown):

---
url: http://127.0.0.1:8765/blog.html
final_url: http://127.0.0.1:8765/blog.html
title: Why We Rewrote Our Build System in Rust
author: Jane Doe
date: '2024-08-14'
description: A retrospective on migrating our monorepo build pipeline from Python to Rust, and what we learned along the way.
site_name: Example Engineering Blog
language: en
tool: pagetomd
tool_version: 0.1.0
---

# Why We Rewrote Our Build System in Rust

Three years ago, our monorepo build pipeline was a sprawling Python application held together with shell scripts and prayer. ...

When fetched_at is enabled (the default), an extra fetched_at: '2026-06-15T12:34:56Z' line is included in the frontmatter. Fields whose value cannot be detected (e.g. language, author) are omitted from the YAML.
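A downstream consumer usually needs to separate the frontmatter from the body. The sketch below does this with the standard library only. `split_frontmatter` is a hypothetical helper, and it assumes the flat `key: value` layout shown in the sample above; a real YAML parser is the safer choice if values can span lines or contain nested structures.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, body) for a document that starts with a '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all

    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the Markdown body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    return {}, text  # opening delimiter without a closing one
```

Because omitted fields are simply absent from the YAML, callers should read them with `meta.get("author")` rather than indexing.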

Common options

A compact overview — see pagetomd --help for the full list.

Flag	Default	Description
`--output / -o`	derived from title	Output path, or `-` for stdout.
`--overwrite`	`false`	Replace an existing destination file.
`--follow-symlinks / --no-follow-symlinks`	`false`	Allow writes to a symlinked destination. Off by default so `--overwrite` cannot be tricked into clobbering a file outside the intended directory via a symlink.
`--fetcher`	`httpx`	`httpx`, `playwright`, or `auto`.
`--timeout`	`30.0`	Per-request HTTP timeout (seconds).
`--retries`	`3`	Retry attempts on transient failures.
`--user-agent`	`pagetomd/<ver>`	Override the outbound `User-Agent`.
`--no-verify-ssl`	`false`	Disable TLS certificate verification (for corporate proxies that re-sign HTTPS).
`--respect-robots / --no-respect-robots`	`true`	Honour `robots.txt` (relaxed for loopback/RFC 1918).
`--max-redirects`	`10`	Cap on the redirect chain length.
`--include-comments / --no-include-comments`	`false`	Preserve HTML comments in the extracted document.
`--include-images / --no-include-images`	`true`	Keep image syntax in output.
`--include-links / --no-include-links`	`true`	Keep link URLs in output.
`--heading-style`	`atx`	`atx` (`#`) or `setext` (`===`).
`--code-fences / --no-code-fences`	`true`	Use fenced code blocks instead of indented ones.
`--wide-tables`	`kv`	Wide-table strategy: `kv`, `html`, or `drop`.
`--no-fetched-at`	`false`	Omit `fetched_at` for byte-deterministic output.
`--log-level`	`info`	`debug`, `info`, `warning`, `error`.
`--log-json`	`false`	Emit logs as JSON lines on stderr.
`--debug`	`false`	Shortcut for `--log-level=debug` + tracebacks on error.
`--playwright-idle-ms`	`500`	Extra wait (ms) after networkidle for late-firing scripts (Playwright fetcher only).
`--version`	—	Print the installed version and exit.

Environment variables

Every flag has a PAGETOMD_<UPPER_NAME> equivalent. For example:

PAGETOMD_TIMEOUT=60 PAGETOMD_FETCHER=auto pagetomd https://example.com

CLI flags always override env vars; env vars override the built-in defaults.
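The precedence rule can be illustrated with a few lines of Python. This is only a sketch of the ordering described above, not pagetomd's own code: `resolve_option` is a hypothetical name, and mapping hyphens in a flag name to underscores in the variable name is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_option(
    name: str,
    cli_value: Optional[str],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a value in the order: CLI flag, then environment, then default."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value  # an explicit flag always wins
    # Assumed naming scheme: --max-redirects -> PAGETOMD_MAX_REDIRECTS
    env_key = "PAGETOMD_" + name.upper().replace("-", "_")
    if env_key in env:
        return env[env_key]
    return default
```

The sketch returns strings and leaves type conversion (for example, turning `"60"` into a float timeout) to the caller.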

Exit codes

Code	Meaning
`0`	Success.
`1`	Unexpected internal error.
`2`	Fetch failure (DNS, HTTP, robots.txt, redirect cap).
`3`	Extraction or conversion failure (empty body, malformed HTML).
`4`	Output write failure (permissions, disk, atomic-rename clash).
`5`	Missing optional dependency (e.g. `playwright` not installed).
`64`	Usage or configuration error (bad flag, invalid value).
`130`	Interrupted by user (Ctrl-C).
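Scripts that wrap the CLI may find it convenient to give these codes names. The enum below mirrors the table above; the member names and the `describe` helper are hypothetical and chosen for illustration only.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above; the member names are illustrative."""
    SUCCESS = 0
    INTERNAL_ERROR = 1
    FETCH_FAILURE = 2
    EXTRACTION_FAILURE = 3
    WRITE_FAILURE = 4
    MISSING_DEPENDENCY = 5
    USAGE_ERROR = 64
    INTERRUPTED = 130


def describe(returncode: int) -> str:
    """Translate a process return code into a short label."""
    try:
        return ExitCode(returncode).name.lower()
    except ValueError:
        return "unknown"  # a code that the table does not list
```

A wrapper could, for instance, log `describe(proc.returncode)` next to the URL that was being converted.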

How it works

The pipeline runs in five stages:

URL ──► Fetcher ──► Extractor ──► Converter ──► Postprocess ──► Writer
       (httpx /     (BS4 clean    (markdownify    (NFC, heading   (atomic
        playwright)  + trafilatura) + GFM tables)  hierarchy,      file +
                                                  URL absolutise)  YAML)

The fetcher (httpx by default, playwright for SPAs) downloads the page with retries and robots.txt enforcement. The extractor runs a BeautifulSoup pre-clean pass (strip scripts/styles/nav/ads) then hands the cleaned tree to trafilatura to identify main content and harvest metadata. The converter renders the surviving HTML to Markdown via a customised markdownify subclass (ATX headings, fenced code blocks with language hints, GFM tables with wide-table fallbacks). The postprocessor enforces the AI-readiness contract (NFC, zero-width strip, monotonic heading hierarchy, absolute URLs). The writer prepends a YAML frontmatter block and writes atomically (or streams to stdout).
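To make the postprocessing step more concrete, here is a rough sketch of what NFC normalisation, zero-width stripping, and heading repair could look like. It is not pagetomd's implementation. The function names, the set of code points treated as zero-width junk, and the reading of "monotonic" as "a heading may go at most one level deeper than the previous heading" are all assumptions.

```python
import re
import unicodedata

# Assumed junk set: zero-width space, word joiner, and BOM. Zero-width joiners
# are left alone here because emoji and some scripts rely on them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))

HEADING = re.compile(r"^(#{1,6})(\s+.*)$")


def clean_text(text: str) -> str:
    """Normalise to NFC and delete the assumed zero-width characters."""
    return unicodedata.normalize("NFC", text).translate(ZERO_WIDTH)


def fix_heading_levels(markdown: str) -> str:
    """Clamp ATX headings so no heading skips a level on the way down."""
    out = []
    prev = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not touch '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            level = min(len(match.group(1)), prev + 1)
            prev = level
            line = "#" * level + match.group(2)
        out.append(line)
    return "\n".join(out)
```

With these definitions, a document whose headings run `#`, `###`, `##` comes out as `#`, `##`, `##`.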

Security

pagetomd is a public-URL-only tool. It refuses to fetch private, loopback, link-local, multicast, reserved, or cloud-metadata addresses by default — and there is no flag to override that. Treat output files as having the same sensitivity as the URL they were fetched from.
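The kind of address check described here can be approximated with Python's standard `ipaddress` module. This is a sketch, not the tool's actual guard: `is_public_ip` is a hypothetical name, and a complete check would also have to resolve hostnames and re-validate every redirect target.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """Return True only for addresses outside the blocked categories."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254, a common cloud-metadata address
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

For example, `is_public_ip("8.8.8.8")` is true, while `is_public_ip("169.254.169.254")` is false because the address is link-local.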

Quality gates

CI enforces both a project-wide test coverage floor of 85% and a per-module floor of 90% (line + branch combined) on the four critical modules — extractor, converter, writer, and postprocess. These four carry the AI-readiness contract, so they get the strictest coverage bar.

Contributing

git clone https://github.com/gs202/pagetomd.git
cd pagetomd
uv sync --extra dev --extra playwright
pre-commit install
uv run pytest

See CONTRIBUTING.md for the full contributor workflow.

License

Business Source License 1.1 — source-available, free for non-commercial use. Converts to MIT on 2030-06-16. See LICENSE for full terms.

Release history

0.2.0 (Jun 17, 2026)

0.1.0 (Jun 17, 2026), this version

Download files

Source Distribution

pagetomd-0.1.0.tar.gz (47.9 kB)

Uploaded Jun 17, 2026 Source

Built Distribution

pagetomd-0.1.0-py3-none-any.whl (53.8 kB)

Uploaded Jun 17, 2026 Python 3

File details

Details for the file pagetomd-0.1.0.tar.gz.

File metadata

Download URL: pagetomd-0.1.0.tar.gz
Upload date: Jun 17, 2026
Size: 47.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pagetomd-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ff254fa8281e7c419730c5190b9c9b032c4813350ba42d1f66843156eeebccf4`
MD5	`bab344aeff3504975a9cf45028b4133e`
BLAKE2b-256	`932f2e84d1502cb766bbb3d3f4730738c927dccaf573b3fb78c05da370edea9a`
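A downloaded file can be checked against the SHA256 digest listed above with the standard `hashlib` module. The helper names `sha256_of` and `verify_sha256` below are hypothetical, and the sketch assumes the file has already been saved locally.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring letter case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `expected_hex` would be the SHA256 value from the table above.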

Provenance

The following attestation bundles were made for pagetomd-0.1.0.tar.gz:

Publisher: release.yml on gs202/PageToMD

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pagetomd-0.1.0.tar.gz
- Subject digest: ff254fa8281e7c419730c5190b9c9b032c4813350ba42d1f66843156eeebccf4
- Sigstore transparency entry: 1846017228
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: gs202/PageToMD@eb28859559ff79687c96693455c23a058c985ffa
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/gs202
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@eb28859559ff79687c96693455c23a058c985ffa
- Trigger Event: push

File details

Details for the file pagetomd-0.1.0-py3-none-any.whl.

File metadata

Download URL: pagetomd-0.1.0-py3-none-any.whl
Upload date: Jun 17, 2026
Size: 53.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pagetomd-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f11459bcbc8816fc11db100b26cb27455cbea411598d7d45a7225adbb1e5480`
MD5	`17a612d4330a92b942fb755ed1a1022f`
BLAKE2b-256	`c26d89a1925f1217161cbfdd06cc5226bb7344a1ebebda610a3191aae43f71a7`

Provenance

The following attestation bundles were made for pagetomd-0.1.0-py3-none-any.whl:

Publisher: release.yml on gs202/PageToMD

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pagetomd-0.1.0-py3-none-any.whl
- Subject digest: 9f11459bcbc8816fc11db100b26cb27455cbea411598d7d45a7225adbb1e5480
- Sigstore transparency entry: 1846017522
- Sigstore integration time: Jun 17, 2026
Source repository:
- Permalink: gs202/PageToMD@eb28859559ff79687c96693455c23a058c985ffa
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/gs202
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@eb28859559ff79687c96693455c23a058c985ffa
- Trigger Event: push
