webtomd

Web to Markdown. No garbage.

Convert any URL into clean, structured Markdown. In seconds, from your terminal.

pip install webtomd
webtomd https://example.com/article

That's it. Clean .md file, saved to your current directory.

Who is this for?

  • Developers building RAG pipelines, training datasets, or knowledge bases from web content
  • Technical writers pulling reference material from docs, wikis, and blogs into Markdown
  • Researchers archiving web articles in a portable, version-controllable format
  • AI engineers feeding clean web content into LLM prompts without HTML noise
  • Anyone who's ever done "view source, copy, clean up for 20 minutes" and wished there was a better way

Why webtomd?

Most web-to-markdown tools give you a wall of nav links, cookie banners, and broken formatting. webtomd doesn't.

  • Actually clean output ➜ not just converted HTML, but intelligently extracted content with navs, sidebars, and ads stripped out
  • Works on real websites ➜ handles JS-rendered SPAs, paywalled layouts, complex tables, and nested lists
  • One command, zero config ➜ no browser extensions, no copy-paste, no manual cleanup
  • Plugs into your workflow ➜ pipes, batch files, stdin, clipboard, AI post-processing, all from the terminal
  • Your AI provider, your choice ➜ works with OpenAI, Anthropic, Gemini, Groq, or local Ollama

Works on Windows, macOS, and Linux. Requires Python 3.11+.

Install

pip (all platforms)

pip install webtomd

uv (recommended, faster)

uv pip install webtomd

pipx (isolated global install)

pipx install webtomd

Optional extras

# AI provider support
pip install "webtomd[openai]"
pip install "webtomd[anthropic]"
pip install "webtomd[gemini]"
pip install "webtomd[groq]"
pip install "webtomd[ai-all]"

# JS-rendered page support (SPAs, React/Vue/Next.js sites)
pip install "webtomd[playwright]"
playwright install chromium

Verify installation

webtomd --help

If webtomd isn't found in your PATH, you can always run it as a module:

python -m webtomd --help

Features

  • Smart extraction: trafilatura + readability fallback chain with quality scoring
  • JS-rendered pages: optional Playwright fallback for SPAs
  • AI modes: summarize, translate, extract, Q&A via Anthropic / OpenAI / Gemini / Groq / Ollama
  • Batch processing: convert a file of URLs in one command with progress bar
  • CSS selectors: target specific page sections
  • YAML frontmatter: title, URL, date metadata
  • Auto-save: interactive terminals save files; piped runs output to stdout
  • Smart filenames: deterministic or AI-assisted naming
  • Clipboard: copy output with --copy
  • stdin support: pipe HTML directly
  • Recursive crawl: --depth N discovers and converts same-domain linked pages
  • Clean output: strips nav, sidebars, cookie banners, CSS noise, duplicate content
  • Cross-platform: Windows, macOS, Linux with encoding-safe output

Usage

Basic conversion

# Auto-saves .md file in interactive terminals
webtomd https://example.com/article

# Save to a specific file
webtomd https://example.com/article -o article.md

# Force output to terminal
webtomd https://example.com/article --stdout

Selectors and metadata

# Extract only content inside a CSS selector
webtomd https://example.com --selector "main"
webtomd https://example.com --selector "article .content"

# Add YAML frontmatter (title, url, date)
webtomd https://example.com --metadata

AI post-processing

webtomd https://example.com --ai summarize
webtomd https://example.com --ai "tl;dr"
webtomd https://example.com --ai translate
webtomd https://example.com --ai extract
webtomd https://example.com --ai qa

Batch and crawl

# Batch: convert a list of URLs
webtomd --batch urls.txt

# Crawl: recursively discover and convert same-domain links
webtomd https://example.com --depth 2
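
The idea behind a depth-limited, same-domain crawl can be sketched in a few lines of Python. This is an illustration only, not webtomd's actual implementation: `crawl_order` and the `get_links` callback are hypothetical names, and treating the start URL as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_order(start_url, get_links, max_depth):
    """Breadth-first list of same-domain URLs reachable within max_depth hops.

    get_links(url) is a hypothetical callback returning the hrefs found on a page.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])  # assumption: the start URL is depth 0
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links against the page URL
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Links to other domains are dropped by the `netloc` comparison, and the `seen` set keeps a page from being visited twice when pages link back to each other.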

Stdin (pipe HTML directly)

macOS / Linux:

curl -s https://example.com | webtomd - --stdout
curl -s https://example.com | webtomd --stdout

Windows (PowerShell):

(Invoke-WebRequest https://example.com).Content | python -m webtomd - --stdout

Other options

# Copy result to clipboard
webtomd https://example.com --copy

# Open in default editor after saving
webtomd https://example.com --open

# Silent mode (no spinners, no preview, pipe-safe)
webtomd https://example.com --silent -o out.md

# Filename strategy
webtomd https://example.com --name-strategy deterministic
webtomd https://example.com --name-strategy ai
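
For pipelines that call webtomd from Python, the commands above can be wrapped with `subprocess`. This is a minimal sketch: `build_command` and `fetch_markdown` are hypothetical helper names, only flags shown in this README are used, and it assumes webtomd exits with a non-zero status when a conversion fails.

```python
import subprocess


def build_command(url, selector=None, metadata=False):
    """Build a webtomd argument list using only flags shown in this README."""
    cmd = ["webtomd", url, "--stdout"]
    if selector:
        cmd += ["--selector", selector]
    if metadata:
        cmd.append("--metadata")
    return cmd


def fetch_markdown(url, **options):
    """Run webtomd and return the Markdown it prints to stdout."""
    result = subprocess.run(
        build_command(url, **options),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # assumption: a failed conversion returns a non-zero exit code
    )
    return result.stdout
```

With webtomd installed, `fetch_markdown("https://example.com/article", selector="main")` would return the converted page as a string.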

AI Setup

Set your provider's API key as an environment variable.

macOS / Linux (bash/zsh):

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=...
export GROQ_API_KEY=gsk_...
export OLLAMA_HOST=http://localhost:11434

Windows (PowerShell):

$env:OPENAI_API_KEY = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GEMINI_API_KEY = "..."
$env:GROQ_API_KEY = "gsk_..."
$env:OLLAMA_HOST = "http://localhost:11434"

Windows (Command Prompt):

set OPENAI_API_KEY=sk-...
set ANTHROPIC_API_KEY=sk-ant-...

To persist across sessions, add these to your shell profile (~/.bashrc, ~/.zshrc) or set them via Windows System Environment Variables.

Or use the interactive setup wizard (writes to ~/.webtomdrc):

webtomd --configure

The first available key is auto-detected in priority order: Anthropic > OpenAI > Gemini > Groq > Ollama.

If no key is configured, --ai modes gracefully fall back to plain Markdown output with a friendly message. Nothing breaks.
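
The priority order and the no-key fallback can be pictured as a simple first-match lookup. This is a sketch under assumptions, not webtomd's code: `detect_provider` is a hypothetical name, and it assumes `OLLAMA_HOST` is the variable that marks Ollama as available.

```python
import os

# Environment variables in the priority order stated above.
PROVIDER_ENV_VARS = [
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("groq", "GROQ_API_KEY"),
    ("ollama", "OLLAMA_HOST"),  # assumption: a host setting counts as "configured"
]


def detect_provider(env=None):
    """Return the first provider whose variable is set and non-empty, else None."""
    env = os.environ if env is None else env
    for provider, variable in PROVIDER_ENV_VARS:
        if env.get(variable):
            return provider
    return None  # caller falls back to plain Markdown output
```

With both `OPENAI_API_KEY` and `GROQ_API_KEY` set, this lookup picks `openai`, because OpenAI comes earlier in the list.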

Configuration

Create ~/.webtomdrc (TOML format) for persistent defaults:

output_dir = "~/Documents/webtomd"
copy = false
metadata = false
silent = false
name_strategy = "deterministic"
ai_provider = "openai"

CLI flags always override config file values.

Location: ~/.webtomdrc resolves to:

  • Linux: /home/yourname/.webtomdrc
  • macOS: /Users/yourname/.webtomdrc
  • Windows: C:\Users\YourName\.webtomdrc

Batch Mode

Create a text file with one URL per line (# comments supported):

# My reading list
https://example.com/article-1
https://example.com/article-2
https://example.com/article-3

Then run:

webtomd --batch urls.txt

Each URL is processed independently with a live progress bar. Failures don't abort the batch. A summary is printed at the end.
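
The behavior described here (comment lines skipped, each URL handled independently, failures collected rather than fatal) could be sketched as follows. `read_urls`, `run_batch`, and the `convert` callback are hypothetical names, and only whole-line `#` comments are handled because the README does not say whether trailing comments are supported.

```python
def read_urls(text):
    """Return URLs from batch-file text, skipping blank lines and # comment lines."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def run_batch(urls, convert):
    """Call convert(url) for each URL and collect failures instead of aborting."""
    succeeded, failed = [], []
    for url in urls:
        try:
            convert(url)
            succeeded.append(url)
        except Exception as exc:  # one bad URL must not stop the rest
            failed.append((url, str(exc)))
    return succeeded, failed
```

The two returned lists hold everything needed to print an end-of-run summary.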

Output Defaults

Context                    Behavior
Interactive terminal       Auto-saves .md file with generated name
Piped / non-interactive    Prints Markdown to stdout
-o file.md                 Saves to the specified file
--stdout                   Forces stdout in any context
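
These rules can be read as a small decision function. The sketch below is illustrative, not webtomd's implementation: `choose_destination` is a hypothetical name, and giving `--stdout` precedence over `-o` is an assumption based on the phrase "in any context".

```python
import sys


def choose_destination(output_path=None, force_stdout=False, is_tty=None):
    """Decide where the Markdown goes, following the rules listed above."""
    if is_tty is None:
        is_tty = sys.stdout.isatty()  # True in an interactive terminal
    if force_stdout:
        return "stdout"  # assumption: --stdout wins over -o
    if output_path:
        return f"file:{output_path}"
    # No explicit flag: auto-save in a terminal, print when piped.
    return "auto-save" if is_tty else "stdout"
```

The `isatty()` check is what separates an interactive run from a piped one such as `webtomd URL | less`.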

Troubleshooting

webtomd command not found:

  • Ensure your Python Scripts (Windows) or bin (macOS/Linux) directory is in your PATH
  • Alternative: python -m webtomd

Encoding errors on Windows:

  • webtomd handles UTF-8 output automatically, but if your terminal shows garbled characters, run chcp 65001 first or use Windows Terminal (recommended over cmd.exe)

Playwright not installing:

  • Run playwright install chromium after installing the playwright extra
  • On Linux, you may need system deps: playwright install-deps chromium

Clipboard not working:

  • macOS: works out of the box (pbcopy)
  • Linux: install xclip or xsel (sudo apt install xclip)
  • Windows: works out of the box

Slow conversion on certain sites:

  • Some sites throttle or block automated requests, so the delay comes from the network rather than from webtomd
  • Try --selector "main" to skip heavy page processing

Contributing

git clone https://github.com/MrRaccooon/WebToMD.git
cd WebToMD

Setup (all platforms):

pip install uv         # if you don't have uv
uv sync --extra dev
uv run pytest

Run lints:

uv run ruff check .

Run type checks:

uv run mypy webtomd/

License

GPL-3.0-or-later
