Web to Markdown. No garbage.
Project description
webtomd
Web to Markdown. No garbage.
Convert any URL into clean, structured Markdown. In seconds, from your terminal.
pip install webtomd
webtomd https://example.com/article
That's it. Clean .md file, saved to your current directory.
Who is this for?
- Developers building RAG pipelines, training datasets, or knowledge bases from web content
- Designers extracting design systems, color palettes, and typography from live sites
- Technical writers pulling reference material from docs, wikis, and blogs into Markdown
- Researchers archiving web articles in a portable, version-controllable format
- AI engineers feeding clean web content into LLM prompts without HTML noise
- Anyone who's ever done "view source, copy, clean up for 20 minutes" and wished there was a better way
Why webtomd?
Most web-to-markdown tools give you a wall of nav links, cookie banners, and broken formatting. webtomd doesn't.
- Actually clean output ➜ not just converted HTML, but intelligently extracted content with navs, sidebars, and ads stripped out
- Works on real websites ➜ handles JS-rendered SPAs (React, Next.js, Vue), complex tables, code blocks, and nested lists — Playwright kicks in automatically when static extraction fails
- Design system extraction ➜
--uipulls colors, typography, spacing, components from any site as markdown, JSON tokens, or HTML report - One command, zero config ➜ no browser extensions, no copy-paste, no manual cleanup
- Plugs into your workflow ➜ pipes, batch files, stdin, clipboard, AI post-processing, all from the terminal
- Your AI provider, your choice ➜ works with OpenAI, Anthropic, Gemini, Groq, or local Ollama
Works on Windows, macOS, and Linux. Python 3.11+.
Install
pip (all platforms)
pip install webtomd
uv (recommended, faster)
uv pip install webtomd
pipx (isolated global install)
pipx install webtomd
Optional extras
# JS-rendered pages + UI extraction (recommended)
pip install "webtomd[playwright]"
playwright install chromium
# AI provider support
pip install "webtomd[openai]"
pip install "webtomd[anthropic]"
pip install "webtomd[gemini]"
pip install "webtomd[groq]"
pip install "webtomd[ai-all]"
Verify installation
webtomd --help
If webtomd isn't found in your PATH, you can always run it as a module:
python -m webtomd --help
Features
- Smart extraction: trafilatura + readability + automatic Playwright fallback chain with quality scoring
- JS-rendered pages: headless Chromium triggers automatically when static extraction fails — React, Next.js, Vue, Stripe, Notion all work
- UI design extraction:
--uiextracts colors, typography, spacing, components, CSS variables, and layout patterns from any site - Multiple UI formats:
--format markdown(default),--format tokens(JSON design tokens),--format html(visual report) - AI modes: summarize, translate, extract, Q&A, design rationale via Anthropic / OpenAI / Gemini / Groq / Ollama
- Batch processing: convert a file of URLs in one command with progress bar
- Recursive crawl:
--depth Ndiscovers and converts same-domain linked pages - CSS selectors: target specific page sections
- YAML frontmatter: title, URL, date metadata
- Auto-save: interactive terminals save files; piped runs output to stdout
- Smart filenames: deterministic or AI-assisted naming
- Clipboard: copy output with
--copy - stdin support: pipe HTML directly
- Clean output: strips nav, sidebars, cookie banners, CSS noise, duplicate content — scopes to
<main>/<article>when present - Cross-platform: Windows, macOS, Linux with encoding-safe output
Usage
Basic conversion
# Auto-saves .md file in interactive terminals
webtomd https://example.com/article
# Save to a specific file
webtomd https://example.com/article -o article.md
# Force output to terminal
webtomd https://example.com/article --stdout
UI design extraction
# Extract design system as markdown style guide
webtomd https://stripe.com --ui
# Output as JSON design tokens (DTCG format)
webtomd https://linear.app --ui --format tokens
# Output as self-contained visual HTML report
webtomd https://vercel.com --ui --format html
# AI-enhanced with design rationale + accessibility recommendations
webtomd https://notion.so --ui --ai
Selectors and metadata
# Extract only content inside a CSS selector
webtomd https://example.com --selector "main"
webtomd https://example.com --selector "article .content"
# Add YAML frontmatter (title, url, date)
webtomd https://example.com --metadata
AI post-processing
webtomd https://example.com --ai summarize
webtomd https://example.com --ai "tl;dr"
webtomd https://example.com --ai translate
webtomd https://example.com --ai extract
webtomd https://example.com --ai qa
Batch and crawl
# Batch: convert a list of URLs
webtomd --batch urls.txt
# Crawl: recursively discover and convert same-domain links
webtomd https://example.com --depth 2
Stdin (pipe HTML directly)
macOS / Linux:
curl -s https://example.com | webtomd - --stdout
curl -s https://example.com | webtomd --stdout
Windows (PowerShell):
(Invoke-WebRequest https://example.com).Content | python -m webtomd - --stdout
Other options
# Copy result to clipboard
webtomd https://example.com --copy
# Open in default editor after saving
webtomd https://example.com --open
# Silent mode (no spinners, no preview, pipe-safe)
webtomd https://example.com --silent -o out.md
# Filename strategy
webtomd https://example.com --name-strategy deterministic
webtomd https://example.com --name-strategy ai
UI Extraction
The --ui flag uses Playwright to render a page and extract its complete design system:
| What it extracts | Details |
|---|---|
| Color palette | Grouped by role (dark/light/accent/transparent) with usage counts |
| Typography | Font stacks, sizes, weights, role labels (Display, H1-H6, Body, Caption) |
| Spacing scale | Auto-detected grid base (4px, 8px) with all spacing values |
| CSS variables | Categorized into color, spacing, typography, and other tokens |
| Components | Buttons (with variants), inputs, navigation links |
| Layout patterns | Flexbox/grid patterns, border radii, shadows |
| Design philosophy | Auto-detected theme, accent color, typeface count |
Three output formats:
webtomd https://stripe.com --ui # Markdown style guide (default)
webtomd https://stripe.com --ui --format tokens # JSON design tokens (DTCG)
webtomd https://stripe.com --ui --format html # Visual HTML report
Combine with --ai for design rationale, accessibility recommendations, and theme suggestions.
AI Setup
Set your provider's API key as an environment variable.
macOS / Linux (bash/zsh):
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=...
export GROQ_API_KEY=gsk_...
export OLLAMA_HOST=http://localhost:11434
Windows (PowerShell):
$env:OPENAI_API_KEY = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GEMINI_API_KEY = "..."
$env:GROQ_API_KEY = "gsk_..."
$env:OLLAMA_HOST = "http://localhost:11434"
Windows (Command Prompt):
set OPENAI_API_KEY=sk-...
set ANTHROPIC_API_KEY=sk-ant-...
To persist across sessions, add these to your shell profile (~/.bashrc, ~/.zshrc) or set them via Windows System Environment Variables.
Or use the interactive setup wizard (writes to ~/.webtomdrc):
webtomd --configure
The first available key is auto-detected in priority order: Anthropic > OpenAI > Gemini > Groq > Ollama.
If no key is configured, --ai modes gracefully fall back to plain Markdown output with a friendly message. Nothing breaks.
Configuration
Create ~/.webtomdrc (TOML format) for persistent defaults:
output_dir = "~/Documents/webtomd"
copy = false
metadata = false
silent = false
name_strategy = "deterministic"
ai_provider = "openai"
CLI flags always override config file values.
Location: ~/.webtomdrc resolves to:
- macOS/Linux:
/home/yourname/.webtomdrc - Windows:
C:\Users\YourName\.webtomdrc
Batch Mode
Create a text file with one URL per line (# comments supported):
# My reading list
https://example.com/article-1
https://example.com/article-2
https://example.com/article-3
webtomd --batch urls.txt
Each URL is processed independently with a live progress bar. Failures don't abort the batch. A summary is printed at the end.
Output Defaults
| Context | Behavior |
|---|---|
| Interactive terminal | Auto-saves .md file with generated name |
| Piped / non-interactive | Prints Markdown to stdout |
-o file.md |
Saves to the specified file |
--stdout |
Forces stdout in any context |
Troubleshooting
webtomd command not found:
- Ensure your Python
Scripts(Windows) orbin(macOS/Linux) directory is in your PATH - Alternative:
python -m webtomd
Encoding errors on Windows:
- webtomd handles UTF-8 output automatically, but if your terminal shows garbled characters, run
chcp 65001first or use Windows Terminal (recommended over cmd.exe)
Playwright not installing:
- Run
playwright install chromiumafter installing the playwright extra - On Linux, you may need system deps:
playwright install-deps chromium
Empty output on JS-heavy sites:
- Install Playwright:
pip install "webtomd[playwright]"thenplaywright install chromium - webtomd auto-detects when static extraction fails and falls back to headless Chromium
- Sites like Stripe, Notion, React.dev, Shopify all work with Playwright installed
Clipboard not working:
- macOS: works out of the box (
pbcopy) - Linux: install
xcliporxsel(sudo apt install xclip) - Windows: works out of the box
Slow conversion on certain sites:
- Some sites throttle or block automated requests. This is network-bound, not a tool issue
- Try
--selector "main"to skip heavy page processing
Contributing
git clone https://github.com/MrRaccooon/WebToMD.git
cd WebToMD
Setup (all platforms):
pip install uv # if you don't have uv
uv sync --extra dev
uv run pytest
Run lints:
uv run ruff check .
Run type checks:
uv run mypy webtomd/
License
GPL-3.0-or-later
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file webtomd-0.1.3.tar.gz.
File metadata
- Download URL: webtomd-0.1.3.tar.gz
- Upload date:
- Size: 734.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
548c55c52ea549fd6e8cee0aa62209faa3a4732424f1c90f705903cecafc2d04
|
|
| MD5 |
ab5a23f2a2d7c6263b5226501e07f503
|
|
| BLAKE2b-256 |
d2691dcdc77d2dd870b701fc4ec099bb6bfff89171235c8dee41f1ce1d84853d
|
File details
Details for the file webtomd-0.1.3-py3-none-any.whl.
File metadata
- Download URL: webtomd-0.1.3-py3-none-any.whl
- Upload date:
- Size: 63.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8fbb5f6afae51ab6e683f691e43b86ea339f2d181e4d1dc0d78302cefc2dbbc4
|
|
| MD5 |
753a5feb1e16cde9375ceadebc32fcfa
|
|
| BLAKE2b-256 |
a11cd55a231c87644c2f1d1f6d66f9431fec0be960a9d3643bbc304f88973b62
|